Jukebox is a machine learning framework, developed by OpenAI, capable of generating music as raw audio in a range of genres and musical styles. Provided with a genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch.
Synthesizing songs at the audio level is difficult because of the sheer length of the sequences: a typical 4-minute song at CD quality (44.1 kHz, 16-bit) contains over 10 million timesteps. As a result, capturing the high-level semantics of music requires models that can handle extremely long-term dependencies.
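The sequence-length arithmetic is easy to check; a quick sketch (not Jukebox code, just the back-of-the-envelope calculation):

```python
# A 4-minute song sampled at CD quality: one timestep per audio sample.
sample_rate = 44_100      # samples per second (44.1 kHz)
duration_s = 4 * 60       # 4 minutes in seconds

timesteps = sample_rate * duration_s
print(timesteps)          # 10584000 -- over 10 million timesteps
```

For comparison, a typical language model processes sequences of a few thousand tokens, which is why raw audio demands aggressive compression before modeling.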
Jukebox tackles this by using an autoencoder, a type of neural network that compresses raw audio into a lower-dimensional space by discarding less important information. The model then learns to produce new music in this compressed space, and the result is then upsampled back to the original raw-audio space.
Jukebox employs a form of autoencoder known as the Vector Quantized Variational AutoEncoder (VQ-VAE). The model compresses raw audio at three levels, by factors of 8x, 32x, and 128x. The bottom-level encoding (8x) yields the most faithful reconstruction from its discrete "music codes," while the top-level encoding (128x) retains only the most essential musical information, such as pitch, timbre, and volume.
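The core of a VQ-VAE is the quantization step: each continuous encoder output is snapped to its nearest entry in a learned codebook, and the entry's index becomes the discrete "music code." A minimal sketch of that lookup (the codebook size and latent dimension here are illustrative, not Jukebox's actual hyperparameters):

```python
import numpy as np

# Toy vector-quantization step: replace each encoder output vector
# with the index of its nearest codebook entry.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(2048, 64))   # 2048 learned code vectors, dim 64
latents = rng.normal(size=(100, 64))     # encoder outputs for 100 audio frames

# Squared Euclidean distance from every latent to every codebook entry.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
codes = dists.argmin(axis=1)             # discrete "music codes", shape (100,)

print(codes.shape)
```

The decoder would map these integer codes back through the codebook to reconstruct audio; the 8x, 32x, and 128x levels differ in how many audio samples each code represents.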
Jukebox employs a set of prior models to produce music in the compressed space. These models learn the distribution of music codes and use it to generate new ones. Because the top-level prior captures the long-range structure of music, samples decoded from it alone have poor audio quality but preserve high-level semantics such as singing and melody. The middle and bottom upsampling priors then improve audio quality by adding local musical structure such as timbre.
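The generation flow described above can be sketched with stand-in priors (hypothetical functions, not the real model: the actual priors are large autoregressive transformers): a top-level prior samples coarse codes, and each upsampling prior generates finer codes conditioned on the level above.

```python
import random

def sample_prior(n, conditioning=None):
    # Stand-in for an autoregressive prior over discrete music codes.
    return [random.randrange(2048) for _ in range(n)]

def upsample(coarse_codes, factor=4):
    # Stand-in upsampling prior: `factor` finer codes per coarse code,
    # each conditioned on the coarse code it refines.
    fine = []
    for c in coarse_codes:
        fine.extend(sample_prior(factor, conditioning=c))
    return fine

top = sample_prior(16)      # top level (128x): long-range structure
middle = upsample(top)      # middle level (32x): adds local structure
bottom = upsample(middle)   # bottom level (8x): best reconstruction quality
print(len(top), len(middle), len(bottom))   # 16 64 256
```

The 4x expansion factor here mirrors the ratios between the 128x, 32x, and 8x compression levels; the bottom-level codes are what finally get decoded back to raw audio.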
In brief, Jukebox compresses raw audio into a lower-dimensional space with an autoencoder, learns to generate music in that space, and then upsamples the result back to raw audio. It compresses audio at multiple levels with the VQ-VAE and generates music in the compressed space with a hierarchy of prior models.
To train the model, OpenAI scraped the web to curate a dataset of 1.2 million songs (600,000 of them in English), paired with the corresponding lyrics and metadata from LyricWiki. The metadata includes the artist, album, genre, and year of each song, along with common moods or playlist keywords associated with it. The development team trained the model on 32-bit, 44.1 kHz raw audio and performed data augmentation (a technique that artificially enlarges a dataset by applying transformations to existing data, improving model performance and preventing overfitting) by randomly downmixing the left and right channels to produce mono audio.
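A random downmix like the one described might look as follows (a sketch under assumptions: the source does not specify the exact scheme, so this version randomly keeps one channel or averages both):

```python
import numpy as np

rng = np.random.default_rng(42)
stereo = rng.uniform(-1.0, 1.0, size=(2, 44_100))  # 1 second of stereo audio

def random_downmix(audio, rng):
    """Randomly downmix a (2, T) stereo signal to a (T,) mono signal."""
    choice = rng.integers(3)
    if choice == 0:
        return audio[0]            # keep the left channel
    if choice == 1:
        return audio[1]            # keep the right channel
    return audio.mean(axis=0)      # average both channels

mono = random_downmix(stereo, rng)
print(mono.shape)   # (44100,)
```

Randomizing the downmix means the model sees slightly different versions of the same song across epochs, which is the point of the augmentation.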
OpenAI's researchers have achieved substantial success with the Jukebox model, which represents a major advance in the quality, coherence, and duration of generated musical samples. Despite these advancements, there is still a significant gap between these machine-generated songs and those made by human artists.
Although Jukebox can generate local musical coherence and follow standard chord patterns, broader musical structures such as choruses are typically missing. Furthermore, the downsampling and upsampling processes can introduce audible noise, prompting the researchers to investigate ways to reduce this noise and improve the quality of the resulting music.
To overcome these constraints, the OpenAI team is focusing on improving the Vector Quantized Variational AutoEncoder (VQ-VAE) model used by Jukebox in order to capture more musical information while minimizing noise during the downsampling and upsampling processes. They are also exploring ways to speed up autoregressive sampling, which currently limits Jukebox's use in interactive applications. One possible option is to convert the model into a parallel sampler, which might significantly accelerate the sampling process.
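The reason autoregressive sampling is slow can be seen in a toy sketch (not Jukebox's implementation): each new code depends on all previously generated codes, so generation is a strictly sequential loop, one forward pass per token, over what would be millions of steps for a full song.

```python
def generate(model_step, n_steps):
    # Sequential generation: each step conditions on everything so far,
    # so the loop cannot be parallelized across timesteps.
    codes = []
    for _ in range(n_steps):
        codes.append(model_step(codes))  # one model forward pass per code
    return codes

# Stand-in "model": the next code is a simple function of context length.
out = generate(lambda ctx: len(ctx) % 2048, 10)
print(out)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

A parallel sampler would instead produce many codes per forward pass, trading some modeling fidelity for a large speedup, which is what makes it attractive for interactive use.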
Another critical area of effort for the OpenAI team is expanding the number of languages and musical genres supported by Jukebox. While the model is presently trained mostly on English lyrics and Western music, the researchers want to incorporate songs from other languages and regions of the world in the future.
Despite the gap between machine-produced and human-made music, the debate over intellectual property rights and ownership of generated music continues. While Jukebox was designed to produce original music that does not infringe on existing copyrights, concerns have been raised about possible misuse of the technology. There are also open questions about who owns the music Jukebox produces and how those rights will be managed and enforced.
To sum up, OpenAI's Jukebox is an impressive tool for creating synthetic music in a variety of genres and musical styles. It compresses raw audio into a lower-dimensional space with an autoencoder and then learns to generate new music in this compressed space. The model was trained on a dataset of 1.2 million songs and showed considerable improvements in the quality, coherence, and duration of the synthesized musical samples. However, there are still obstacles to overcome, such as reducing noise during the downsampling and upsampling processes and increasing the number of supported languages and musical genres. There is also an ongoing discussion about intellectual property rights and ownership of generated music. Overall, Jukebox marks a significant advancement in the realm of music creation, and it will be interesting to see how it evolves in the future.