
Patterns Of Composing Music Using Deep Learning

Abstract: Generating a complex work of art such as a musical composition requires true creativity, which depends on a variety of factors related to the hierarchy of musical language. Music generation has been addressed with algorithmic methods and, more recently, with Deep Learning models that are also used in other fields such as Computer Vision. In this paper we put into context the relationships between AI-based music composition models and human musical composition and creativity processes. We give an overview of recent Deep Learning models for music composition and compare these models to the music composition process from a theoretical point of view. We also address some of the most relevant open questions for this task, such as whether current Deep Learning models can generate music with creativity and how similar the AI and human composition processes are, among others.


Patent Information

Application #
Filing Date
17 October 2022
Publication Number
42/2022
Publication Type
INA
Invention Field
ELECTRONICS
Status
Email
registrar@geu.ac.in
Parent Application

Applicants

Registrar
Graphic Era Deemed to be University, Dehradun, Uttarakhand 248002, India.

Inventors

1. Dr. Surendra Kumar Shukla
Associate Professor, Department of Computer Science & Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India, 248002
2. Dr. Kumud Pant
Associate Professor, Department of Biotechnology, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India, 248002
3. Mr. Navin Garg
Associate Professor, Department of Computer Science & Engineering, Graphic Era Hill University, Dehradun, Uttarakhand, India, 248002
4. Dr. Vikas Tripathi
Associate Professor, Department of Computer Science & Engineering, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India, 248002

Specification

FIELD OF THE INVENTION
This invention relates to convolutional neural networks for processing music, and in particular to discovering latent embeddings in music based on hyper-images extracted from raw audio waveforms.
BACKGROUND OF THE INVENTION
Music is generally defined as a succession of pitches or rhythms, or both, in some definite patterns. Music composition (or generation) is the process of creating or writing a new piece of music. The term music composition can also refer to an original piece or work of music. Music composition requires creativity, the unique human capacity to understand and produce an indefinitely large number of sentences in a language, most of which have never been encountered or spoken before [2]. This is a very important aspect that needs to be taken into account when designing or proposing an AI-based music composition algorithm. More specifically, music composition is an important topic in the Music Information Retrieval (MIR) field. It comprises subtasks such as melody generation, multi-track or multi-instrument generation, style transfer, and harmonization. These aspects are covered here from the point of view of the multitude of AI- and DL-based techniques that have flourished in recent years.
SUMMARY OF THE INVENTION
Overall, we have applied a variety of deep learning methods to the music generation problem with varying levels of success. Our baseline method used a recurrent neural network model for both a single track and multiple tracks. While this model was more successful with regard to the musicality of the produced notes, it was very limited in utility because it could only produce notes on the quarter-note beats. We then moved on to convolutional neural network models, using a vanilla CNN to produce the piano track and a conditional CNN that used the piano track to produce the other instrument tracks. We found the arrangements produced by the CNN models to be much more well-formed and coherent because of the conditional modelling. The novel VAE-based architecture that we devised was the most successful contribution of our project. By encoding sequences into a latent space using a VAE, we can add noise in the latent space to increase the variation of the generated output in a controllable way while maintaining similarity to the previous sequence, ultimately improving the uniqueness of our generated music.

BRIEF DESCRIPTION OF THE INVENTION
Music generation using deep learning techniques has been a topic of interest for the past two decades. Music poses a different challenge compared to images, along three main dimensions. Firstly, music is temporal, with a hierarchical structure and dependencies across time. Secondly, music consists of multiple instruments that are interdependent and unfold across time. Thirdly, music is grouped into chords, arpeggios and melodies, so each time step may have multiple outputs. However, audio data has several properties that make it similar in some ways to the data conventionally studied in deep learning (computer vision and natural language processing, or NLP). The sequential nature of music is reminiscent of NLP, for which Recurrent Neural Networks can be used. There are also multiple 'channels' of audio (in terms of tones and instruments), reminiscent of the images for which Convolutional Neural Networks can be used. Additionally, deep generative models are an exciting new area of research with the potential to create realistic synthetic data; examples include Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), as well as language models in NLP. Most early music generation techniques used Recurrent Neural Networks (RNNs), which naturally incorporate dependencies across time; LSTMs were used to generate single-instrument music in the same fashion as language models, and the same method was later adapted to generate lo-fi music.
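To make the representation concrete, the snippet below is an illustrative sketch only (Python/NumPy): it shows how a single bar can hold several simultaneous outputs per time step across several instruments. The 5 x 32 x 128 shape follows the pianoroll representation used later in this description; the specific notes and the instrument ordering are assumptions.

import numpy as np

# 5 instruments, 32 time steps, 128 MIDI pitches (binary pianoroll)
bar = np.zeros((5, 32, 128), dtype=np.float32)
bar[0, 0, [60, 64, 67]] = 1.0   # e.g. the piano track plays a C major chord (C-E-G) at step 0
bar[3, 0, 36] = 1.0             # e.g. the bass track plays a low C at the same step (track order assumed)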
Recently, Convolutional Neural Networks (CNNs) have been used to generate music with great success; in 2016 DeepMind showed the effectiveness of WaveNet, which uses dilated convolutions to generate raw audio. Yang (2017) created MidiNet, which uses Deep Convolutional Generative Adversarial Networks (DCGANs) to generate multi-instrument music sequences that can be conditioned both on the previous bar's music and on the chord of the current bar. The GAN concept was taken further by Dong in 2017 with MuseGAN, which uses multiple generators to produce synthetic multi-instrument music that respects the dependencies between instruments; Dong used the Wasserstein GAN with Gradient Penalty (WGAN-GP) for greater training stability. Lastly, as the latest advances in NLP have been made with attention networks and transformers, similar attempts have been made to apply transformers to music generation; Shaw (2019) created MusicAutobot, which uses a combination of BERT, Transformer-XL and Seq2Seq to create a multi-task engine that can both generate new music and create harmony conditioned on other instruments. Our data come from the Lakh Pianoroll Dataset, a collection of 174,154 multitrack pianorolls derived from the Lakh MIDI Dataset and curated by the Music and AI Lab at the Research Center for IT Innovation, Academia Sinica. We used the LPD-5 version of the dataset, which includes tracks for piano, drums, guitar, bass, and strings, allowing us to generate complex and rich music and to demonstrate the ability of our generative models to arrange music across different instruments. We used the cleansed subset of the Lakh Pianoroll Dataset, which includes 21,245 MIDI files. Each file had corresponding metadata, allowing us to determine information about each file such as the artist and title. To establish a baseline of music generation that we could improve on, we used recurrent neural networks (RNNs), an existing and easily replicable method. Generating music is formulated as a next-note prediction problem (this method is very similar to the recurrence-based language models used in NLP).
This would allow us to generate as much music as we wanted by continuously passing the generated note back into the model. In terms of implementation, we used the Gated Recurrent Unit (GRU) instead of the vanilla RNN because of its better ability to retain long-term dependencies. Each GRU cell would take in the previous time step's activation and the current input, and output a prediction of the next note given them. To create the data needed to train our recurrent neural network, we first parsed the piano notes of our dataset, representing each file as a list of the notes found in the file. We then created the training input sequences by taking subsets of the list representation of each song, and created the corresponding training output sequences by simply taking the next note after each subset. With this training input and output, the model was trained to predict the next note, which then allowed us to pass in any sequence of notes and get a prediction of the next note. Each input sequence was passed into an embedding layer that created embeddings of size 96. This embedding was then passed into a single-layer gated recurrent unit, whose output was passed to a fully connected layer to output a probability distribution over the next note. We could pick the note with the highest probability as the next predicted note, but that would lead to deterministic sequences with no variation; hence, we sample the next note from a multinomial distribution with the output probabilities. While the RNN next-note prediction model is easy and clean to implement, the generated music sounds far from ideal and its utility is very limited. Because we encode every single note into a token and predict a probability distribution over the encodings, we can only really do this for one instrument, because for multiple instruments the number of combinations of notes increases exponentially. Also, the assumption that every note is of the same length does not reflect most musical works.
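The following is a minimal sketch of this baseline, assuming PyTorch. The vocabulary size, hidden width and window length of 32 are assumptions; the 96-dimensional embedding, single-layer GRU, fully connected output layer and multinomial sampling follow the description above.

import torch
import torch.nn as nn

def make_training_pairs(note_ids, seq_len=32):
    # Slice a song (list of integer note/chord ids) into input sequences and
    # next-note targets, as described above. seq_len is an assumed window size.
    inputs, targets = [], []
    for i in range(len(note_ids) - seq_len):
        inputs.append(note_ids[i:i + seq_len])
        targets.append(note_ids[i + seq_len])
    return torch.tensor(inputs), torch.tensor(targets)

class NextNoteRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=96, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, seq):                   # seq: (batch, seq_len) of note ids
        h, _ = self.gru(self.embed(seq))      # (batch, seq_len, hidden_dim)
        return self.fc(h[:, -1])              # logits over the next note

# Sample the next note from a multinomial distribution rather than taking the
# argmax, so that generation is not deterministic.
model = NextNoteRNN(vocab_size=500)           # vocabulary size is an assumption
seq = torch.randint(0, 500, (1, 32))
probs = torch.softmax(model(seq), dim=-1)
next_note = torch.multinomial(probs, num_samples=1)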
MULTI-INSTRUMENT RNN
Hence, we sought other methods to generate music for multiple instruments at the same time, and came up with the multi-instrument RNN. Instead of encoding the music into unique notes and chords as in the initial approach, we worked directly with the 5 x 128 multi-instrument pianoroll at each time step, flattening it into a 640-dimensional vector that represents the music at every time step. We then trained an RNN to predict the next time step's 640-dimensional vector, given the previous length-32 sequence of 640-dimensional vectors. While this method makes sense in theory, it was challenging to produce satisfactory results because of the difficulty of generating variety that was complementary across all the instruments. In the single-instrument set-up, we sampled from a multinomial distribution weighted by the output softmax scores to generate the next note. However, since all instruments are placed together in the 640-dimensional vector, generating the next note using softmaxed scores over the entire 640-dimensional vector could mean that some instruments receive multiple notes while others receive none. We attempted to solve this problem by running the softmax function separately over each of the five instruments' 128-dimensional vectors, so that a certain number of notes is generated for each instrument, as sketched below. However, this meant that the sampling for each instrument was independent of the others, so the generated piano sequence would not be complementary to the other instruments' sequences. For example, if the C-E-G chord is sampled for the piano, the bass has no way of incorporating this and could sample the D-F-A chord, which is harmonically dissonant and not complementary. Moreover, there was the problem of not knowing how many notes to sample for each instrument at each time step. This problem was not present in the single-instrument set-up because single notes and multi-note chords are all encoded as integer representations. We addressed this issue by sampling a specified number of notes for each instrument at each time step (e.g. 2 for piano, 3 for guitar) from the multinomial, but this was unsuccessful, as the generated music sounded highly random and unmusical.
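A minimal sketch of this per-instrument sampling workaround, assuming PyTorch; the per-instrument note counts and the instrument ordering used here are assumptions.

import torch

def sample_timestep(logits_640, notes_per_instrument=(2, 1, 3, 1, 2)):
    # logits_640: tensor of shape (640,) = 5 instruments x 128 pitches.
    step = torch.zeros(5, 128)
    for i, n_notes in enumerate(notes_per_instrument):
        # Softmax over this instrument's 128-pitch slice only, then sample a
        # fixed number of pitches for it, independently of the other instruments.
        pitch_probs = torch.softmax(logits_640[i * 128:(i + 1) * 128], dim=0)
        pitches = torch.multinomial(pitch_probs, num_samples=n_notes)
        step[i, pitches] = 1.0
    return step  # binary pianoroll slice for one time step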
MOVING FROM RECURRENT TO CONVOLUTIONAL
From this point on, we decided to focus on Convolutional Neural Networks rather than RNNs to generate sequences of music. The CNN directly generates a length-32 sequence by outputting a 5 x 32 x 128 three-dimensional tensor, which solves the problems of not knowing how many notes to generate and of having to use multinomial sampling. CNN architectures such as WaveNet have been shown to achieve performance as good as, if not better than, RNNs in sequence generation, and they are much faster to train thanks to performance optimizations of convolutional operations. In order to generate multiple instrument tracks that are compatible with each other, we tried a two-part generation model that comprises a MelodyCNN for next-time-step melody generation and a Conditional-HarmonyCNN that generates each non-piano instrument given the melody for the same time step and that instrument's music for the previous time step. Since the input and output sizes are the same (32 x 128), the MelodyCNN architecture used was symmetric, with 3 convolutional layers, 3 dense layers and 3 deconvolutional layers. The Conditional-HarmonyCNN used 3 convolutional layers for each of the inputs (the piano sequence as well as the instrument's previous sequence), then concatenated the resulting tensors before passing them through dense and deconvolutional layers. Hence, the MelodyCNN learns a mapping between piano sequences in successive time steps, while the Conditional-HarmonyCNNs map from the piano music space to the other instruments. Using the 5 CNNs in total (one for each instrument), new music can be generated iteratively from a starting multi-instrument sequence: first the MelodyCNN is used to predict the next piano sequence, and then the Conditional-HarmonyCNNs are used to predict the other instruments. This framework was successful in generating multi-instrument music sequences in which the instruments sound musically complementary. However, varying the starting sequence from which the music is generated led to only very little variation in the generated music: the three generated sequences were nearly identical to each other. This shows that the CNNs likely converged on outputting only a small subset of common sequences in the training data that minimized the training loss. Another method is needed to generate some variety in the output music given the same input, and to achieve this we turn to VAEs.
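The following is a hedged sketch of the MelodyCNN / Conditional-HarmonyCNN pair, assuming PyTorch. The description above fixes only the layer counts (3 convolutional, 3 dense, 3 deconvolutional) and the 32 x 128 input/output size; the channel widths, kernel sizes and strides shown here are assumptions.

import torch
import torch.nn as nn

def conv_stack():
    return nn.Sequential(                                # (B, 1, 32, 128) -> (B, 4096)
        nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.Flatten())                                    # 64 * 4 * 16 = 4096 features

def deconv_stack():
    return nn.Sequential(                                # (B, 64, 4, 16) -> (B, 1, 32, 128)
        nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

class MelodyCNN(nn.Module):
    # Maps the previous 32 x 128 piano slice to the next one.
    def __init__(self):
        super().__init__()
        self.enc = conv_stack()
        self.dense = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU(),
                                   nn.Linear(1024, 1024), nn.ReLU(),
                                   nn.Linear(1024, 4096), nn.ReLU())
        self.dec = deconv_stack()

    def forward(self, piano_prev):                       # (B, 1, 32, 128)
        h = self.dense(self.enc(piano_prev))
        return self.dec(h.view(-1, 64, 4, 16))

class ConditionalHarmonyCNN(nn.Module):
    # Predicts an instrument's next slice from the new piano slice and the
    # instrument's own previous slice.
    def __init__(self):
        super().__init__()
        self.enc_piano, self.enc_inst = conv_stack(), conv_stack()
        self.dense = nn.Sequential(nn.Linear(2 * 4096, 1024), nn.ReLU(),
                                   nn.Linear(1024, 1024), nn.ReLU(),
                                   nn.Linear(1024, 4096), nn.ReLU())
        self.dec = deconv_stack()

    def forward(self, piano_next, inst_prev):
        h = torch.cat([self.enc_piano(piano_next), self.enc_inst(inst_prev)], dim=1)
        return self.dec(self.dense(h).view(-1, 64, 4, 16))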
USING VARIATIONAL AUTOENCODERS (VAES)
A variational autoencoder (VAE) is an autoencoder whose training is regularized to ensure that the latent space has good properties that enable a generative process. Two such properties are continuity (close points in latent space should give similar points once decoded) and completeness (a point sampled from the latent space should give meaningful content once decoded). A vanilla autoencoder encodes the inputs into a vector in latent space, but there is no guarantee that the latent space satisfies the continuity and completeness that allow new data to be generated. In contrast, a VAE encodes an input as a distribution over the latent space. Specifically, we assume the latent variables to be Gaussian-distributed, so having the encoder encode a distribution is equivalent to having it output the mean and standard deviation parameters of a normal distribution. To train the VAE, a two-term loss function is used: a reconstruction error (the difference between the decoded outputs and the inputs) and a regularization term (the KL divergence between the latent distribution and a standard Gaussian) that pushes the latent distribution to be as close to the standard normal as possible. We hence apply VAEs to the music generation task, as sketched below.
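A minimal sketch of such a per-instrument VAE and its two-term loss, assuming PyTorch. The encoder and decoder widths are assumptions; the 16-dimensional latent space matches the dimensionality eventually chosen below.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PianorollVAE(nn.Module):
    def __init__(self, latent_dim=16, input_dim=32 * 128):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(input_dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)
        self.logvar = nn.Linear(512, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, input_dim), nn.Sigmoid())

    def forward(self, x):                                  # x: (B, 32, 128) binary pianoroll
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z).view(-1, 32, 128), mu, logvar

def vae_loss(recon, x, mu, logvar, beta=1.0):
    # Reconstruction error plus KL divergence to the standard Gaussian.
    recon_err = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + beta * kl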
The previous piano input is encoded by the piano VAE into a K-dimensional latent piano encoding z_t. Then, random noise is added to the encoded latent distribution's mean parameters; the standard deviation of this noise is a hyperparameter that the user can tune based on the amount of variation desired. The perturbed latent parameters are then input to MelodyNN, a multi-layer perceptron that learns a mapping from the previous piano sequence's latent distribution to the next piano sequence's latent distribution. The output z_t+1 is then decoded to become the generated next piano output. Instrument-specific VAEs are also trained on the other four instruments (guitar, bass, strings, drums). Then, similarly to the Conditional-HarmonyCNN earlier, we use a ConditionalNN, another MLP that takes in the generated next-period piano latent parameters z_t+1 as well as the previous-period guitar latent parameters w_t, and learns a mapping to the next-period guitar latent parameters w_t+1. w_t+1 is then decoded by the instrument-specific VAE's decoder to produce the next-period guitar output. Four ConditionalNNs are trained, one for each non-piano instrument, which allows the next 5-instrument sequence to be generated. Hence, by mapping the musical inputs into latent distributions with VAEs, we can introduce variation into the generated music output by adding random noise to the encoded latent distribution's parameters. Continuity ensures that, after adding random noise, the decoded outputs are similar to yet different from the original inputs, and completeness ensures that they give meaningful music outputs that are similar to the input music distribution. VAEs of latent dimensionality 8, 16, 32 and 64 were trained. In the end, a 16-dimensional latent space was used to train the ConditionalNNs, since the music samples are relatively sparse in music space. After training the ConditionalNNs, we found that the VAE+NN method is successful in creating multi-instrument outputs that sound coherent and that have an appropriate amount of variation to be aesthetically pleasing. Random noise with standard deviations between 0.5 and 1.0 was found to generate the best amount of variation. The VAE-NN framework explained above also gives us a straightforward way to generate music based on specific styles, such as a certain artist, genre, or year. For example, if we wanted to generate music in the style of Thriller by Michael Jackson, we could: break the song into 32-step sequences and encode each sequence's pianoroll into the latent space using each instrument's VAE encoder; store the unique sequences in a set for each instrument; and, when generating music from a starting sequence, sample one latent vector per instrument from this set. This sampled latent vector s (from our desired song) is then interpolated with the previous sequence's latent vector z to generate a new latent vector

z' = a * s + (1 - a) * z

with a being the latent sample factor, a hyperparameter that can be tuned (higher values of a make the generated music more strongly conditioned on the desired style). We then use z' instead of z as the input to the MelodyNN to generate the new latent vector and hence the generated piano sequence. Using this method and a = 0.5, we generated new music conditioned on several songs, for example Thriller by Michael Jackson and I Want It That Way by Backstreet Boys. This was successful in generating audio samples that have some similarity to the original song, but with some variation as well (once again, the extent of variation can be tuned with the noise_sd hyperparameter). One can even generate music based on samples that are a hybrid of different artists or styles, allowing music enthusiasts to synthesize music combining the styles of different music stars.
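The sketch below ties these pieces together for one generation step and for style conditioning, assuming PyTorch and reusing the PianorollVAE sketch above. The MelodyNN and ConditionalNN widths, the function names, and the use of the latent means as the working representation are assumptions; noise_sd and the latent sample factor a follow the description above.

import random
import torch
import torch.nn as nn

latent_dim = 16
melody_nn = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                          nn.Linear(64, latent_dim))            # maps z_t to z_t+1
conditional_nn = nn.Sequential(nn.Linear(2 * latent_dim, 64), nn.ReLU(),
                               nn.Linear(64, latent_dim))       # maps (z_t+1, w_t) to w_t+1

def generate_step(piano_vae, inst_vae, piano_prev, inst_prev, noise_sd=0.75):
    _, z_t, _ = piano_vae(piano_prev)                 # latent mean of the previous piano slice
    z_t = z_t + noise_sd * torch.randn_like(z_t)      # inject controllable variation
    z_next = melody_nn(z_t)                           # next-period piano latent
    piano_next = piano_vae.dec(z_next).view(-1, 32, 128)
    _, w_t, _ = inst_vae(inst_prev)                   # previous latent for one other instrument
    w_next = conditional_nn(torch.cat([z_next, w_t], dim=1))
    inst_next = inst_vae.dec(w_next).view(-1, 32, 128)
    return piano_next, inst_next

def condition_on_style(z_prev, style_latents, a=0.5):
    # style_latents: encoded 32-step sequences from the target song (hypothetical set).
    s = random.choice(style_latents)
    return a * s + (1 - a) * z_prev                   # z' = a*s + (1-a)*z, per the formula above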
The Transformer-XL architecture provides techniques to address the limited context length and context fragmentation problems of standard transformers. First, it has a segment-level recurrence mechanism: while training, the representations computed for the previous segment are cached so that they can be used as an extended context when the model processes the next segment. Information can thus flow across segment boundaries, which also solves the context fragmentation problem. Secondly, it has a relative positional encoding scheme, which allows the model to understand not only the absolute position of each token but also the position of each token relative to the others, which is extremely important in music.
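A minimal sketch of the segment-level recurrence mechanism only, assuming PyTorch: it illustrates caching the previous segment's representations and attending over them as extended context. It is not a full Transformer-XL and does not include the relative positional encoding scheme; the embedding size, head count and segment length are assumptions.

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

def process_segment(segment, memory):
    # segment: (B, L, 128) token embeddings; memory: cached states from the previous segment.
    context = torch.cat([memory, segment], dim=1) if memory is not None else segment
    out, _ = attn(query=segment, key=context, value=context)   # attend over memory + current segment
    new_memory = out.detach()          # cache this segment's representations for the next call
    return out, new_memory

memory = None
for segment in torch.randn(4, 1, 32, 128):   # four consecutive length-32 segments
    out, memory = process_segment(segment, memory)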

We Claim:

1. In terms of implementation, we used the Gated Recurrent Unit (GRU) instead of the vanilla RNN, because of its better ability to retain long-term dependencies.
2. The sequential nature of music is reminiscent of NLP, for which Recurrent Neural Networks can be used.
3. Each GRU cell takes in the previous time step's activation and the current input, and outputs a prediction of the next note given them.
4. Two such properties are continuity (close points in latent space should give similar points once decoded) and completeness (a point sampled from the latent space should give meaningful content once decoded).
5. Most early music generation techniques used Recurrent Neural Networks (RNNs), which naturally incorporate dependencies across time.
6. LSTMs are used to generate single-instrument music in the same fashion as language models, and the same method is adapted to generate lo-fi music.
7. While training, representations computed for the previous segment are cached such that they can be used as an extended context when the model processes the next segment.
