GPT-2 Artificial Intelligence Song Generator: Let’s Get Groovy

20 January 2020, 15:38

This blog post will show how we can use NLP’s magic to generate our own song lyrics. We will use the power of GPT-2, a large pre-trained model developed by OpenAI. The codebase can be found in this notebook (it’s in view-only mode, but you can make a copy to run it).

Special thanks to Thomas Dehaene, Koen Verschaeren, Mats Uytterhoeven and Anna Krogager for their contribution.

I. The model

GPT-2 has been the cool kid on the block of NLP models since its release in February 2019. It’s a pre-trained model, trained on a large corpus simply to predict the next word of a sentence, and it can then be fine-tuned on a smaller dataset to solve different problems; in our case, generating a bunch of legendary songs. It uses the transformer network architecture (introduced by Google in 2017), based on attention mechanisms.

The core strength of GPT-2, though, lies in the gigantic amount of data used to train it. The OpenAI team used around 40 GB worth of text extracted from the internet (namely, outgoing links from Reddit posts with a karma of at least 3). In comparison, all of Shakespeare’s work combined has an estimated size of about 5.6 MB. They trained 4 different models with 124M, 355M, 774M and 1.5B parameters respectively (more parameters means a more complex, larger and better-performing model).

II. The data

We used a music lyrics database extracted from this Kaggle Kernel. It contains 57,650 song lyrics, mostly in English. They were originally scraped from LyricsFreak. There are a lot of fun analyses to be performed on this data, such as sentiment analysis or calculating which artists have the widest vocabulary (spoiler alert: it’s the Wu-Tang Clan, with the Backstreet Boys being the most shallow of them all).
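As an aside, that vocabulary comparison is easy to sketch with pandas. The tiny DataFrame below is a made-up stand-in for the Kaggle file (assuming it has `artist` and `text` columns):

```python
import pandas as pd

# Toy stand-in for the Kaggle lyrics file (assumed columns: 'artist', 'text').
df = pd.DataFrame({
    "artist": ["A", "A", "B"],
    "text": ["la la la", "oh la", "every word here is different"],
})

# Count the distinct words each artist uses across all of their songs.
vocab = (df.groupby("artist")["text"]
           .apply(lambda songs: len(set(" ".join(songs).lower().split()))))

print(vocab.sort_values(ascending=False))
```

For the real data set you would also want to strip punctuation before splitting, and perhaps normalise for the number of songs per artist.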

III. The implementation

We implemented the code in Colaboratory, a tool from Google to run Python notebooks in your web browser. It’s free and lets you use GPU-accelerated processing, which makes fine-tuning the model much faster.

First, we load our libraries. We will use the gpt-2-simple library to conveniently play around with GPT-2. Note that TensorFlow 2.0 removed the “contrib” module, which gpt-2-simple needs. Therefore, we force Colab to use TensorFlow 1.x with the %tensorflow_version 1.x command.

%tensorflow_version 1.x
!pip3 -q install gpt-2-simple

import os
import pandas as pd
import gpt_2_simple as gpt2

We then load our data. It was saved in a public GitHub repository so it can conveniently be read into Colab:

dfs = []
link = ''  # base URL of the GitHub repository hosting the data (truncated in the original)
for i in range(4):
    # The lyrics were split into 4 chunks; the file naming here is hypothetical
    dfs.append(pd.read_csv(link + 'songdata_' + str(i) + '.csv'))

df = pd.concat(dfs).reset_index(drop=True)

We then save this file in a “content” folder on our VM. GPT-2 will load it directly from there. You can see the files saved on your VM on the left-hand side of your Colab notebook, under the folder tab.

if not os.path.exists('content'):
    os.makedirs('content')

pd.DataFrame({"lyrics": df['text']})\
    .to_csv(os.path.join('content', 'lyrics.csv'), index=False)

Next, we download the GPT-2 models. Note that you can restrict this list to the model(s) you will be using.

for model_name in ["124M", "355M", "774M"]:  # Choose from ["124M","355M","774M"]
    gpt2.download_gpt2(model_name=model_name)   # models are saved into the current directory under /models/<model_name>/

Now we set some hyperparameters and start a session.

learning_rate = 0.0001
optimizer = 'adam' # adam or sgd
batch_size = 1
model_name = "774M" # Has to match one downloaded locally
sess = gpt2.start_tf_sess()

Next, we start the fine-tuning. It can take a while (about 22 minutes for 500 steps with the 774M model). Some useful parameters:

  • restore_from: Set to fresh to start training from the base GPT-2, or set to latest to restart training from an existing checkpoint
  • sample_every: Number of steps to print example output
  • print_every: Number of steps to print training progress
  • learning_rate: Learning rate for the training. (default 1e-4, can lower to 1e-5 if you have <1MB input data)
  • run_name: Subfolder within checkpoint to save the model. This is useful if you want to work with multiple models (will also need to specify run_name when loading the model)
  • overwrite: Set to True if you want to continue finetuning an existing model (w/ restore_from='latest') without creating duplicate copies
  • steps: Number of training steps to perform

gpt2.finetune(sess,
              dataset=os.path.join('content', 'lyrics.csv'),
              model_name=model_name,
              batch_size=batch_size,
              learning_rate=learning_rate,
              optimizer=optimizer,
              restore_from='fresh',
              steps=500)   # max number of training steps

The artist is officially born…

You will see that the model prints a couple of sample generated texts every print_every steps. It is very nice to see how it evolves.

Something to keep in mind here is that your Colab session can expire during training (e.g. when using a large dataset or a high number of steps). However, you can pick your previous model back up from its last checkpoint by setting restore_from='latest'.

Once your model is trained, you can use the gpt2.generate command to produce some lyrics.

lst_results = gpt2.generate(sess,
                            temperature=0.8,  # change me
                            top_p=0.9,        # change me
                            nsamples=5,       # number of songs to generate
                            batch_size=5,
                            return_as_list=True)

for res in lst_results:
    print(res)
    print('\n -------//------ \n')

You can pass a prefix to the generate function to force the text to start with a given character sequence and continue from there (handy if your training data marks where each text starts).

The nsamples parameter allows you to generate multiple texts in one run. It can be combined with batch_size to compute them in parallel, giving the whole process a massive speedup (in Colaboratory, use at most 20 for batch_size).

Other optional-but-helpful parameters for gpt2.generate:

  • length: Number of tokens to generate (default 1023, the maximum)
  • temperature: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
  • top_k: Limits the generated guesses to the top k guesses (default 0 which disables the behaviour; if the generated output is super crazy, you may want to set top_k=40)
  • top_p: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with top_p=0.9)
  • truncate: Truncates the generated text at a given sequence, excluding that sequence (e.g. with truncate='<|endoftext|>', the returned text will include everything before the first <|endoftext|>). It may be useful to combine this with a smaller length if the input texts are short
  • include_prefix: If using truncate and include_prefix=False, the specified prefix will not be included in the returned text
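To make the truncate behaviour concrete, here is a plain-Python sketch (not gpt-2-simple code) of what cutting at the first <|endoftext|> marker amounts to:

```python
def truncate_at(text, marker="<|endoftext|>"):
    # Keep everything before the first occurrence of the marker,
    # excluding the marker itself.
    return text.split(marker)[0]

raw = "Verse one of the song<|endoftext|>Unrelated second sample"
print(truncate_at(raw))  # → "Verse one of the song"
```

This matters because the model keeps generating until it hits the length limit, so a single sample can contain several “songs” separated by the marker.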

IV. The songs

We fine-tuned the 774M model for 500 steps and used it to generate 100 songs. The entire set of masterpieces can be found here. Our artist had a lot of inspiration, leading to some interesting songs (we had to come up with titles for them ourselves):

  • “The story of Atlas V”: Here the bot was slightly dramatic, and developed a weird fixation on “Atlas V”, even though that phrase cannot be found in the training data set…
  • “Mmmh”: In this song, the artist gets stuck in a suave showcase of its pretty voice, repeating “Mmmh” the entire time. In text generation it is pretty common for the model to get stuck in a loop and repeat the same word (or series of words) over and over again. Our training data likely amplified that effect, as many songs repeat a single word.
  • “No Uber on the Highway”: Our artist tells about the time the truck broke down on the highway. Not a great memory apparently.
  • “Do you have the keys of the kingdom, yo?”: Starting with a “What’s my name?”, one could think that our artist was hyping up the crowd for a sick hip-hop solo. However, it quickly evolves into some intense questioning, under the suspicion that somebody might be hiding the keys of a certain kingdom. Then it just gets very melancholic and intense.

Overall, the results look pretty good. It seems that our artist has some nice poetic inspiration. Yet, it is worth checking these masterpieces for plagiarism, that is, to see whether our artist didn’t just copy some songs (or parts of songs) from the training data set. For that, we split each of its 100 songs into lists of sequential N-grams (i.e. N consecutive words). We then check how many of these N-word sequences can be found in the original data set. The following figure plots the result.

For instance, we checked and the 5-gram “why did you love me?” (from the track Do you have the keys of the kingdom, yo?) already exists as is in the training set. It is part of the ~35% of repeated 5-grams. However, the rest of that track is an original work from our artist. As expected, we get fewer repetitions for larger values of N. In total it seems our artist has a fair amount of inspiration and originality. It is worth noting that increasing the temperature parameter when generating texts should decrease these values.
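The plagiarism check above boils down to a set intersection of N-grams. A minimal sketch, with made-up one-line lyrics standing in for the real data:

```python
def ngrams(text, n):
    # All runs of n consecutive (lower-cased) words in the text.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

training_lyrics = "why did you love me like that"
generated_lyrics = "why did you love me so much more"

gen = ngrams(generated_lyrics, 5)
overlap = len(gen & ngrams(training_lyrics, 5)) / len(gen)
print(f"{overlap:.0%} of generated 5-grams already appear in the training set")
```

On the real data you would build the training-set N-grams once (they fit comfortably in a Python set) and query each generated song against them.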

Another thing that surprised us is how dramatic the 100 songs can be. Just scrolling through them shows a lot of melancholy from our artist. Most likely, the data set just happens to contain more sad songs than happy ones. It might be worth filtering the training set with some sentiment analysis to train a happy singing bot. Some further fun work could also be to train hip-hop/metal/pop bots.
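As a rough illustration of that filtering idea, here is a toy word-list heuristic; the word lists and the threshold are made up, and a real sentiment model or library would do much better:

```python
# Made-up mini lexicons for illustration only.
HAPPY = {"love", "happy", "sunshine", "dance"}
SAD = {"cry", "tears", "alone", "goodbye"}

def is_happy(lyrics):
    # Crude lexicon score: count of happy words minus count of sad words.
    words = lyrics.lower().split()
    return sum(w in HAPPY for w in words) > sum(w in SAD for w in words)

songs = ["dance in the sunshine with me",
         "tears fall when you say goodbye"]
happy_training_set = [s for s in songs if is_happy(s)]
```

The surviving songs would then be saved to a new CSV and used as the fine-tuning data set, exactly as before.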

Finally, we trained several other GPT-2 models on different data sets. That is, we trained models to:

  • Generate speeches from His Majesty King Philippe I of Belgium (in French and Dutch)
  • Generate Antwerp Hip-Hop
  • Generate product descriptions

It led to interesting results. However, those datasets were all much smaller than the English songs data set described in this blog post. Hence, the texts generated were not as good.

So our final tip is to try to find a large and good quality database, then just have fun generating a lot of texts.