07 September 2020, 16:31
How we trained a large-scale keyword-to-text model for composing real estate deeds in Dutch.
Generating text with artificial neural networks, or neural text generation, has become very popular over the last couple of years. Large-scale transformer language models such as OpenAI’s GPT-2/3 have made it possible to generate very convincing text that looks like it could have been written by a human. (If you haven’t tried it out yourself already, I highly recommend checking out the Write With Transformer page.)
While this causes a lot of concern about potential misuse of the technology, it also brings with it a lot of potential. Many creative applications have already been built using GPT-2, such as the text-based adventure game AI Dungeon 2. The key idea behind such applications is to fine-tune the language model on your own dataset, which teaches the model to generate text in line with your own specific domain. Make sure to check out this blog post about how my colleagues at ML6 used this approach for generating song lyrics.
However, it is often difficult to generate text of comparable quality in languages other than English, since few large pre-trained models are available for those languages. Training such models from scratch is prohibitively expensive for most people due to the extreme amount of compute required (as well as the need for a large enough dataset).
At ML6 we wanted to experiment with large-scale Dutch text generation on a unique dataset. From our collaboration with Fednot (the Royal Federation of Belgian Notaries), we created a large dataset consisting of 1 million anonymized Belgian real estate deeds, which Fednot kindly agreed to let us work on for this experiment.
The idea: train an autocomplete model that can ultimately serve as a writing tool to assist notaries in writing real estate deeds. To make the tool even more useful to the notaries, we decided to spice up the model a bit by adding keywords as an extra side input. This allows the context of the generated text to be steered, as in the following example generated by our model:
Example output from our model. The special tokens <s> and </s> denote the start and end of a paragraph. English translations are obtained with Google Translate.
In this blog post we will discuss how we trained our model. We will cover both the choice of model architecture and the data preprocessing, including how we extracted keywords for our training data. At the end we will show some results and discuss possible improvements and future directions.
Text generation can be phrased as a language modeling task: predicting the next word given the previous words. Recurrent neural networks (RNNs) used to be the architecture of choice for this task because of their sequential nature and success in practice.
In an RNN language model, an input sequence of tokens (e.g. words or characters) is processed from left to right, token by token, and at each time step the model tries to predict the next token by outputting a probability distribution over the whole vocabulary. The token with the highest probability is the predicted next token. (For an in-depth explanation of how RNNs work, check out Andrej Karpathy’s legendary blog post “The Unreasonable Effectiveness of Recurrent Neural Networks”.)
An example RNN with character-level input tokens. Source: Andrej Karpathy.
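As a minimal sketch of that prediction step (with a toy vocabulary and made-up logits, not our actual model), one time step looks like this:

```python
import numpy as np

# Toy vocabulary and hypothetical logits the model outputs at one time step.
vocab = ["de", "het", "een", "akte", "<s>"]
logits = np.array([2.0, 1.0, 0.5, 3.0, -1.0])

# Softmax turns the logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The predicted next token is the one with the highest probability.
next_token = vocab[int(np.argmax(probs))]
print(next_token)  # "akte"
```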
More recently, the transformer architecture from the paper Attention Is All You Need (Vaswani et al.) has taken over the NLP landscape due to its computational efficiency. A transformer performs no sequential computation; instead, it relies on a self-attention mechanism that can be completely parallelized, thereby taking full advantage of modern accelerators such as GPUs. You can find a great explanation of the transformer architecture in Jay Alammar’s “The Illustrated Transformer”.
Transformers are particularly successful in large-scale settings, where you have several gigabytes of training data. An example of this is GPT-2, which was trained on 40GB of raw text from the internet at an estimated training cost of $256 per hour! However, RNNs (or their LSTM variant) remain competitive for language modeling and might still be better suited for smaller datasets.
Returning to our use case: we actually have enough data (around 17GB) to train a GPT-2-like model, but we chose an LSTM-based architecture instead. The data is quite repetitive, with many phrases reused across deeds, so we figured that a full-fledged transformer architecture would probably not be necessary to obtain good results. (Also check out the recent movement towards simpler and more sustainable NLP, e.g. the SustaiNLP 2020 workshop at EMNLP.)
In the end we went for a 4-layer LSTM model with embedding size 400 and hidden size 1150, in line with the architecture in Merity et al. We tie the weights of the input embedding layer to the output softmax layer, which has been shown to improve results for language modeling. Moreover, we add dropout regularization to the non-recurrent connections of the LSTM. (Adding dropout to the recurrent connections is not compatible with the CuDNN-optimized version of the LSTM architecture, which is needed for training efficiently on a GPU.)
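A sketch of this kind of architecture in PyTorch might look as follows. The layer sizes come from the description above; the dropout rate, and the choice to give the final LSTM layer output size 400 so that its output can be tied to the 400-dimensional embedding, are our own assumptions for illustration:

```python
import torch
import torch.nn as nn

class DeedLM(nn.Module):
    """Sketch of a stacked LSTM language model with tied input/output weights.
    Sizes follow the post; the final 400-unit LSTM layer is an assumption,
    needed so the output layer can share weights with the 400-dim embedding."""
    def __init__(self, vocab_size=32000, emb_size=400, hidden_size=1150, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        # Three LSTM layers of size 1150, then one of size 400 for weight tying.
        self.lstm1 = nn.LSTM(emb_size, hidden_size, num_layers=3, batch_first=True)
        self.lstm2 = nn.LSTM(hidden_size, emb_size, batch_first=True)
        # Dropout on the non-recurrent connections only (between layers).
        self.drop = nn.Dropout(dropout)
        self.decoder = nn.Linear(emb_size, vocab_size)
        self.decoder.weight = self.embedding.weight  # tie input/output weights

    def forward(self, tokens):
        x = self.drop(self.embedding(tokens))
        x, _ = self.lstm1(x)
        x, _ = self.lstm2(self.drop(x))
        return self.decoder(self.drop(x))  # logits over the vocabulary

# Tiny configuration just to illustrate the shapes.
model = DeedLM(vocab_size=100, emb_size=16, hidden_size=32)
logits = model(torch.randint(0, 100, (1, 10)))  # shape (batch, seq_len, vocab)
```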
We use subword tokenization since it provides a good compromise between character-level and word-level input (see this blog post for more explanation). More specifically, we train our own BPE tokenizer with a vocab size of 32k using the Hugging Face tokenizers library.
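Training such a BPE tokenizer with the Hugging Face tokenizers library takes only a few lines; the tiny repeated sentence below is a stand-in for the actual deed texts:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a BPE tokenizer with the paragraph markers used in the post.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "<s>", "</s>"])

# Stand-in corpus; in practice this would iterate over the deed paragraphs.
corpus = ["De verkoper verklaart de eigendom over te dragen."] * 100
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("De verkoper verklaart")
```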
Our basic language model architecture is now in place but the next question is how to incorporate keywords as side input. For this we take inspiration from the machine translation literature, where typically an encoder-decoder model with attention is used (originally introduced in Bahdanau et al.). Given a sentence in the source language, the encoder first encodes each token into a vector representation and the decoder then learns to output the translated sentence token by token by “paying attention” to the encoder representations. In this way the decoder learns which parts of the input sentence are most important for the output at each time step. (Once again, Jay Alammar has a nice visual blog post about this.)
In our case, we will “encode” our input keywords with a simple embedding lookup layer. Our “decoder” is the aforementioned LSTM model, with an extra attention layer added before the final softmax layer. This attention layer allows the language model to “pay attention” to the keywords before predicting the next token.
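The idea of attending over keyword embeddings can be sketched in a few lines of NumPy (toy dimensions and simple dot-product attention; the exact attention variant in our model may differ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical embeddings for 3 keywords (dim 4) and the LSTM hidden state h_i.
keyword_emb = np.random.randn(3, 4)
h = np.random.randn(4)

# Dot-product attention: score each keyword against the current hidden state,
# then take the weighted average of the keyword embeddings as the context.
scores = keyword_emb @ h           # shape (3,): one score per keyword
weights = softmax(scores)          # attention weights, sum to 1
context = weights @ keyword_emb    # shape (4,): fed into the final softmax layer
```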
Our final model architecture looks as follows:
High level picture of our model architecture at time step i.
In order to train our model we need a dataset consisting of pieces of text paired with their corresponding keywords. Our raw dataset from Fednot actually comes from scanned PDF files, so the first step is to convert these PDF files into (pieces of) text. Our full data pipeline looks as follows:
The model was trained for 10 epochs using the Adam optimizer with gradient clipping and a learning rate of 3e-4. We did not do a lot of hyperparameter tuning, so these training settings could probably be improved further. Our loss curve looks reasonable:
Loss per epoch during training, visualized using Tensorboard. The green curve is the training loss and the grey curve the validation loss. Both are smoothed with smoothing parameter 0.6.
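As an aside on the gradient clipping mentioned above: clip-by-norm rescales a gradient whenever its L2 norm exceeds a threshold (the threshold below is a placeholder, not the value we used):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Scale the gradient down if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return grad * scale

g = np.array([3.0, 4.0])        # ||g|| = 5
clipped = clip_by_norm(g, 1.0)  # rescaled to norm 1 -> [0.6, 0.8]
```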
It is finally time to take our newly trained model out for a spin, but before doing so, let us first have a quick look at how to actually generate text from a language model.
Recall that a language model outputs a probability distribution over the whole vocabulary, capturing how likely each token is to be the next one.
The easiest way to generate text is to simply take the most likely token at each time step, also known as greedy decoding. However, greedy decoding (or its less greedy variant, beam search) is known to produce quite boring and repetitive text that often gets stuck in a loop, even for sophisticated models such as GPT-2.
Another option is to sample the next token from the output probability distribution at each time step, which allows the model to generate more surprising and interesting text. Pure sampling, however, tends to produce text that is too surprising and incoherent, and while it is certainly more interesting than the text from greedy decoding, it often doesn’t make much sense.
Luckily there are also alternative sampling methods available that provide a good compromise between greedy decoding and pure sampling. Commonly used methods include temperature sampling and top-k sampling.
Example text generated from GPT-2 using beam search and pure sampling, respectively. Degenerate repetition is highlighted in blue while incoherent gibberish is highlighted in red. Source: Figure 1 from the paper “The Curious Case of Neural Text Degeneration”.
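Temperature and top-k sampling can be sketched as follows (toy logits; not our production code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, temperature=0.7):
    """Lower temperature sharpens the distribution; 1.0 leaves it unchanged."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

def sample_top_k(logits, k=2):
    """Sample only among the k most likely tokens; the rest get probability 0."""
    top = np.argsort(logits)[-k:]               # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return top[rng.choice(k, p=probs)]

logits = np.array([2.0, 1.0, 0.1, -1.0])
idx = sample_top_k(logits, k=2)  # always one of the two most likely tokens
```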
We chose to go with the more recently introduced sampling technique called nucleus sampling or top-p sampling, since it has been shown to produce the most natural and human-like text. It was introduced in the paper “The Curious Case of Neural Text Degeneration” (Holtzman et al) from last year, which is a very nice and interesting read that includes a comparison of different sampling strategies.
In nucleus sampling the output probability distribution is truncated so that we only sample from the most likely tokens — the “nucleus” of the distribution. The nucleus is defined by a parameter p (usually between 0.95 and 0.99) which serves as a threshold: we only sample from the most likely tokens whose cumulative probability mass just exceeds p. This allows for diversity in the generated text while removing the unreliable tail of the distribution.
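A minimal NumPy implementation of nucleus sampling over a toy distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def nucleus_sample(probs, p=0.95):
    """Sample from the smallest set of most likely tokens whose cumulative
    probability just exceeds p (the 'nucleus'), after renormalizing."""
    order = np.argsort(probs)[::-1]                    # most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
token = nucleus_sample(probs, p=0.9)  # only the three most likely tokens qualify
```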
Let’s have a look at some examples from our model (we use nucleus sampling with p=0.95 in all examples):
We see that the text is pretty coherent and that the model has actually learned to take the keywords into account. Next, let’s try some rarer keywords from our corpus, such as “brandweer” (occurring in only 51 training examples) and “doorgang” (occurring in 114 training examples).
The text is now quite nonsensical and includes some made-up words (gehuurgen?) but still the model managed to get the context more or less right.
Let’s finish off with a couple more examples using more keywords as input.
Our first results look promising, at least when using common keywords as input. For keywords that don’t occur very often in our training corpus, the model seems to get confused and generates poor quality text. In general, we also noticed artifacts of OCR mistakes and pseudonymization mistakes from our training data, which limits the quality of the generated text. There is still lots of potential for improving our results, for example by further cleaning up the training data and by tweaking or scaling up the model architecture.
One of the lessons learned from this experiment is that you don’t always need a huge transformer model to generate good-quality text; LSTMs can still be a viable alternative. The model architecture should be chosen according to your data. That being said, it would be interesting to further scale up our approach, using more training data and a bigger model, to analyse this trade-off.
Text generation is a fun topic within natural language processing but applying it to real-life use cases can be hard due to the lack of control of the output text. Our keyword-enriched autocomplete tool provides a way of controlling the output through the use of keywords. We hope that this will provide notaries with a useful writing tool for assisting in writing Dutch real estate deeds.