[Webinar] Innovating Media with AI and Cloud Technology

A webinar for media innovators seeking to learn about the deployment of Artificial Intelligence and Cloud technology across the media value chain and editorial processes.

In recent years, broadcasters, publishers and content producers have started to embrace the opportunities the digital era is bringing. Artificial intelligence and cloud technology in particular offer huge potential for media & entertainment companies – yet success to date has been reserved for the pioneering few.

AI can influence all parts of the media value chain, boosting creativity, efficiency and productivity of content editors and creators while helping consumers to find the content that matches their interests and current situation.

We will showcase trending use cases and customer stories, including that of leading Dutch public broadcaster AVROTROS, which offer lessons for all companies in the media & entertainment sector looking to grasp the essence of data, the deployment of cloud services and the benefits of AI/ML in editorial processes.

We believe that challenging times like these call for increased collaboration, innovation and sharing of knowledge. Join us and engage in insightful content with our intelligent technology experts and peers.

Agenda

  • 14:00 - 14:15 Welcome message, ML6 & Google Cloud
  • 14:15 - 14:35 Accelerating innovation in Media using intelligent technology, Jens Bontinck, ML6
  • 14:35 - 14:55 Innovation in Public Media, Finus Tromp, AVROTROS
  • 14:55 - 15:00 Break
  • 15:00 - 15:20 Google Cloud for Media & Entertainment, Rick van der Veken, Google Cloud
  • 15:20 - 15:30 Q&A session

View on-demand

[hubspot type=form portal=5386477 id=e9e9c3d1-cb76-481a-bc6f-0527a46634c1]

[Webinar] Productionizing AI in Manufacturing

A webinar for manufacturing innovators seeking to learn about the application of Artificial Intelligence in production.

Artificial Intelligence (AI) and Machine Learning have ignited the fourth industrial revolution. Integrating new technologies into manufacturing systems, along with data and predictive analytics, will minimize raw material consumption, improve efficiency and optimize supply chains, as well as increase sustainability.

This webinar showcases the real value of driving AI projects into production: a company's competitive advantage is minimized when proofs of concept never leave the conceptual phase. We will present trending use cases and customer stories that offer lessons for all companies in the manufacturing sector looking to grasp the true value of seeing their intelligent technology projects in live production.

Agenda

  • 14:00 - 14:05 Welcome message, ML6
  • 14:05 - 14:25 Leveraging AI to improve quality process by visual inspection, Jens Bontinck, ML6
  • 14:25 - 14:45 Engineering digital transformation in manufacturing with Cloud Technology, Stijn Floren, Google Cloud
  • 14:45 - 15:05 Developing AI for microchip production with Google Cloud, Arnaud Hubaux, ASML
  • 15:05 - 15:10 Closing message, ML6

View on-demand

[hubspot type=form portal=5386477 id=c80e9895-0850-4ea7-b053-dd8438789ed0]

How to help/fool an object detector

Surveillance cameras are a growing presence in public spaces across the world. It is predicted that their number will climb above 1 billion by the end of 2021. This fact, combined with the rapid increase in the performance and availability of computing resources, has led to growing interest in developing faster and more accurate people detection systems. Nowadays, free and open-source pre-trained models such as YOLOv4 can run in real time on commodity hardware with state-of-the-art performance and easy setup. Given these advances, concerns regarding privacy have been rising sharply. For this reason, researchers have developed an interest in how these models work and, in some cases, how they can be tricked. Generative Adversarial Networks (or GANs) have been used to this effect, creating "stealth images" and even stealth t-shirts that make the wearer largely invisible to object detection models.

Our first goal in this blogpost is to explore whether we can develop our own stealth images using open-source technologies and test them. Moreover, early experiments with stealth images were carried out on smaller models that are no longer state of the art, such as YOLO-tiny. Hence our second goal is to test whether the techniques described still work on state-of-the-art object detectors such as YOLOv4, which are much harder to fool. Finally, if we are to live in a world full of self-driving cars and robots, being seen rather than not being seen may also become a desirable goal, so our third goal is to invert the whole system and see if we can design an image for a t-shirt that helps object detectors recognize and avoid humans.

Interested in how we developed a stealth t-shirt and whether object detectors can be helped to recognize and avoid humans? Read the full interactive blogpost on our Medium blog.

A Natural Language Processing look at the Belgian governmental agreement

A tale of multilingualism, summarization and sentence embeddings

Earlier this quarter, Belgium finally got a federal government, after many months of haggling, discussion and debate.

Ok, so how do we want to tackle this?

Before, during and after elections, a lot of things get said, written and stated by all parties involved. After a government is formed, they release a so-called "Regeerakkoord", a governmental agreement if you will, which is essentially a large written mission statement of the focus points of the formed government. So: text, text, text, text. We're sensing a little NLP (Natural Language Processing) groove going on here.


They certainly seem eager to get to it (source: De Standaard)

Our plan of attack (a rough code sketch of the last steps follows the list):

  • gather texts from the government standpoints on various topics
  • translate them to English
  • perform abstractive summarization
  • perform sentence embedding
  • do dimensionality reduction
  • plot them in a nice graph to compare
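
To make the embedding, dimensionality reduction and plotting steps concrete, here is a minimal illustrative sketch. It is not the pipeline actually used in the post; it assumes the sentence-transformers, scikit-learn and matplotlib packages and uses toy stand-in texts:

import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Toy stand-ins for the translated and summarized standpoint texts.
summaries = {
    "party A": "Invest in renewable energy and public transport.",
    "party B": "Lower taxes on labour and support small businesses.",
    "agreement": "Invest in sustainable mobility and reform labour taxation.",
}

# Embed each summary, then reduce the embeddings to 2D for plotting.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(list(summaries.values()))
points = PCA(n_components=2).fit_transform(embeddings)

for name, (x, y) in zip(summaries, points):
    plt.scatter(x, y)
    plt.annotate(name, (x, y))
plt.show()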

Interested in the plot and more? Read the full interactive blogpost on our Medium blog.


A practical guide to fine-tuning EfficientDet for object detection

Retraining EfficientDet for high-accuracy object detection: a practical guide to transfer learning on a custom dataset

Many computer vision projects today revolve around identifying specific objects in pictures or videos and then performing certain operations on them, such as classification or segmentation. Typically, a labelled dataset is created by domain experts, and machine learning engineers then train, or rather retrain, an object detection model on this dataset (applying transfer learning) so that it mimics the experts' knowledge. When your use case primarily requires high speed, such as video processing, one of the YOLO variants is probably the go-to model today; but when your use case requires high accuracy with manageable size and speed, EfficientDet should probably be your first choice.

This blog post explains and demonstrates how to optimize this process when aiming for high accuracy using Google’s lightweight EfficientDet object detection model. (Want to jump directly to code?➡ Colab Notebook)

EfficientDet: a lightweight, high-accuracy object detection model

EfficientDet is an object detection model that was published by the Google Brain team in March 2020. It achieves a state-of-the-art 53.7% COCO average precision (AP) with fewer parameters and FLOPs than previous detectors such as Mask R-CNN. It exists in 8 base variants, D0 to D7, with increasing size and accuracy. Moreover, an extra-large version, D7x, was released recently, which achieves 55.1% AP.

Choosing an EfficientDet model involves a trade-off between accuracy and performance, which depends on the use case. For example, if the model needs to be deployed on an edge device, then a small variant should be used. Similarly, if the goal is to use the model for real-time video analysis, a smaller variant would be preferred because it provides higher frames per second (FPS). On the other hand, larger variants are suitable for those tasks that tolerate longer inference time, e.g. one-time analysis of static images, as they show better results in terms of accuracy.

In this blog post, we provide a practical guide to fine-tuning EfficientDet on a custom dataset. As an example use case we will use license plate detection, which is relevant for both surveillance and anonymization applications. We will focus on the following:

  • Convert the custom dataset into the required TFRecord format
  • Explain the usage and effect of the most important hyperparameters and give some tips
  • Describe how to fine-tune EfficientDet using GPU
  • Benchmark on the test set

Getting the model

The original implementation of EfficientDet is open-sourced in the Google AutoML repository on GitHub. The repository is still actively being maintained, thus the implementation details in the Colab notebook may change somewhat as time goes by.

Notice that EfficientDet is also integrated into the TensorFlow Object Detection API v2. In this tutorial, however, we will focus on the original implementation, as it is a pure implementation containing the latest features.

Choosing and preparing the dataset

The Chinese City Parking Dataset (CCPD) is a comprehensive license plate dataset containing over 250k unique car images with carefully annotated bounding boxes around the license plates. The images in the dataset represent real-world conditions, featuring distortions such as rotation, snow or fog, uneven illumination and blurriness. This makes inference on this dataset a challenging proposition, requiring a powerful, high-precision object detection model.

Example images and labels in the CCPD2019 dataset.

Before training the model, we first need to preprocess the dataset. CCPD is already split into a training set, validation set and test set, each described by a text file. The labels are included in the filenames of the images, from which the bounding boxes' coordinates and other information can be extracted by string parsing, e.g. using regular expressions. Finally, we need to convert the raw image files and labels into the TFRecord format, which is required by EfficientDet.
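
As an illustration, here is a sketch of what that parsing and TFRecord conversion could look like. The CCPD filename layout and the exact TFRecord feature keys expected by the AutoML dataset tools are assumptions here; double-check them against the repository's dataset scripts:

import tensorflow as tf

def parse_ccpd_bbox(filename):
    # CCPD encodes all labels in the filename. We assume the fields are
    # separated by "-" and that the third field holds the bounding box
    # corners as "x1&y1_x2&y2".
    fields = filename.rsplit(".", 1)[0].split("-")
    top_left, bottom_right = fields[2].split("_")
    x1, y1 = (int(v) for v in top_left.split("&"))
    x2, y2 = (int(v) for v in bottom_right.split("&"))
    return x1, y1, x2, y2

def to_tf_example(image_path, width, height):
    # Store the encoded image plus a normalized bounding box and class label,
    # following the usual TF object detection TFRecord conventions.
    x1, y1, x2, y2 = parse_ccpd_bbox(image_path.split("/")[-1])
    with open(image_path, "rb") as f:
        encoded = f.read()
    def _float(v):
        return tf.train.Feature(float_list=tf.train.FloatList(value=[v]))
    feature = {
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[encoded])),
        "image/object/bbox/xmin": _float(x1 / width),
        "image/object/bbox/ymin": _float(y1 / height),
        "image/object/bbox/xmax": _float(x2 / width),
        "image/object/bbox/ymax": _float(y2 / height),
        "image/object/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[1])),  # 1 = license_plate
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))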

Hyperparameter tuning

EfficientDet provides a set of hyperparameters which allow for the alteration of both the network architecture and the training strategy. These should be fine-tuned in accordance with the training data, so that the model can fit the data better which usually results in more efficient training and better results.

The default hyperparameter settings can be found in the configuration file hparams_config.py, wrapped in the default_detection_configs function. However, while these editable hyperparameters provide a lot of flexibility, they are not always documented in detail, and the usage or scope of some of them can be confusing. Hence we give a brief description of the most important hyperparameters and how to fine-tune them effectively.

image_size:

Can be set to an integer, e.g. 640, which stands for 640x640. A rectangular resolution can be specified with a string such as "640x320". However, in our experience, a rectangular resolution results in a lower mAP than a square resolution under otherwise identical settings.

In order to ensure the correct sizes of the backbone’s convolutional layers, it is advised to set both the width and height of the image to be a multiple of 16.

input_rand_hflip:

EfficientDet supports a number of image preprocessing operations. By default the input image has a 50% chance of being flipped horizontally, which acts as a basic image augmentation technique.

jitter_min/max:

These two hyperparameters define the range of the random scale factor for the scaling preprocessing, which is another basic image augmentation technique. By default they are set to 0.1 and 2.0 respectively, meaning each input image will be scaled by a random factor within range [0.1, 2.0]. If the scaled image is larger than the required input size, a random region with the required input size will be cropped. If the scaled image is smaller than the required input size, the image will be padded with zeros to match the required input size.
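
In NumPy terms, the effect of these two hyperparameters is roughly the following. This is a simplified sketch, not the actual TensorFlow implementation:

import numpy as np

def random_scale_crop_pad(img, out_size, jitter_min=0.1, jitter_max=2.0):
    # Scale by a random factor (nearest-neighbour resize for brevity), then
    # randomly crop if too large, or zero-pad if too small.
    rng = np.random.default_rng()
    factor = rng.uniform(jitter_min, jitter_max)
    h, w = img.shape[:2]
    nh, nw = max(1, int(h * factor)), max(1, int(w * factor))
    rows = np.arange(nh) * h // nh
    cols = np.arange(nw) * w // nw
    scaled = img[rows][:, cols]
    y0 = rng.integers(0, max(nh - out_size, 0) + 1)
    x0 = rng.integers(0, max(nw - out_size, 0) + 1)
    crop = scaled[y0:y0 + out_size, x0:x0 + out_size]
    canvas = np.zeros((out_size, out_size) + img.shape[2:], dtype=img.dtype)
    canvas[:crop.shape[0], :crop.shape[1]] = crop
    return canvas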

use_augmix:

EfficientDet supports many data augmentation strategies, as described in this paper. This hyperparameter can be set to a string "v0", "v1", "v2" or "v3" to use different compound data augmentations. More details are explained in aug/autoaugment.py.

num_classes:

Number of classes. It should be set to the number of object types in your dataset. In our case there is only one type of object that needs to be detected: license plates. We therefore set this hyperparameter to 1. Notice that in the current implementation the background class is implicitly reserved. This may change in the future, so pay attention to it, as it directly affects the model outputs.

label_map:

The mapping between the class as an integer and its label in string format. In our case we can set it to a one item dictionary “{1: license_plate}” in the YAML configuration file. The label “license_plate” will then be plotted when visualizing the detection results.

aspect_ratios:

EfficientDet is an anchor-based detector, so the anchor settings are vital to model training. This hyperparameter defines a list of floats, where each float represents an aspect ratio (w/h) of the anchor box. By default there are 3 aspect ratios: 1.0, 2.0 and 0.5. However, when we retrain EfficientDet on our custom dataset, we should modify them according to our data. The aspect ratios should reflect the average shape of the objects in the dataset, so that the model can easily fit the anchor box to the ground truth bounding box. If the anchors' aspect ratios deviate too much from the actual objects' aspect ratios, the detector becomes harder to train, because it has to rely on the regression subnet to fix the coordinates of the resulting bounding boxes.

The K-means clustering algorithm is generally used to find the average shape of objects in the custom dataset. For more details, you may refer to this repository.
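
For example, with scikit-learn (the box sizes below are toy values for illustration):

import numpy as np
from sklearn.cluster import KMeans

# (width, height) of ground-truth license plate boxes; toy values here.
boxes = np.array([[140, 45], [152, 50], [133, 41], [160, 48], [120, 44]])
ratios = (boxes[:, 0] / boxes[:, 1]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10).fit(ratios)
print(sorted(float(c) for c in kmeans.cluster_centers_.ravel()))
# Use the printed cluster centers as the aspect_ratios hyperparameter.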

num_scales:

Each anchor box can also have multiple scales. This hyperparameter is set to 3 by default which can often be left unchanged. You may change it if you want to have anchor boxes with more fine-grained scales, e.g. when you have large input images.

Assuming there are 3 aspect ratios and 3 scales, then there are in total 3x3=9 anchor boxes at each anchor position in the image.

anchor_scale:

This is a general scale factor for all the anchor boxes. It’s set to 4.0 by default which can normally be left unchanged. You may decrease/increase it when you’re facing mainly small/large objects in your dataset, so that all the anchor boxes are scaled down/up, which makes it easier to fit the actual objects.

is_training_bn:

EfficientDet contains batch normalization layers, whose mean and variance statistics are handled differently during training and inference. To ensure the correct statistics are used, this hyperparameter is automatically set to True during training and to False during inference; there is no need to create two configuration files to specify it.

heads:

It can be set to [‘object_detection’] to create an object detection model, or set to [‘segmentation’] to create an object segmentation model. Notice that the original EfficientDet paper only discusses the object detection model. The image segmentation functionality was added recently.

strategy:

Set to None to use the default TensorFlow training strategy (a single CPU or GPU). Set it to "gpus" to train on multiple GPUs, or to "tpu" to train on a TPU.

mixed_precision:

Set to False to always use FP32 precision, or to True to use a mix of FP16 and FP32. Using FP16 loses some precision but speeds up model training, and it can save a significant amount of GPU memory so that a larger model can be fitted. However, based on our experience with NVIDIA Tesla P100 and NVIDIA RTX 2080 Ti GPUs, using mixed precision results in very slow training. The reason could be an incompatibility between some GPUs and EfficientDet's implementation.

var_freeze_expr:

By default, the training of EfficientDet will apply to all layers. However, sometimes freezing some layers can provide better results. This hyperparameter is a regular expression that indicates which layers in the network we would like to freeze. We can use the following script to print out all layer names of a checkpoint file:

from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file

# CKPT_PATH is the path prefix of the checkpoint, e.g. the downloaded efficientdet-d1 weights
print_tensors_in_checkpoint_file(file_name=CKPT_PATH, tensor_name='', all_tensors=False)

Based on the printed layer names, we can set this hyperparameter to "(efficientnet)" to freeze the backbone, or to "(efficientnet|fpn_cells|resample_p6)" so that only class_net and box_net are fine-tuned.

Modifying the hyperparameters

If you want to modify these hyperparameters, you can create a YAML file and pass its path to the training script. Then your custom setting will override the default settings in hparams_config.py.

In our use case, we create the following YAML file to specify the image resolution, class information and basic data augmentation parameters:

image_size: 1024x1024
num_classes: 1
label_map: {1: license_plate}
input_rand_hflip: true
jitter_min: 0.8
jitter_max: 1.2

Pre-trained weights

Since we want to fine-tune EfficientDet, we first need to download the weights that resulted from pretraining on the COCO dataset. The pre-trained weights of all model variants are provided in the Google AutoML repository. You can download the appropriate model and decompress it into the automl/efficientdet directory (where main.py is located).

In our case, we download a lightweight variant, EfficientDet D1, which has 6.6M parameters and whose pretrained weights are 46.6MB in size. This should be sufficient to achieve a relatively high mAP on the CCPD dataset.

Training

When the TFRecords, hyperparameter configuration and pre-trained model are all ready, we can start training by running main.py. There are several additional arguments that we can pass to main.py. In our case, we fine-tune the pre-trained EfficientDet D1 for 1 epoch of 5000 steps. We set the batch size to 2, so that 11GB of GPU memory is sufficient. Since the total number of steps is 5000 and the batch size is 2, in total 10,000 examples from the training set are used.
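
For reference, the training invocation looks roughly like this. The flag names are our recollection of the repository at the time of writing; check main.py for the current set:

python main.py --mode=train_and_eval \
    --training_file_pattern=tfrecord/train*.tfrecord \
    --validation_file_pattern=tfrecord/val*.tfrecord \
    --model_name=efficientdet-d1 \
    --model_dir=/tmp/efficientdet-d1-ccpd \
    --ckpt=efficientdet-d1 \
    --train_batch_size=2 \
    --num_epochs=1 \
    --num_examples_per_epoch=10000 \
    --hparams=config.yaml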

During training, model checkpoints are automatically saved to a temporary directory. In fact, when we run the training script, it first tries to resume training from the latest checkpoint in this temporary directory; only if none is found does it start from the original pre-trained checkpoint that we put in the efficientdet folder. Therefore, make sure to delete this temporary folder whenever you intend to retrain from the original pre-trained model. In particular, when you change some hyperparameters and wish to restart the fine-tuning process, deleting this temporary directory avoids strange errors.

The training should finish in about one hour on a single GPU and should result in around 74% mAP, 100% mAP50 and 94% mAP75 on the validation set, which is far from bad in view of the small training set and short training time.

Exporting and inference

The retrained model is saved as a checkpoint file, in which the model weights are stored but not the model architecture. In order to run inference with our fine-tuned model, we first need to export it as a SavedModel (.pb format). Notice that the pre-processing and post-processing operations will also be integrated into the exported model, so there are some parameters that you should specify during exporting, e.g. the minimum score threshold used to filter out low-confidence bounding boxes.
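
Exporting can be done with the repository's model_inspect.py script; roughly as follows (again, treat the flag names as assumptions and check the script for the current ones):

python model_inspect.py --runmode=saved_model \
    --model_name=efficientdet-d1 \
    --ckpt_path=/tmp/efficientdet-d1-ccpd \
    --hparams=config.yaml \
    --saved_model_dir=savedmodeldir \
    --min_score_thresh=0.4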

Finally, we ran inference on the "ccpd_fn" subset to visually inspect the detections.


Conclusion

EfficientDet is a powerful yet lightweight object detector that is relatively easy to retrain on custom datasets. Its model variants cover most use cases in terms of either speed or accuracy. Although EfficientDet is in general slower than YOLOv4, it delivers higher accuracy and is rather easy to get started with. The Google AutoML implementation of EfficientDet supports a variety of hyperparameter settings that allow you to easily fine-tune the model. Hence, whenever you face an object detection problem, EfficientDet is worth considering. We hope this tutorial helps you get EfficientDet working on your own dataset. (Our code can be found in this Colab Notebook.)

 

ML6 nominated for Deloitte's 2020 Technology Fast 50

ML6 has been nominated for the 2020 Deloitte Technology Fast 50, a competition for technology companies headquartered and founded in Belgium. The Fast 50 award is given to the country's fastest-growing technology company, based on its percentage growth in turnover over the last four years. The winning companies will be announced at the awards ceremony on 26 November 2020.

ML6 helps organizations address today's problems and challenges by implementing intelligent technology. Our team of Machine Learning experts is here to support any company in its digital transformation and implementation of AI. By applying the latest AI research, we keep our customers at the forefront of innovation.

We are proud to be nominated for the Deloitte Technology Fast 50. Our strategy and continuous growth over the past few years have strengthened our position in the market. We consider this nomination a confirmation that we are heading in the right direction and a recognition of our activities.

About Deloitte's 2020 Technology Fast 50

The Technology Fast 50 competition is an annual selection of the 50 fastest-growing, most innovative technology companies headquartered and founded in Belgium. Public or private companies that develop a technology-related product or service and have experienced substantial revenue growth over the last four years can enter the competition for a chance to be named one of the 50 fastest-growing technology companies in Belgium.

Companies that have been active in the technology sector for less than four years can participate in the special Rising Star category of the competition. These companies are judged by an independent jury on their turnover potential and scalability.

Participating in the Technology Fast 50 competition can help companies to develop their business by increasing their visibility and giving them access to the Fast 50’s unique network of highly successful executives.

About ML6

ML6 is a leading artificial intelligence company that helps organizations accelerate their business and achieve a competitive edge through the strategic, efficient and rapid deployment of AI. ML6 provides world-class AI expertise and builds custom state-of-the-art solutions for high-stakes use cases. In four years, ML6 has built a team of close to 80 senior AI experts, with offices in Ghent, Amsterdam and Berlin.


ML6 Wins ‘Artificial Intelligence Innovator of the Year’

ML6 has won the award for 'AI Innovator of the Year' at the prestigious Data News Awards for Excellence. This award recognizes IT companies that set the trend in creating innovation and adopting artificial intelligence technologies, either at the backend level or the customer-facing level. The jury was especially enthusiastic about the fact that ML6 is often brought in for small projects, after which it stays on for larger, long-term assignments. This clearly shows that ML6 creates added value for its customers. According to the jury, ML6 is also the company most capable of scaling smoothly.

Watch the official announcement here.

Thanks to our extensive experience and having executed over 130 unique use cases, ML6 does not merely accomplish 'virtual progress', but delivers pragmatic projects with measurable impact. ML6 goes beyond most AI providers by evaluating core business activities with and for our clients, to co-create new value strategies. This award is a great acknowledgement of how ML6 accelerates intelligence: by accelerating the adoption of intelligent technology across industries, realising business impact with our customers, and accelerating the personal and professional development of our exceptional talent.

About the Data News Awards

The Data News Awards for Excellence is the most prestigious ICT event in Belgium and was held for the 21st time this year. No less than 14 exceptional awards were announced, each one a symbol of excellence in its category. These awards are the result of a solid selection process involving the editorial staff of Data News, the readers of Data News and a jury of ICT professionals. For more information, visit http://datanewsawards.be/.

[Webinar] Improving the Shopping Experience with the Help of Artificial Intelligence (Computer Vision & Advanced Segmentation)

In this webinar hosted by Google Cloud, you will learn about different ways data-driven decision making and automation can help retailers retain their customers and increase profit margins. Carlo Schmidt and Florentijn Degroote will present customer references and practical examples. Based on two specific use cases, they will show how Computer Vision can improve the customer experience and how retailers can make better use of their consumer data to address customers more accurately. This webinar is aimed at marketing and e-commerce experts as well as IT and business unit managers from the retail sector, and will be held in English.

Watch on-demand

[Webinar] Accelerate Life Sciences with AI

Artificial intelligence and cloud technology promise vast opportunities for economic benefits in the Pharmaceutical and Biotechnology domains in Life Sciences. This webinar provides a comprehensive view of the role that AI plays within the industry while also outlining the key challenges and opportunities within specific segments such as Research & Discovery, Drug Development, Manufacturing and Sales & Marketing. In times of disruptive innovation, success is based on a company’s ability to adapt, innovate and collaborate. This virtual event therefore serves as a platform that fosters industry relationships, providing the opportunity for the development of intelligent business solutions. Join us and engage in insightful content with our intelligent technology experts and industry peers.

When and where

Wed Nov 25 2020 / 14:00 - 16:30 CET

Agenda

  • 14:00 - 14:15 Welcome message, ML6 & Google Cloud
  • 14:15 - 14:45 Accelerate innovation in Life Sciences using intelligent technology, ML6
  • 14:45 - 15:00 Use of Machine Learning on hyperspectral images for automated classification of vaccine cakes, GSK
  • 15:00 - 15:30 Engineering digital transformation in Life Sciences with cloud technology, Google Cloud Benelux
  • 15:30 - 15:45 Pharmaceutical customer case study, Google UK
  • 15:45 - 16:30 Showcasing innovative industry-focused use cases and Q&A session

Register

Neural Text Generation From 1 Million Belgian Real Estate Deeds

How we trained a large-scale keyword-to-text model for composing real estate deeds in Dutch.

The basic idea

Generating text with artificial neural networks, or neural text generation, has become very popular over the last couple of years. Large-scale transformer language models such as OpenAI's GPT-2/3 have made it possible to generate very convincing text that looks like it could have been written by a human. (If you haven't tried it out yourself already, I highly recommend checking out the Write With Transformer page.) While this raises a lot of concern about potential misuse of the technology, it also brings a lot of potential. Many creative applications have already been built using GPT-2, such as the text-based adventure game AI Dungeon 2. The key idea behind such applications is to fine-tune the language model on your own dataset, which teaches the model to generate text in line with your specific domain. Make sure to check out this blog post about how my colleagues at ML6 used this approach for generating song lyrics.

However, it is often difficult to generate text of equal quality in languages other than English, since no large pre-trained models are available in other languages. Training these large models from scratch is prohibitive for most people due to the extreme amounts of compute power necessary (as well as the need for a large enough dataset).

Applying this idea

At ML6 we wanted to experiment with large-scale Dutch text generation on a unique dataset. From our collaboration with Fednot (the Royal Federation of Belgian Notaries), we created a large dataset consisting of 1 million anonymized Belgian real estate deeds, which Fednot kindly agreed to let us work on for this experiment. The idea: train an autocomplete model that can ultimately be used as a writing tool to assist notaries in writing real estate deeds. To make the tool even more useful to the notaries, we decided to spice up the model a bit by adding keywords as extra side input. This allows for steering the context of the text to be generated, as in the following example generated by our model:

Example output from our model. The special tokens <s> and </s> denote the start and end of a paragraph. English translations are obtained with Google Translate.

In this blog post we will discuss how we trained our model. We will cover both the choice of model architecture and the data preprocessing, including how we extracted keywords for our training data. At the end we will show some results and discuss possible improvements and future directions.

The model architecture

Text generation can be phrased as a language modeling task: predicting the next word given the previous words. Recurrent neural networks (RNNs) used to be the architecture of choice for this task because of their sequential nature and success in practice.

RNNs

In an RNN language model, an input sequence of tokens (e.g. words or characters) is processed from left to right, token by token, and at each time step the model tries to predict the next token by outputting a probability distribution over the whole vocabulary. The token with the highest probability is the predicted next token. (For an in-depth explanation of how RNNs work, check out Andrej Karpathy's legendary blog post "The Unreasonable Effectiveness of Recurrent Neural Networks".)

An example RNN with character-level input tokens. Source: Andrej Karpathy.

Leveling up: Transformers

More recently, the transformer architecture from the paper Attention Is All You Need (Vaswani et al) has taken over the NLP landscape thanks to its computational efficiency. A transformer model involves no sequential computation; instead it relies on a self-attention mechanism that can be completely parallelized, thereby taking full advantage of modern accelerators such as GPUs. You can find a great explanation of the transformer architecture in Jay Alammar's "The Illustrated Transformer". Transformers are particularly successful in large-scale settings, where you have several gigabytes of training data. An example of this is GPT-2, which was trained on 40GB of raw text from the internet at an estimated training cost of $256 per hour! However, RNNs (or their LSTM variant) remain competitive for language modeling and might still be better suited for smaller datasets.

Our model

Returning to our use case, we actually have enough data (around 17GB) to train a GPT-2-like model. However, the data is quite repetitive, as many phrases are reused across deeds, so we figured a full-fledged transformer architecture would probably not be necessary to obtain good results and chose an LSTM-based architecture instead (also check out the recent movement towards simpler and more sustainable NLP, e.g. the SustaiNLP 2020 workshop at EMNLP).

In the end we went for a 4-layer LSTM model with embedding size 400 and hidden size 1150, in line with the architecture in Merity et al. We tie the weights of the input embedding layer to the output softmax layer, which has been shown to improve results for language modeling. Moreover, we add dropout regularization to the non-recurrent connections of the LSTM. (Adding dropout to the recurrent connections is not compatible with the CuDNN-optimized version of the LSTM architecture, which is needed for training efficiently on a GPU.) We use subword tokenization since it provides a good compromise between character-level and word-level input (see this blog post for more explanation). More specifically, we train our own BPE tokenizer with a vocab size of 32k using the Hugging Face tokenizers library.

Introducing key words

Our basic language model architecture is now in place, but the next question is how to incorporate keywords as side input. For this we take inspiration from the machine translation literature, where typically an encoder-decoder model with attention is used (originally introduced in Bahdanau et al). Given a sentence in the source language, the encoder first encodes each token into a vector representation and the decoder then learns to output the translated sentence token by token by "paying attention" to the encoder representations. In this way the decoder learns which parts of the input sentence are most important for the output at each time step. (Once again, Jay Alammar has a nice visual blog post about this.) In our case, we "encode" our input keywords with a simple embedding lookup layer. Our "decoder" is the aforementioned LSTM model, with an extra attention layer added before the final softmax layer. This attention layer allows the language model to "pay attention" to the keywords before predicting the next token. Our final model architecture looks as follows:

High level picture of our model architecture at time step i.
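
As a rough Keras sketch of this architecture (simplified, with assumed details such as the dropout rate and a keyword embedding shared with the token embedding; the real model follows Merity et al more closely):

import tensorflow as tf

VOCAB_SIZE = 32_000          # BPE vocabulary size
EMB_DIM, HIDDEN = 400, 1150  # embedding and hidden sizes from the post

tokens = tf.keras.Input(shape=(None,), dtype="int32", name="tokens")
keywords = tf.keras.Input(shape=(None,), dtype="int32", name="keywords")

embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)
x = embedding(tokens)
for _ in range(4):                                   # 4 stacked LSTM layers
    x = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(x)
    x = tf.keras.layers.Dropout(0.3)(x)              # non-recurrent dropout (rate assumed)
x = tf.keras.layers.Dense(EMB_DIM)(x)                # project back to embedding size

# "Encode" the keywords with an embedding lookup and let the language
# model attend over them before the final softmax.
kw = embedding(keywords)
attended = tf.keras.layers.Attention()([x, kw])
x = tf.keras.layers.Add()([x, attended])

# Tie the output projection to the input embedding matrix.
logits = tf.keras.layers.Lambda(
    lambda t: tf.matmul(t, embedding.embeddings, transpose_b=True))(x)

model = tf.keras.Model([tokens, keywords], logits)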

The data

In order to train our model we need a dataset consisting of pieces of text paired with their corresponding keywords. Our raw dataset from Fednot actually comes from scanned PDF files so the first step is to convert these PDF files into (pieces of) text. Our full data pipeline looks as follows:
  1. Clause detection: A deed is built up of clauses. For example there could be a clause describing the real estate property, another clause about the price, and so on. We start by training a clause detection model, which splits up a full deed PDF file into its building block clauses using computer vision.
  2. OCR: We convert each clause into text by using out-of-the-box optical character recognition. This results in somewhat noisy text, which we clean up a bit by filtering out clauses with the worst OCR mistakes.
  3. Pseudonymization: The clauses are pseudonymized by replacing all sensitive named entities, such as names, addresses and account numbers, with randomized entities. The advantage of this approach is that even if our model would ever miss an entity, an adversary would never be able to recognize which entities are real and which are randomized. We trained our own custom named entity recognition model using spaCy.
  4. Splitting into paragraphs: Full clauses can be very long and diverse, so we split up each clause into paragraphs that are easier to capture with keywords. Moreover, we filter out paragraphs that are either very short or very long because those paragraphs are more difficult to learn from. We only keep paragraphs that are between 8 and 256 tokens long.
  5. Keyword extraction: For each paragraph we extract all nouns, verbs and adjectives using spaCy. These are the keyword candidates. In some cases the paragraph starts with a short title, which we can detect heuristically. In these cases we include the title as a keyword since it contains very concise information on what the paragraph is about. We now have a (rather large and diverse) list of keywords per paragraph. At training time we sample a number of these keywords based on their TF-IDF weights so that rare words get sampled more frequently. The number of keywords to sample is also changed dynamically during training: it can be anywhere from 0 to 6 keywords per paragraph. This allows for a very diverse way of using the model since the model learns to handle both inputs without any keywords and inputs with up to 6 keywords.
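
The keyword-candidate extraction in step 5 boils down to a part-of-speech filter, roughly like this (illustrative sketch; it assumes the small Dutch spaCy model):

import spacy

nlp = spacy.load("nl_core_news_sm")  # small Dutch spaCy model

def keyword_candidates(paragraph):
    # Keep nouns, verbs and adjectives as keyword candidates.
    doc = nlp(paragraph)
    return [tok.text.lower() for tok in doc if tok.pos_ in {"NOUN", "VERB", "ADJ"}]

print(keyword_candidates("De verkoper verkoopt het onroerend goed aan de koper."))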

Training time

The model was trained for 10 epochs using the Adam optimizer with gradient clipping and a learning rate of 3e-4. We did not do a lot of hyperparameter tuning, so these training settings could probably be improved further. Our loss curve looks reasonable:

Loss per epoch during training, visualized using TensorBoard. The green curve is the training loss and the grey curve the validation loss. Both are smoothed with smoothing parameter 0.6.
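
As an aside, that optimizer setup amounts to a single line in Keras (the clipping threshold here is our assumption, as the post does not specify it):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4, clipnorm=1.0)  # clipnorm value assumed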

Inference time

It is finally time to take our newly trained model out for a spin but before doing so, let us first have a quick discussion on how to actually generate text from a language model. Recall that a language model outputs a probability distribution over the whole vocabulary, capturing how likely each token is of being the next token.

Greedy decoding

The easiest way to generate text is to simply take the most likely token at each time step, also known as greedy decoding. However, greedy decoding (or its less greedy variant, beam search) is known to produce quite boring and repetitive text that often gets stuck in a loop, even for sophisticated models such as GPT-2.

Pure sampling

Another option is to sample the next token from the output probability distribution at each time step, which allows the model to generate more surprising and interesting text. Pure sampling, however, tends to produce text that is too surprising and incoherent, and while it is certainly more interesting than the text from greedy decoding, it often doesn’t make much sense.

Best of both worlds

Luckily there are also alternative sampling methods available that provide a good compromise between greedy decoding and pure sampling. Commonly used methods include temperature sampling and top-k sampling.

Example text generated from GPT-2 using beam search and pure sampling, respectively. Degenerate repetition is highlighted in blue while incoherent gibberish is highlighted in red. Source: Figure 1 from the paper "The Curious Case of Neural Text Degeneration".

We chose to go with the more recently introduced technique called nucleus sampling or top-p sampling, since it has been shown to produce the most natural and human-like text. It was introduced in last year's paper "The Curious Case of Neural Text Degeneration" (Holtzman et al), a very nice and interesting read that includes a comparison of different sampling strategies. In nucleus sampling the output probability distribution is truncated so that we only sample from the most likely tokens, the "nucleus" of the distribution. The nucleus is defined by a parameter p (usually between 0.95 and 0.99) which serves as a threshold: we only sample from the most likely tokens whose cumulative probability mass just exceeds p. This allows for diversity in the generated text while removing the unreliable tail of the distribution.

Let's have a look at some examples from our model (we use nucleus sampling with p=0.95 in all examples). We see that the text is pretty coherent and the model has actually learned to take the keywords into account. Next, let's try some rarer keywords from our corpus, such as "brandweer" (occurs in only 51 training examples) and "doorgang" (occurs in 114 training examples). The text is now quite nonsensical and includes some made-up words (gehuurgen?) but the model still managed to get the context more or less right. Let's finish off with a couple more examples using more keywords as input.
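
In code, nucleus sampling is only a few lines; a minimal NumPy sketch:

import numpy as np

def nucleus_sample(probs, p=0.95, rng=None):
    # Keep the smallest set of most likely tokens whose cumulative
    # probability just exceeds p (the "nucleus"), renormalize and sample.
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    nucleus = order[:cutoff]
    return rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum())

# Example: a tiny vocabulary with one dominant token.
print(nucleus_sample(np.array([0.6, 0.25, 0.1, 0.04, 0.01])))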

Conclusion

Our first results look promising, at least when using common keywords as input. For keywords that don't occur very often in our training corpus, the model seems to get confused and generates poor quality text. In general, we also noticed artifacts of OCR mistakes and pseudonymization mistakes from our training data, which limits the quality of the generated text. There is still lots of potential for improving our results, for example by further cleaning up the training data and by tweaking or scaling up the model architecture.

One of the lessons learned from this experiment is that you don't always need a huge transformer model in order to generate good quality text, and that LSTMs can still be a viable alternative. The model architecture should be chosen according to your data. That being said, it would be interesting to further scale up our approach, using more training data and a bigger model, to analyse this trade-off.

Text generation is a fun topic within natural language processing, but applying it to real-life use cases can be hard due to the lack of control over the output text. Our keyword-enriched autocomplete tool provides a way of controlling the output through the use of keywords. We hope that this will provide notaries with a useful writing tool for assisting in writing Dutch real estate deeds.