
How ML6 Built a Multilingual COVID-19 Assistant Powered by NLP

10 April 2020, 09:06

The COVID-19 pandemic is changing the world rapidly. Every day, new guidelines or containment measures are announced and it is not easy to keep up with the latest updates. There is a wealth of information available on the internet, but it is scattered across many different websites, some more trustworthy than others, and finding the right answers can take a lot of time. This is especially the case for expats or immigrants who may not speak the native language of their country of residence.

Keeping people well-informed is an important step in flattening the curve. This is why we at ML6 decided to build a multilingual COVID-19 assistant powered by AI, more specifically Natural Language Processing (NLP). Our COVID-19 assistant makes it possible to ask a question in any of 16 languages and search through official (Belgian) FAQs for similar questions and answers. You can try it out for yourself at corona.ml6.eu.

In this blog post I will explain how I built the COVID-19 assistant in just 3 days. First, we will have a quick look at how the FAQ database was scraped from multiple source URLs using the Python library Scrapy. Then we will dive into the part where the magic happens: making the FAQ database searchable in a smart and multilingual way using state-of-the-art NLP techniques. Finally, I will explain how I built an API around this and deployed it to Google Cloud Platform using Cloud Run. Since my own front-end skills are very limited, my colleague Robin De Neef helped me out with the front-end development and did an excellent job at building a nice user interface for querying the API. That part, however, will not be covered in this blog post.

Scraping the data

To get started, we first need a solid database of question-answer pairs to serve as the “knowledge base” of our COVID assistant. There are lots of FAQs available on the internet that we can simply scrape and combine into one database. To this end, we selected a small set of trustworthy URLs that are kept up to date by official organizations (info-coronavirus.be, Voka, Agoria and VRT NWS). Since I work at the Ghent office of ML6, we focus on Belgian data sources that provide Belgium-specific information.

Web scraping can easily be done with the Python library Scrapy. All you need to do is define a “spider” containing the URLs you want to scrape as well as the logic for parsing the underlying HTML. The following code snippet shows how this is done for one of our data sources:

import json
import os

import scrapy


class FAQSpider(scrapy.Spider):
    name = "faqs"
    start_urls = [
        'https://www.info-coronavirus.be/nl/faq/'
    ]

    def parse(self, response):
        # Extract the HTML of every question and answer on the FAQ page
        questions = response.css('article').css('summary.faq_question').getall()
        answers = response.css('article').css('div.faq-answer-wrapper').getall()

        # Combine them into question-answer pairs, keeping track of the source URL
        faq_list = [{'question': q,
                     'answer': a,
                     'source': response.url}
                    for q, a in zip(questions, answers)]

        # Save the result as a JSON file in the local output directory
        filename = 'faq-info-coronavirus.json'
        with open(os.path.join('output', filename), 'w') as f:
            json.dump(faq_list, f)

The start_urls variable contains the URLs you want to scrape, and the parse method contains the logic for extracting the question-answer pairs from the resulting HTML and saving the result to a file on the local file system.

The code for running the spider looks like this:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner

# Schedule the spider and stop the Twisted reactor once the crawl has finished
runner = CrawlerRunner()
deferred = runner.crawl(FAQSpider)
deferred.addBoth(lambda _: reactor.stop())
reactor.run()

After running the spider for all of our data sources, we end up with around 500 question-answer pairs saved to local files in JSON format. Each of the JSON files contains a list of question-answer pairs along with their source URL.
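
As a small illustrative step (assuming the output directory and faq-*.json naming convention used by the spider above), the separate files can be combined into one in-memory FAQ list like this:

import glob
import json

# Combine all scraped JSON files into one list of FAQ entries
# (assumes the spiders wrote their output to the local 'output' directory)
faq_db = []
for path in glob.glob('output/faq-*.json'):
    with open(path) as f:
        faq_db.extend(json.load(f))

print(f"Loaded {len(faq_db)} question-answer pairs")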

Semantic Search using NLP

With our FAQ database ready, the next thing we need is a way to make it searchable. This is where NLP comes into play. Traditionally, search engines are based on keyword search, where you only get results matching the exact keywords you typed in (a popular solution for this is Elasticsearch). However, you are probably also interested in results containing synonyms of your keyword, or any related word. For example, in our COVID-19 use case, if you search with the keyword “travel” you probably also want to see results for “abroad”, “flight”, “airplane”, “train”, etc. Searching based on the meaning of a word instead of the exact word is known as semantic search, and it is a common problem in NLP.

In modern NLP, text is usually represented as (high-dimensional) mathematical vectors, also called embeddings, that capture the semantics of the text. One of the most popular techniques for creating such vector representations is Word2Vec, developed by Tomas Mikolov and colleagues in 2013, where word embeddings are learned from a huge corpus of raw text, for example all of Wikipedia. The idea is that words occurring in similar contexts should have similar meanings (i.e. similar embeddings).

When the resulting Word2Vec embeddings are projected down to 2 dimensions, words with similar meaning end up close to each other in the embedding space. This makes word embeddings very useful for semantic search: if we want to find words similar to “travel”, we simply look for the word embeddings lying closest to the embedding of “travel”.¹

Word embeddings are great but we typically want to capture the meaning of a whole sentence, not just separate words. Luckily, there are lots of NLP models out there that do just that: compute sentence embeddings. The easiest way of obtaining a sentence embedding is to simply take the average of all word embeddings of the words in the sentence. This simple baseline works surprisingly well but we can do better. A recent trend in NLP is to learn contextualized embeddings, where the whole sentence (“context”) is taken into account before computing its embedding. This trend is motivated by the fact that the same word can have different meanings depending on the context. For example, the word “bank” has very different meanings in the following two sentences:

  • Yesterday I went to the bank to withdraw some money.
  • Yesterday I went for a walk along the river bank.

The Universal Sentence Encoder developed by Google is a state-of-the-art model for computing contextualized sentence embeddings. The model even has a multilingual version, where you can input text in any of 16 different languages and get a sentence embedding that captures the meaning, independent of the language. This is perfect for our semantic search use case: even though our scraped FAQ database is in Dutch, a user can search with a question in, say, English and still get good results.

Even better, the multilingual Universal Sentence Encoder is readily available on TF Hub and can be loaded in a single line of Python code:

import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

Using the model is as easy as passing a list of texts to the model object we just loaded:

embeddings = model(text_list).numpy()

Voila, we now have a very easy way of computing state-of-the-art sentence embeddings that can be used for semantic search.
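
To illustrate what these embeddings give us across languages, here is a minimal sketch comparing sentences with cosine similarity (see the footnote below); the example sentences and the helper function are purely illustrative, not part of the production code:

import numpy as np

sentences = [
    "Mag ik nog naar het buitenland reizen?",  # Dutch: "Am I still allowed to travel abroad?"
    "Am I still allowed to travel abroad?",    # English: same question
    "Hoe was ik mijn handen correct?",         # Dutch: unrelated hand-washing question
]
embeddings = model(sentences).numpy()

def cosine_similarity(a, b):
    # Cosine similarity measures the angle between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The Dutch and English travel questions should score much higher against
# each other than against the unrelated hand-washing question
print(cosine_similarity(embeddings[0], embeddings[1]))
print(cosine_similarity(embeddings[0], embeddings[2]))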


[1] Since word embeddings are high-dimensional (typically around 300 dimensions), the Euclidean distance does not behave very well; this is known as the curse of dimensionality. For this reason, a different similarity measure is used instead: the so-called cosine similarity. Roughly speaking, cosine similarity measures the angle between two vectors rather than the distance between them.

Building the API

With our FAQ database and NLP model in place, it is time to put the pieces together in an efficient API that can serve user requests. The API is implemented with Zalando's Connexion framework, a Swagger/OpenAPI-first framework for Python built on top of Flask.
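
Wiring up such an app requires very little code. A rough sketch could look like this (the spec filename and port are assumptions, not necessarily what we used); the OpenAPI spec maps each route to a Python handler function via its operationId:

import connexion

# Create a Flask-based Connexion app and attach the OpenAPI specification,
# which maps GET /search to a Python handler via its operationId
app = connexion.App(__name__, specification_dir='.')
app.add_api('openapi.yaml')

if __name__ == '__main__':
    app.run(port=8080)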

The NLP model and our scraped FAQ database (located on Google Cloud Storage) are loaded into the API during startup, and we immediately compute sentence embeddings for all of the questions in our database, resulting in an “embedding space” that is used for semantic search. Note that we only embed the questions, not the answers, since we will only be searching for similar questions given a user input. Having the embedding space pre-computed during startup reduces latency when serving user requests. The startup process takes around 20 seconds in total (most of this time is spent loading the NLP model from TF Hub).
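
In code, this startup step might look roughly like the following sketch; the bucket and file names are placeholders, not our actual configuration:

import json

import tensorflow_hub as hub
from google.cloud import storage

def load_resources(bucket_name="covid-assistant-faqs"):
    # Download the scraped FAQ database from Cloud Storage
    client = storage.Client()
    blob = client.bucket(bucket_name).blob("faq_database.json")
    faq_db = json.loads(blob.download_as_text())

    # Load the multilingual Universal Sentence Encoder from TF Hub
    model = hub.load(
        "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

    # Pre-compute the embedding space over the FAQ questions only
    question_embeddings = model([faq['question'] for faq in faq_db]).numpy()
    return faq_db, model, question_embeddings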

The API has a single route for serving user requests (GET /search), with query parameters to specify the user query, the number of results to return, and the language of the resulting question-answer pairs. When a request comes in, the user query is embedded on the fly with our NLP model, a similarity score is calculated against every embedding in our embedding space (the FAQ database), and the top K question-answer pairs are returned. If a language parameter was specified (with a non-Dutch value), we also use the Google Translate API to translate the resulting question-answer pairs before returning them. Note that since our embedding space is quite small (only around 500 embeddings), we can do the similarity search in a brute-force way, simply computing all similarity scores and taking the argmax. If you have a bigger search space you might want to look into more efficient similarity search libraries such as faiss, annoy or even Elasticsearch. In our case, however, a full API call (excluding Google Translate) only takes around 20 ms, so there is no need for more efficient solutions.
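
Leaving out the translation step, the core of this route could look roughly like the sketch below, reusing the model, question_embeddings and faq_db objects from the startup sketch above (the handler name and parameters are illustrative):

import numpy as np

def search(query, max_results=3):
    # Embed the user query on the fly with the multilingual encoder
    query_embedding = model([query]).numpy()[0]

    # Brute-force cosine similarity against all pre-computed FAQ embeddings
    similarities = question_embeddings @ query_embedding / (
        np.linalg.norm(question_embeddings, axis=1)
        * np.linalg.norm(query_embedding))

    # Return the top K most similar question-answer pairs
    top_k = np.argsort(similarities)[::-1][:max_results]
    return [faq_db[i] for i in top_k]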

Deploying to GCP

Now that we have a working API, the final step of our project is to deploy it so that it can serve client requests via a publicly accessible HTTPS URL. At ML6 we are a Google Cloud partner, so naturally we deploy all of our workloads to Google Cloud Platform (GCP).

GCP has many options for deploying APIs but given that we only have a single stateless container, I chose to go with Cloud Run, which is a fully managed compute platform for deploying containerized applications. The nice thing about Cloud Run is that it handles all infrastructure management for you and automatically scales up and down depending on your traffic. It even scales down to 0 when your API is not receiving any traffic so that you don’t pay anything when your API is not being used.

To deploy our API with Cloud Run we first need to containerize it using Docker. This amounts to adding a Dockerfile (make sure to include an entrypoint so that Cloud Run knows how to run your code) and building the image with the docker build command.

Once we have built the Docker image, we push it to the Google Cloud Container Registry (eu.gcr.io/{project}/covid-assistant), where it can be picked up by Cloud Run. Deploying our API is then as simple as running a single command with the gcloud command line tool:

gcloud run deploy covid-assistant \
--image eu.gcr.io/{project}/covid-assistant \
--platform managed

After a couple of minutes the API is up and running. Cloud Run automatically generates a service URL which you can use to send requests to your API.

Now that the API is up and running, one question remains: how do we keep the FAQ database up to date over time? We solve this by running an automatic job every morning which re-scrapes all of our FAQ sources, uploads the result to Google Cloud Storage and reloads the new files into the API. To achieve this, we deploy our web scraping code as a function with Cloud Functions, which is then triggered every morning by Cloud Scheduler. Cloud Scheduler is also used to call a /refresh endpoint of our API, which reloads the FAQ database from Cloud Storage and re-computes the embedding space. Voila, our API is now automatically kept up to date on a fixed schedule every morning.
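
For reference, a stripped-down sketch of such a Cloud Function could look like this; the function name, bucket name and the run_spiders() helper are hypothetical placeholders, not our exact implementation:

import json

from google.cloud import storage

def scrape_faqs(request):
    # HTTP-triggered Cloud Function entry point.
    # run_spiders() is a hypothetical helper that runs all scrapy spiders
    # and returns the combined list of question-answer pairs.
    faq_db = run_spiders()

    # Upload the fresh FAQ database to Cloud Storage, where the API's
    # /refresh endpoint (triggered separately by Cloud Scheduler) picks it up
    client = storage.Client()
    blob = client.bucket("covid-assistant-faqs").blob("faq_database.json")
    blob.upload_from_string(json.dumps(faq_db), content_type="application/json")
    return "FAQ database updated"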

Conclusion

We have now seen how publicly available tools can easily be leveraged to develop and deploy a production-ready QA assistant. We described how to use Scrapy to scrape your favorite FAQs from the internet and how to use the multilingual Universal Sentence Encoder from TF Hub to perform embedding-based semantic search within the FAQ database. We then explained how to wrap this functionality in an API and deploy it to GCP with Cloud Run, giving us efficient autoscaling for free. Finally, we saw how to automate the web scraping on a periodic basis using Cloud Functions and Cloud Scheduler.

The COVID-19 pandemic is having a big impact on all of our lives. We all need to stay home to help flatten the curve, and clear communication to all members of our society is key to making that happen. With our multilingual COVID-19 assistant we hope to make information more accessible to a wide group of people, regardless of their native language. Go ahead and find the answer to all of your COVID-related questions at corona.ml6.eu.
