Matthias Feys
Q / CTO
In this blog post series, we are looking at the importance of data for building performant ML models. The focus of our first blog post (check it out here if you haven’t already) was on how to unlock the full potential of data, looking especially at data labelling, data quality and data augmentation.
What do we do, though, if we don't yet have any (or enough) usable data to get started? One obvious option is to collect (more) data. Today, however, we want to go beyond that and look at three other dimensions that can help unblock ML use cases: data protection, external data, and synthetic data.
How is data protection relevant to building performant ML models? Quite simply, data protection is often what makes it possible to use data in our models in the first place. Let's have a look at what we mean by this.
A first way in which data protection can unblock data for ML modelling is, of course, compliance with regulations such as GDPR. Especially in an NLP context, we often deal with personally identifiable information (PII). Without anonymisation or pseudonymisation of the data, we would not be able to use it at all, or at least not in the most impactful way.
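To make this concrete, below is a minimal pseudonymisation sketch in Python. It uses spaCy's off-the-shelf named entity recognition to mask person names, organisations and locations. The function name and placeholder format are our own choices, and a real anonymisation flow would combine this with rule-based detection (e-mail addresses, phone numbers, ...) or a dedicated PII tool; treat this purely as an illustration of the idea.

```python
# Minimal pseudonymisation sketch (not production-grade).
# Assumes the "en_core_web_sm" spaCy model is installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def pseudonymise(text: str) -> str:
    """Replace detected person names, organisations and locations with placeholders."""
    doc = nlp(text)
    redacted = text
    # Replace from the end of the string so character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "ORG", "GPE"}:
            redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
    return redacted

print(pseudonymise("John Smith from Ghent emailed ML6 about his invoice."))
# e.g. "[PERSON] from [GPE] emailed [ORG] about his invoice."
```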
Besides unlocking the project, data protection can also have a major impact on model quality. Consider an ML project that involves PII. Without anonymisation, we must delete the data as soon as it is no longer needed; depending on the application, this could be after as little as three months. By investing in a proper anonymisation flow, however, the data is no longer personally identifiable and can subsequently be stored for an indefinite period. The graph below shows the typical impact on model performance of training on more data points (represented by the period of time over which the data was collected).
Data protection is also needed to build trust and to guard against potential attacks. Ensuring that someone with malicious intent cannot figure out whether a particular person was part of the training data, and cannot link anonymised data with other data to re-identify people, is crucial for protecting privacy.
Protecting data can be done with various techniques. There is no "silver bullet": we often use different techniques in combination with each other. Two of the most common and simple methods are de-identification (data anonymisation) and k-anonymisation. The former means removing personal information from the dataset, e.g. by blurring faces in images. The latter revolves around anonymity in numbers: generalising attributes and grouping or suppressing outliers so that every record is indistinguishable from a minimal number of others, protecting individuals from inferences based on a small group size.
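As a toy illustration of k-anonymisation, the sketch below generalises two quasi-identifiers (an exact age becomes an age band, a zip code keeps only its prefix) and suppresses any group that is still smaller than k. The column names, the value of k and the generalisation rules are purely illustrative.

```python
import pandas as pd

def k_anonymise(df: pd.DataFrame, k: int) -> pd.DataFrame:
    """Generalise quasi-identifiers and drop groups smaller than k (toy version)."""
    out = df.copy()
    # Generalise: exact age becomes a 10-year band, zip code keeps only 2 digits.
    out["age"] = (out["age"] // 10 * 10).astype(str) + "s"
    out["zip"] = out["zip"].astype(str).str[:2] + "***"
    # Suppress records whose (age band, zip prefix) group is still smaller than k.
    group_sizes = out.groupby(["age", "zip"])["age"].transform("size")
    return out[group_sizes >= k]

people = pd.DataFrame({
    "age": [34, 36, 38, 71],
    "zip": [9000, 9050, 9040, 2000],
    "diagnosis": ["A", "B", "A", "C"],
})
print(k_anonymise(people, k=2))  # the single "70s"/"20***" record gets suppressed
```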
Other methods typically include some form of noise injection, meaning that we perturb or replace a minimal set of attributes of the data points to improve privacy. In all cases (most obviously for the noise injection methods), you will have to make a trade-off between data utility and data privacy. The actual selection and combination of methods will depend on your use case.
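Below is a minimal noise-injection sketch in the spirit of differential privacy: Laplace noise is added to a numeric attribute, and a parameter epsilon controls the privacy/utility trade-off (smaller epsilon means more noise, so more privacy but less utility). The numbers and parameter values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_noise(values: np.ndarray, sensitivity: float, epsilon: float) -> np.ndarray:
    """Add Laplace noise scaled by sensitivity / epsilon to each value."""
    scale = sensitivity / epsilon
    return values + rng.laplace(loc=0.0, scale=scale, size=values.shape)

salaries = np.array([42_000, 55_000, 61_000, 48_000], dtype=float)
print(laplace_noise(salaries, sensitivity=1_000, epsilon=0.5))  # noisier, more private
print(laplace_noise(salaries, sensitivity=1_000, epsilon=5.0))  # closer to the originals
```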
As the name suggests, external data is all data that has been collected from outside the organisation. If there is no relevant internal data available (e.g. for a completely new product), this might be the only option.
But even when you do have relevant internal data, external data might be a good investment, for instance when the internal data is too costly to clean or when you want to extend it: in quantity, with extra features (e.g. weather information), or in completeness.
There are various types of external data sources to consider, depending of course on the availability and the needs of the project:
Between these different types, as well as between different vendors, there are important trade-offs to consider. The three axes we typically score options on are price, quality and time investment.
Going a bit deeper into each axis:
Unfortunately, getting hold of external data does not mean we can readily use it. Keep in mind that external data often has to be cleaned, augmented and post-processed before it can be used in your ML models.
Within your evaluation and business value calculations on the project, think about how much extra engineering needs to be done on the external data, e.g. to improve data quality, combine multiple data sources or join the data with internal data.
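As a hypothetical illustration of that last point, the snippet below enriches internal sales records with an external weather feed. The column names and join keys are assumptions; in practice you would first have to align units, time zones and granularity between the two sources.

```python
import pandas as pd

# Internal data: daily sales per store (illustrative values).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-01", "2023-06-02"]),
    "store": ["Ghent", "Ghent"],
    "units_sold": [120, 95],
})

# External data: a (fictional) weather feed per city and day.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-01", "2023-06-02"]),
    "city": ["Ghent", "Ghent"],
    "avg_temp_c": [21.5, 17.0],
    "precip_mm": [0.0, 6.2],
})

# Join on date + location to add weather features to the internal records.
enriched = sales.merge(weather, left_on=["date", "store"], right_on=["date", "city"], how="left")
print(enriched[["date", "store", "units_sold", "avg_temp_c", "precip_mm"]])
```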
As mentioned in our first blog post, collecting data, and especially labelling it, is often a time-consuming and expensive task. ML practitioners are therefore looking at more efficient ways to generate usable data, from artificially expanding datasets by creating small variations on existing data points (data augmentation) to, increasingly, the use of hybrid or fully synthetic data.
Synthetic data has been in the spotlight for two main reasons: on the one hand, it can increase the amount of available data to train on; on the other hand, it can be a way to protect data. Ultimately, synthetic data can help us build more accurate, robust, fair and private models.
Although synthetic data can ultimately have many benefits, we typically see three use cases where it is already useful:
In a way, the techniques used for data augmentation and data pseudonymisation/anonymisation can be leveraged to create synthetic data. Typically, however, the idea is to create completely new samples that are even harder to link to the original dataset. We consider two big blocks of approaches:
At this point we need to add a caveat: synthetic data is a trending but still emerging field. This means that a lot of new frameworks pop up, some of which also get deprecated again.
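Independent of the specific framework, the core generative recipe stays the same: fit a model of the real data distribution and then sample new records from it. The sketch below does this with a plain Gaussian mixture from scikit-learn on made-up numeric data; dedicated frameworks use far more expressive models (copulas, GANs, language models), but the workflow is comparable.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Illustrative "real" data: two correlated numeric features (e.g. age and spend).
real = rng.multivariate_normal(mean=[40, 3_000], cov=[[25, 900], [900, 250_000]], size=1_000)

# Fit a generative model on the real data, then sample brand-new records from it.
gmm = GaussianMixture(n_components=3, random_state=0).fit(real)
synthetic, _ = gmm.sample(1_000)

print(real.mean(axis=0), synthetic.mean(axis=0))  # the means should be close
```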
Some frameworks that are certainly worth checking out:
To close off: creating new samples is typically not the hard part. Making sure the synthetic samples are useful and relevant is a lot harder, so make sure you have a good way of measuring the quality of your synthetic data.
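One simple sanity check (by no means a complete quality framework) is to compare the marginal distribution of each column in the real and synthetic data, for example with a two-sample Kolmogorov-Smirnov test. The data and column names below are made up for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Illustrative stand-ins for a real dataset and its synthetic counterpart.
real = rng.normal(loc=[40, 3_000], scale=[5, 500], size=(1_000, 2))
synthetic = rng.normal(loc=[41, 3_100], scale=[5, 520], size=(1_000, 2))

# Compare each column's marginal distribution; large statistics / small p-values
# indicate the synthetic data drifts away from the real data for that column.
for col, name in enumerate(["age", "spend"]):
    stat, p_value = ks_2samp(real[:, col], synthetic[:, col])
    print(f"{name}: KS statistic={stat:.3f}, p-value={p_value:.3f}")
```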
In this blog post, we have focused on how to unblock ML use cases that lack usable data. We have shown how combining various privacy techniques can help protect against attacks and make it possible to use otherwise personal or confidential data. Next, we discussed the options and trade-offs of including external data to expand or build our dataset. Lastly, we took a closer look at synthetic data, an approach that still has to prove itself further but promises to increase the size of our dataset while protecting privacy.