31 October 2019, 16:56
During my six-week summer internship at ML6, I did some research on Automated Machine Learning. In this blog post I will give a quick overview of the different approaches with their pros and cons, and then go deeper into my favourite one: Auto-Keras.
Deep learning is present in all aspects of our lives and has given us great convenience in day-to-day tasks. Unfortunately, it still takes expert knowledge to create a model that is robust and performs excellently at a given task. To make machine learning more accessible to the masses, research into Automated Machine Learning started. The goal is a program that builds its own neural network, optimized for a certain task, without any human intervention in the process. In fact, the first paper on Self-Organizing Neural Networks was published in 1988. Automated Machine Learning has only recently become a hot topic because of the huge steps forward in both the performance achieved and the computational power needed.
Many AI companies offer tools for this, such as DataRobot or Google Cloud AutoML. However, a lot of recent open-source and/or academic research has been done in the field, giving a wide variety of available methods to choose from.
Mathematically speaking, we are looking for the best function to map a series of input values (say images) onto the right output values (say cat vs dog). Since deep learning models are known to be universal function approximators, we can choose to regard neural networks as a subspace in the whole abstract model space. In fact, there are also methods available that decide among the more traditional machine learning tools. Auto-sklearn is one of those that I found to work great, but that won’t be today’s topic, since we will be focusing on neural nets.
An incredible amount of research has been done on the topic of Neural Architecture Search (NAS). There have been a lot of ideas to automatically design artificial neural networks. The most important differences are in the strategy of the two main steps:
1. How to search for a new model
2. How to derive its optimal performance in an efficient way
In practice, step (1) comes down to finding the best next mutation of your network, one step at a time. Step (2) was originally approached by training the weights of every candidate model from scratch and comparing the accuracy.
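To make these two steps concrete, here is a minimal sketch of such a search loop. Everything in it is a toy assumption of mine: an "architecture" is just a list of layer widths, `mutate` and `evaluate` are hypothetical helpers, and the score is a made-up formula standing in for "train from scratch and measure accuracy". It is not how any of the methods below are actually implemented.

```python
import random

def mutate(arch, rng):
    """Step 1: propose a neighbouring architecture via one small change."""
    child = arch[:]
    if rng.choice(["widen", "deepen"]) == "widen":
        i = rng.randrange(len(child))
        child[i] *= 2                  # double the width of one layer
    else:
        child.append(child[-1])        # add a layer as wide as the last one
    return child

def evaluate(arch):
    """Step 2: toy stand-in for training from scratch and scoring the model."""
    return -abs(sum(arch) - 96) / 96   # made-up score, best when widths sum to 96

rng = random.Random(0)
best = [8]
best_score = evaluate(best)
history = [best_score]

for _ in range(30):
    child = mutate(best, rng)
    score = evaluate(child)            # the expensive part in a real search
    if score > best_score:             # keep the child only if it is better
        best, best_score = child, score
    history.append(best_score)

print(best, best_score)
```

In a real search, every call to `evaluate` is a full training run, which is exactly why the early approaches were so expensive.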
The first benchmark results were set by Google in 2016 using reinforcement learning (RL). A recurrent neural network (RNN) was trained to automatically generate child neural network architectures, and the accuracy of each child was used to update the RNN controller. This work was extraordinary as it achieved state-of-the-art results, but it had one big drawback: finding the optimal architecture on CIFAR10 took 28 days on 800 K40 GPUs.
That took away the benefits for non-expert researchers and even companies. Improvements were made for more efficient search using more advanced RL techniques. MetaQNN, for example, uses Q-learning and thereby reduces the training time to 10 days on 10 GPUs. The breakthrough improvement was made by ENAS. Instead of training every model from scratch, it lets all child architectures within a predefined subspace share parameters. This tremendously decreased model search and training time to only 10 hours on 1 GPU, while maintaining the accuracy!
Besides reinforcement learning, other methods were investigated as well. Very promising research results have been obtained using evolutionary algorithms based on biological evolution, gradient descent to find extrema in the search space, and random search with a specially designed architecture representation. All of those proved effective given the improvements made over the years.
After comparing computational complexity, performance and ease of implementation, we decided to work with Auto-Keras. It is based on Bayesian Optimization: a mathematical tool to find the extremum of a black-box function without calculating derivatives. In our case we want to find the maximum performance as a function of the model’s parameters. Instead of derivatives, it uses the points queried so far together with a decision function (often called an acquisition function) to determine the next query point. With the result of each query, the algorithm refines a probabilistic model of the function and of how far each point is likely to be from the extremum. Because they explicitly trade off exploration against exploitation, Bayesian methods are very well suited for functions that are expensive to evaluate.
In this case, the Bayesian method consists of 3 stages:
1. Generate the next architecture using the decision function
2. Train the generated architecture and observe the performance
3. Update the learned underlying probability distribution
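The three stages above can be sketched on a toy one-dimensional problem. This is a minimal illustration under my own assumptions, not the Auto-Keras implementation: the surrogate is a small Gaussian process with an RBF kernel, the decision function is an upper confidence bound, and `black_box` is a made-up function standing in for "train the architecture and report its score". All names and constants here are hypothetical.

```python
import math

def rbf(a, b, length=0.15):
    """Squared-exponential kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def posterior(xs, ys, x_new, noise=1e-6):
    """Gaussian-process posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, math.sqrt(max(var, 0.0))

def ucb(x, xs, ys, kappa=2.0):
    """Decision function: optimistic estimate = mean + kappa * uncertainty."""
    mean, std = posterior(xs, ys, x)
    return mean + kappa * std

def black_box(x):
    """Toy stand-in for an expensive evaluation; its maximum is at x = 0.7."""
    return -(x - 0.7) ** 2

candidates = [i / 100 for i in range(101)]
xs = [0.0, 1.0]
ys = [black_box(x) for x in xs]

for _ in range(8):
    # Stage 1: the decision function picks the next query point.
    x_next = max((c for c in candidates if c not in xs),
                 key=lambda c: ucb(c, xs, ys))
    # Stage 2: run the expensive evaluation at that point.
    y_next = black_box(x_next)
    # Stage 3: update the observations the probabilistic model is built on.
    xs.append(x_next)
    ys.append(y_next)

best_x = xs[ys.index(max(ys))]
print(best_x, max(ys))
```

The `kappa` parameter is where the exploration versus exploitation trade-off lives in this sketch: a larger value favours uncertain regions, a smaller one favours points near the best result so far.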
As already hinted, most time can be saved in step 2, because you don’t want to train every model from scratch. That is what made the training times explode in the first attempts. To resolve that, the authors used graph-level morphism: they morph a parent network into a child network in such a way that the child still achieves comparable performance right after the mutation. The child network then continues training from there. Layer-level morphism had already been explored previously. However, because of the non-linearities in the network, a change to a single layer can have a great impact on the whole network. Auto-Keras improves on this by systematically finding and morphing all layers that are influenced by the single-layer mutation.
In the next part, I will explain how to use Auto-Keras in practice. For a more detailed approach, check out the Jupyter notebook at the end of this blog post.
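To give a feeling for what a function-preserving morph looks like, here is a minimal sketch of one simple operation: widening a hidden layer by duplicating a unit. It is a toy example on a tiny dense network written with plain lists, and `widen` is a hypothetical helper of mine. It only illustrates the idea that a child can start from the parent's behaviour; it is not the graph-level morphism code used in Auto-Keras.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def dense(W, b, x):
    """Dense layer; W holds one row of weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(net, x):
    W1, b1, W2, b2 = net
    return dense(W2, b2, relu(dense(W1, b1, x)))

def widen(net, unit):
    """Return a wider child network that computes the same function.

    The hidden unit `unit` is duplicated, and its outgoing weight is split
    in half between the two copies, so the next layer receives the same sum.
    """
    W1, b1, W2, b2 = net
    W1_child = [row[:] for row in W1] + [W1[unit][:]]
    b1_child = b1[:] + [b1[unit]]
    W2_child = []
    for row in W2:
        half = row[unit] / 2.0
        new_row = row[:]
        new_row[unit] = half
        W2_child.append(new_row + [half])
    return W1_child, b1_child, W2_child, b2[:]

parent = (
    [[0.5, -1.0], [1.5, 2.0]],   # W1: 2 inputs -> 2 hidden units
    [0.1, -0.3],                 # b1
    [[1.0, -2.0]],               # W2: 2 hidden units -> 1 output
    [0.25],                      # b2
)
child = widen(parent, unit=1)    # child has 3 hidden units

x = [0.8, 0.4]
print(forward(parent, x), forward(child, x))
```

Both networks give the same output for any input, so the child does not lose what the parent has learned and only needs further training to make use of its extra capacity.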
After importing the images and labels, as well as the Auto-Keras library with all the math tools as described above, you only need this code to automatically train a decent model:
clf = ImageClassifier()
clf.fit(x_train, y_train, time_limit=1 * 60 * 60)
clf.final_fit(x_train, y_train, x_test, y_test)
y = clf.evaluate(x_test, y_test)
Luckily, we can extend this code to extract some quite interesting results. In this example, we will work with 4 CPU cores and 1 GPU. Let me guide you through.
1. Import libraries and define your arrays
We train on a real-life dataset in this tutorial. It is a database of labelled pictures of microgreens: young vegetable greens that are loaded with nutrients.
We simply import the image classifier from Auto-Keras and we are ready to go. We use 2486 images of 10 different microgreen categories, and divide our data into train and test sets after shuffling everything.
from autokeras.image.image_supervised import ImageClassifier
from autokeras.image.image_supervised import load_image_dataset
import sklearn.model_selection
x, y = load_image_dataset(csv_file_path=CSV_MICROGREENS,
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, random_state=42)
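As a quick sanity check on the split: the call above passes only random_state, so scikit-learn falls back to its default of holding out 25% of the samples, rounded up. The sketch below just redoes that arithmetic for our 2486 images; the variable names are mine, and it assumes that default applies to the scikit-learn version you are using.

```python
import math

n_images = 2486        # total number of labelled images (from the text)
test_fraction = 0.25   # scikit-learn default when no test_size is given

n_test = math.ceil(test_fraction * n_images)   # test size is rounded up
n_train = n_images - n_test

print(n_train, n_test)  # prints 1864 622
```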
2. Search for a classifier model
To initiate the search, we define our classifier. The only information we pass along is where to save the models it finds and the maximum time the program may spend looking for models; we chose 2 hours.
clf = ImageClassifier(path="automodels/", verbose=True)
clf.fit(x_train, y_train, time_limit=2 * 60 * 60)
As an option of the fit command, we can also let Auto-Keras split our data into train and test sets itself. However, I prefer to be in control of this, so it is not included here.
The Neural Architecture Search algorithm has now started. The Auto-Keras API receives the call, preprocesses the dataset for us (performing both normalization and augmentation) and passes it to the Searcher. The algorithm that searches for new models runs on the CPU while the model training is done in parallel on the GPU. A GPU is therefore recommended, and having multiple CPU cores helps a lot. The current neural architecture is kept in RAM for fast access. The Graph builds our new model into a real network and copies it to the GPU. After training, the performance is fed back to the Searcher, which updates the probability distribution it has learned.
In the folder we specified, all the trained models are stored, as well as the logs that show the different mutations it performed. The program was designed to adapt to different GPU memory sizes: it estimates the model size itself and only trains models that don’t exceed the size limit. This is visible in the output:
When training, the Auto-Keras image classifier starts with a simple model: 3 convolutional blocks containing (ReLU-BatchNorm-Conv2D-Pooling) followed by (GlobalPooling-Dropout-Dense-ReLU-Dense-Softmax), and mutates from that.
In my experience, after a 2-hour search the model gets very deep and quite homogeneous. In this case, we finish with a deep neural network that looks like this:
I also let Auto-Keras train for a lot longer than this. Although the gains in performance are negligible, it is still interesting to see that the resulting model gets much denser and more complex:
A common defect in Automated Machine Learning schemes is that they only grow the architecture size by adding blocks, but they don’t shrink their model. In Auto-Keras however, they explicitly coded the decision function of the Bayesian Optimization as a tree-structured search that not only expands the leaves, but also optimizes the core nodes. They do so by carefully crafting a balance between exploration and exploitation.
3. Final Fit
Once convergence or the time limit is reached, the Bayesian Optimization stops and saves the best model. In our case, the time limit was reached and the program saved model 14 as the best one, because model 15 had not yet been trained to beyond that performance. Once Auto-Keras has figured out the best structure, we continue training our best model until convergence using the final_fit command. This time, it trains on a little more data by including the validation set. You can set the retrain boolean to True or False: respectively, reinitializing the weights and starting again on the slightly larger dataset, or keeping the learned weights and continuing to train. Retraining takes a long time, so if you want a fast proof-of-concept result, it can be a lot quicker to set it to False. The best results, however, are of course found by letting it retrain all weights from scratch.
clf.final_fit(x_train, y_train, x_test, y_test, retrain=False)
result = clf.evaluate(x_test, y_test)
print('The resulting accuracy is ' + str(result))
As a result, we got an accuracy of 0.9775 (97.75%) on our test set after a total training time of 3h28 (2h fit + 1h28 final fit).
4. Export the model
An Auto-Keras model cannot be exported as a Keras model. Since it also includes preprocessing, we can only use the model in an Auto-Keras environment during visualization or inference.
Comparing Google’s AutoML with Auto-Keras is comparing apples and oranges. Google AutoML is popular because of its easy-to-use UI and good results, but open-source packages such as Auto-Keras pose a real threat. This is clear when comparing our results. However, we only looked at one example that is very well suited to the Auto-Keras framework. A broader investigation could be done in the future, because we don’t know what the Google service has up its sleeve.
If used in the right way, open-source automated machine learning packages can make a big difference. Hopefully this will keep providing a push to keep AI research open-source and well documented. That openness is one of the key factors in the success that this quickly evolving field has known so far.
Of course not! Even better, with tools like these we can skip some of the repetitive and boring work. Auto-Keras seems great for simple tasks, for proof-of-concept results and for data investigation. That way, we can focus on the more important tasks, such as translating a real-world problem into a machine learning problem and working out which information our data needs to contain to get reliable results. The tweaking of our model can then be left to an Automated Machine Learning infrastructure.
We used computational power to find the optimal point in a model space that we defined ourselves. Please don’t use Auto-Keras as just a cheap black-box tool. Understanding which problems it fits and why they can be automated can give you a real head start in machine learning problems. Real breakthroughs, however, such as the idea of using convolutional and residual blocks in a network, still come from human research; so far, Automated Machine Learning has only confirmed that they work, it has not (yet) invented them.
If you decide to implement Auto-Keras for your next project, let me give you some final tips from my own experience: