Blog, Technical Study

A practical guide to fine-tuning EfficientDet for object detection

03 November 2020, 15:11

By: Jason Li, Jan Van Looy

Category: AI, Computer Vision, Machine Learning, ML6, Object Detection

Retraining EfficientDet for High-Accuracy Object Detection

A practical guide to fine-tuning EfficientDet for transfer learning on a custom dataset


Many computer vision projects today revolve around identifying specific objects in pictures or videos and then performing certain operations on them, such as classification or segmentation. Typically, a labelled dataset is created by domain experts, and machine learning engineers then train, or rather retrain, an object detection model on this dataset (applying transfer learning) so that it mimics the experts’ knowledge. When your use case primarily requires high speed, such as in video processing, one of the YOLO variants is probably the go-to model today; but when your use case requires high accuracy with manageable size and speed, EfficientDet should probably be your first choice.

This blog post explains and demonstrates how to optimize this process when aiming for high accuracy using Google’s lightweight EfficientDet object detection model. (Want to jump directly to the code? ➡ Colab Notebook)

EfficientDet: A lightweight, high-accuracy object detection model

 

EfficientDet is an object detection model that was published by the Google Brain team in March 2020. It achieves a state-of-the-art 53.7% COCO average precision (AP) with fewer parameters and FLOPs than previous detectors such as Mask R-CNN. It comes in 8 base variants, D0 to D7, with increasing size and accuracy. Moreover, an extra-large version, D7x, was released recently, which achieves 55.1% AP.
Variant EfficientDet Models.

 

Choosing an EfficientDet model involves a trade-off between accuracy and performance, which depends on the use case. For example, if the model needs to be deployed on an edge device, then a small variant should be used. Similarly, if the goal is to use the model for real-time video analysis, a smaller variant would be preferred because it provides higher frames per second (FPS). On the other hand, larger variants are suitable for those tasks that tolerate longer inference time, e.g. one-time analysis of static images, as they show better results in terms of accuracy.

In this blog post, we provide a practical guide to fine-tuning EfficientDet on a custom dataset. As an example use case we will use license plate detection, which is relevant for both surveillance and anonymization applications. We will focus on the following:

  • Convert the custom dataset into the required TFRecord format
  • Explain the usage and effect of the most important hyperparameters and give some tips
  • Describe how to fine-tune EfficientDet using GPU
  • Benchmark on the test set

Getting the model

The original implementation of EfficientDet is open-sourced in the Google AutoML repository on GitHub. The repository is still actively being maintained, thus the implementation details in the Colab notebook may change somewhat as time goes by.

Note that EfficientDet is also integrated into the TensorFlow Object Detection API v2. In this tutorial, however, we will focus on the original implementation, as it is a pure implementation that contains the latest features.

Choosing and preparing the dataset
The Chinese City Parking Dataset (CCPD) is a comprehensive license plate dataset containing over 250k unique car images with carefully annotated bounding boxes around the license plates. The images in the dataset represent real-world conditions featuring distortions such as rotation, snow or fog, uneven illumination and vagueness. This makes inference on this dataset a challenging proposition requiring a powerful, high-precision object detection model.


Example images and labels in CCPD2019 dataset.

 

Before training the model, we first need to preprocess the dataset. CCPD is already split into a training, validation and test set, each described by a text file. The labels are encoded in the filenames of the images, from which the bounding box coordinates and other information can be extracted by string parsing, e.g. using regular expressions. Finally, we need to convert the raw image files and labels into the TFRecord format, which is required by EfficientDet.
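As a sketch of the string parsing step (the field layout follows the CCPD filename convention, where the third '-'-separated field encodes the two box corners; verify this against the dataset's README), the bounding box can be extracted like this:

```python
def parse_ccpd_bbox(filename):
    """Extract the license-plate bounding box from a CCPD image filename.

    CCPD encodes the annotation in the filename itself; the third
    '-'-separated field holds the box as 'x1&y1_x2&y2', i.e. the
    top-left and bottom-right corners in pixels.
    """
    stem = filename.rsplit(".", 1)[0]            # drop the '.jpg' extension
    bbox_field = stem.split("-")[2]              # e.g. '154&383_386&473'
    corners = [tuple(map(int, corner.split("&"))) for corner in bbox_field.split("_")]
    (x1, y1), (x2, y2) = corners
    return x1, y1, x2, y2

print(parse_ccpd_bbox(
    "025-95_113-154&383_386&473-386&473_177&454_154&383_363&402-0_0_22_27_27_33_16-37-15.jpg"
))  # → (154, 383, 386, 473)
```

From here, the image bytes and these coordinates can be serialized into TFRecord examples.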

Hyperparameter tuning

EfficientDet provides a set of hyperparameters which allow for the alteration of both the network architecture and the training strategy. These should be fine-tuned in accordance with the training data, so that the model can fit the data better which usually results in more efficient training and better results.

The default hyperparameter settings can be found in the configuration file hparams_config.py. They are wrapped by the default_detection_configs function. However, while these editable hyperparameters provide a lot of flexibility, they are not always documented in detail, and the usage or scope of some of them can be confusing. Hence we give a brief description of the most important hyperparameters and how to fine-tune them effectively.

image_size:

Can be set as an integer, e.g. 640, which stands for 640×640. A rectangular resolution can be specified by a string such as “640x320”. However, in our experience, a rectangular resolution results in a lower mAP than a square resolution with otherwise identical settings.

In order to ensure the correct sizes of the backbone’s convolutional layers, it is advised to set both the width and height of the image to be a multiple of 16.

input_rand_hflip:

EfficientDet supports a number of image preprocessing operations. By default the input image has a 50% chance of being flipped horizontally, which acts as a basic image augmentation technique.

jitter_min/max:

These two hyperparameters define the range of the random scale factor used in the scaling preprocessing, another basic image augmentation technique. By default they are set to 0.1 and 2.0 respectively, meaning each input image is scaled by a random factor in the range [0.1, 2.0]. If the scaled image is larger than the required input size, a random region of the required size is cropped; if it is smaller, the image is padded with zeros to match the required input size.
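The crop/pad behaviour described above can be sketched as follows. This is a simplified numpy illustration, not the repository's actual implementation (which uses TF ops and crops a random rather than the top-left region):

```python
import numpy as np

def jitter_resize(image, target, jitter_min=0.1, jitter_max=2.0, rng=None):
    """Simplified scale jitter: resize by a random factor, then crop or
    zero-pad back to a square target size (top-left corner for brevity)."""
    if rng is None:
        rng = np.random.default_rng()
    scale = rng.uniform(jitter_min, jitter_max)
    h, w = image.shape[:2]
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    # Nearest-neighbour resize via index sampling keeps the sketch dependency-free.
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    resized = image[rows][:, cols]
    out = np.zeros((target, target) + image.shape[2:], dtype=image.dtype)
    ch, cw = min(target, new_h), min(target, new_w)
    out[:ch, :cw] = resized[:ch, :cw]  # crop if larger, zero-pad if smaller
    return out
```

The output always has the target resolution regardless of the sampled scale factor.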

use_augmix:

EfficientDet supports many data augmentation strategies, as described in this paper. This hyperparameter can be set to the string “v0”, “v1”, “v2” or “v3” to select different compound data augmentations. More details are explained in aug/autoaugment.py.

num_classes:

Number of classes. It should be set to the number of object types in your dataset. In our case there is only one type of object to detect: license plates. We should therefore set this hyperparameter to 1. Notice that in the current implementation the background class is implicitly reserved. This may change in the future, so you should pay attention to it as it directly affects the model outputs.

label_map:

The mapping between the class as an integer and its label in string format. In our case we can set it to a one item dictionary “{1: license_plate}” in the YAML configuration file. The label “license_plate” will then be plotted when visualizing the detection results.

aspect_ratios:

EfficientDet is an anchor-based detector, so the anchor settings are vital to model training. This hyperparameter defines a list of floats, each representing an aspect ratio (w/h) of the anchor boxes. By default there are 3 aspect ratios: 1.0, 2.0 and 0.5. However, when we retrain EfficientDet on a custom dataset, we should adapt them to our data. The aspect ratios should reflect the average shape of the objects in the dataset, so that the model can easily fit the anchor boxes to the ground truth bounding boxes. If the anchors’ aspect ratios deviate too much from the actual objects’ aspect ratios, the detector becomes harder to train because it has to rely on the regression subnet to fix the coordinates of the resulting bounding boxes.

The K-means clustering algorithm is generally used to find the average shape of objects in the custom dataset. For more details, you may refer to this repository.
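For instance, a minimal 1-D k-means over the ground-truth aspect ratios could look like this (a dependency-light sketch, not the linked repository's code):

```python
import numpy as np

def kmeans_aspect_ratios(widths, heights, k=3, iters=100, seed=0):
    """Cluster ground-truth box aspect ratios (w/h) with plain 1-D k-means
    to suggest values for the aspect_ratios hyperparameter."""
    ratios = np.asarray(widths, dtype=float) / np.asarray(heights, dtype=float)
    rng = np.random.default_rng(seed)
    centers = rng.choice(ratios, size=k, replace=False)  # init from data points
    for _ in range(iters):
        # Assign each ratio to its nearest center, then recompute the means.
        assign = np.abs(ratios[:, None] - centers[None, :]).argmin(axis=1)
        new_centers = np.array([ratios[assign == j].mean() if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return np.sort(centers)
```

The sorted cluster centers can then be pasted into the YAML configuration as aspect_ratios. For license plates one would expect the dominant cluster to sit well above 1.0, since plates are much wider than they are tall.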

num_scales:

Each anchor box can also have multiple scales. This hyperparameter is set to 3 by default which can often be left unchanged. You may change it if you want to have anchor boxes with more fine-grained scales, e.g. when you have large input images.

Assuming there are 3 aspect ratios and 3 scales, then there are in total 3×3=9 anchor boxes at each anchor position in the image.

anchor_scale:

This is a general scale factor for all the anchor boxes. It’s set to 4.0 by default which can normally be left unchanged. You may decrease/increase it when you’re facing mainly small/large objects in your dataset, so that all the anchor boxes are scaled down/up, which makes it easier to fit the actual objects.
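To make the interplay of aspect_ratios, num_scales and anchor_scale concrete, here is a sketch of how the anchor (width, height) pairs at one feature level are typically derived in RetinaNet-style detectors such as EfficientDet. This is an illustration of the common scheme, not code taken from the repository; check its anchor-generation module for the authoritative logic:

```python
import math

def anchor_sizes(stride, anchor_scale=4.0, num_scales=3,
                 aspect_ratios=(1.0, 2.0, 0.5)):
    """Derive (w, h) anchor sizes at one feature level: the base size is
    anchor_scale * stride, subdivided into num_scales octave steps, and
    each aspect ratio reshapes the box while preserving its area."""
    boxes = []
    for i in range(num_scales):
        octave = 2 ** (i / num_scales)        # intermediate scales within an octave
        base = anchor_scale * stride * octave
        for ar in aspect_ratios:
            w = base * math.sqrt(ar)
            h = base / math.sqrt(ar)
            boxes.append((round(w, 1), round(h, 1)))
    return boxes

# At a stride-8 feature level: 3 scales × 3 ratios = 9 anchors per position.
print(anchor_sizes(8))
```

Decreasing anchor_scale shrinks every base size proportionally, which is why it helps on datasets dominated by small objects.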

is_training_bn:

EfficientDet contains batch normalization layers, whose mean and variance statistics differ between training and inference. In order to use the correct statistics, this hyperparameter is automatically set to True during training and to False during inference; there is no need to create two configuration files to specify it separately.

heads:

It can be set to ['object_detection'] to create an object detection model, or to ['segmentation'] to create a segmentation model. Note that the original EfficientDet paper only covers object detection; the segmentation functionality was added recently.

strategy:

Set to None to use the default TensorFlow training strategy (CPU or a single GPU). It can also be set to “gpus” for multi-GPU training or to “tpu” to use TPUs.

mixed_precision:

Set to False to always use FP32 precision, or to True to use a mix of FP16 and FP32. Using FP16 loses some numerical precision but speeds up training, and it can save a significant amount of GPU memory so that a larger model fits. However, based on our experience with NVIDIA Tesla P100 and NVIDIA RTX 2080Ti GPUs, using mixed precision resulted in very slow training; the reason could be an incompatibility between some GPUs and EfficientDet’s implementation.

var_freeze_expr:

By default, training updates all layers of EfficientDet. However, sometimes freezing some layers can yield better results. This hyperparameter is a regular expression that indicates which layers of the network should be frozen. We can use the following script to print out all layer names in a checkpoint file:

from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file

print_tensors_in_checkpoint_file(file_name=CKPT_PATH, tensor_name='', all_tensors=False)

Based on the printed layer names, we can set this hyperparameter to “(efficientnet)” to freeze the backbone, or to “(efficientnet|fpn_cells|resample_p6)” so that only class_net and box_net are fine-tuned.
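The repository applies this expression to each variable name with re.match, which anchors the pattern at the start of the name. A quick illustration with hypothetical layer names (real names come from inspecting your own checkpoint):

```python
import re

# Illustrative variable names only; inspect your checkpoint for the real ones.
layer_names = [
    "efficientnet-b1/stem/conv2d/kernel",
    "fpn_cells/cell_0/fnode0/op_after_combine5/conv/kernel",
    "resample_p6/conv2d/kernel",
    "class_net/class-predict/pointwise_kernel",
    "box_net/box-predict/pointwise_kernel",
]

var_freeze_expr = "(efficientnet|fpn_cells|resample_p6)"
frozen = [name for name in layer_names if re.match(var_freeze_expr, name)]
trainable = [name for name in layer_names if not re.match(var_freeze_expr, name)]
print(frozen)     # backbone, BiFPN and resampling layers are frozen
print(trainable)  # only class_net and box_net remain trainable
```

Because re.match anchors at the start, a pattern like “fpn_cells” will not accidentally freeze a layer that merely contains that substring somewhere in the middle of its name.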

Modify the hyperparameters

If you want to modify these hyperparameters, you can create a YAML file and pass its path to the training script. Then your custom setting will override the default settings in hparams_config.py.

In our use case, we create the following YAML file to specify the image resolution, class information and basic data augmentation parameters:

image_size: 1024x1024
num_classes: 1
label_map: {1: license_plate}
input_rand_hflip: true
jitter_min: 0.8
jitter_max: 1.2

Pre-trained weights

Since we want to fine-tune EfficientDet, we first need to download the weights resulting from pretraining on the COCO dataset. The pre-trained weights of all variants are provided in the Google AutoML repository. You can download the appropriate model and decompress it into the automl/efficientdet directory (where you can find main.py).

In our case, we download a lightweight variant, EfficientDet D1, which has 6.6M parameters and whose pretrained weights are 46.6MB in size. This should be sufficient to achieve a relatively high mAP on the CCPD dataset.

Training

When the TFRecords, hyperparameter configuration and pre-trained model are all ready, we can start training by running main.py. There are several additional arguments that we can pass to main.py. In our case, we fine-tune the pre-trained EfficientDet D1 for 1 epoch of 5000 steps. We set the batch size to 2 so that 11GB of GPU memory is sufficient. Since the total number of steps is 5000 and the batch size is 2, a total of 10,000 examples from the training set are used.
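A training invocation could look like the following sketch. The flag names reflect the AutoML repository at the time of writing and the paths and YAML file are placeholders, so check `python main.py --help` for the current set:

```shell
# Sketch only: flag names from the AutoML repo at the time of writing;
# paths, file patterns and config.yaml are placeholders.
python main.py \
  --mode=train \
  --model_name=efficientdet-d1 \
  --ckpt=efficientdet-d1 \
  --train_file_pattern=tfrecord/train*.tfrecord \
  --model_dir=/tmp/efficientdet-d1-finetune \
  --train_batch_size=2 \
  --num_epochs=1 \
  --num_examples_per_epoch=10000 \
  --hparams=config.yaml
```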

During training, model checkpoints are automatically saved in a temporary directory. When we run the training script, it first tries to resume from the latest checkpoint in this temporary directory; if none is found, it starts from the original pre-trained checkpoint that we put in the efficientdet folder. Therefore, make sure to delete this temporary folder whenever you intend to retrain from the original pre-trained model, and in particular after changing hyperparameters, to avoid strange errors.

The training should finish after about one hour on one GPU and should result in around 74% mAP, 100% mAP50 and 94% mAP75 on the validation set, which is far from bad given the small training set and short training time. The formal evaluation results are as follows:

(Figure: formal evaluation results.)

Exporting and inference

The retrained model is saved as a checkpoint file, which stores the model weights but not the model architecture. In order to run inference with our fine-tuned model, we first need to export it to the SavedModel (.pb) format. Note that the pre-processing and post-processing operations are also integrated into the exported model, so there are some parameters you should specify during export, e.g. the minimum score threshold used to filter out low-confidence bounding boxes.
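As a sketch, the export can be driven by the repository's model_inspect.py script. The script name and flags reflect the AutoML repository at the time of writing and all paths are placeholders; verify with `python model_inspect.py --help`:

```shell
# Sketch only: flags from the AutoML repo at the time of writing; paths are placeholders.
python model_inspect.py \
  --runmode=saved_model \
  --model_name=efficientdet-d1 \
  --ckpt_path=/tmp/efficientdet-d1-finetune \
  --saved_model_dir=/tmp/efficientdet-d1-savedmodel \
  --min_score_thresh=0.35 \
  --hparams=config.yaml
```

The min_score_thresh value here is an example; tune it to trade recall against false positives for your use case.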

Here are some result images when we run inference on the “ccpd_fn” subset:


Conclusion

EfficientDet is a powerful yet lightweight object detector that is relatively easy to retrain on custom datasets. Its variants cover most use cases in terms of both speed and accuracy. Although EfficientDet is in general slower than YOLOv4, it offers higher accuracy and is rather easy to get started with. The Google AutoML implementation of EfficientDet supports a variety of hyperparameter settings that allow you to easily fine-tune the model. Hence, whenever you face an object detection problem, EfficientDet is worth considering. We hope this tutorial helps you get EfficientDet working on your own dataset. (Our code can be found in this Colab Notebook.)