Making Our Office Safer by Detecting (Missing) Face Masks

04 May 2020, 09:41

As the world goes through the Corona crisis, more and more applications are being created to help fight it. At ML6, we wanted to use our expertise in Machine Learning to contribute to that effort. One way to minimize the risk of spreading the virus is to wear protective face masks, and many governments are already asking their citizens to wear them in public places. We wanted to show that computer vision can help make sure that everybody stays safe. So we built an app that takes a camera feed and detects whether anybody in the video is not wearing a protective mask.

We set up a Raspberry Pi with a camera at the entrance of our office. When a person wears a mouth mask correctly, nothing happens. But when someone enters our office without one, the Raspberry Pi plays a sound alert asking the person to wear a mask. This can help many offices monitor whether safety measures are being respected and remind people of the rules. Here is a video of it in action: (Please be aware of how to wear a mask: both your mouth and nose need to be covered).

Big thanks to everybody who worked on it, especially Juta Staes and Rebekka Moyson.

How we created a proof of concept in less than 4 days

We did so by combining some powerful image processing tools with two consecutive object detectors. We use OpenCV to stream, process and modify the camera images. A first model then detects faces. For that we use MTCNN, a pre-trained face detection model. We then trained our own Tiny YOLO model to detect the person’s mouth on each detected face. The assumption is that if we can see a person’s mouth, the person is not wearing a face mask. This two-stage design was useful in several ways:

  • It allowed us to capitalise on pre-trained SOTA models such as the MTCNN face detector.
  • It reduced the problem complexity for the custom mouth detector (detecting a mouth on a cropped face is much easier than on a full room image).
  • Detecting mouths rather than masks allowed us to capitalise on already labeled data. To our knowledge there are no labeled data sets of protective masks available for computer vision. Furthermore, detecting a protective mask is not necessary for this task, as the desired outcome is to detect people who are not wearing one.
  • By using small models like Tiny YOLO we could make it work on a live stream of images.
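
The two-stage decision described above can be sketched in a few lines. This is a minimal illustration, not the code from our repo: the function name `find_unmasked_faces` and the three callables passed to it (`detect_faces`, `crop`, `detect_mouth`) are hypothetical placeholders for the face detector, the cropping step and the mouth detector.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def find_unmasked_faces(
    frame,
    detect_faces: Callable[[object], Sequence[Box]],
    crop: Callable[[object, Box], object],
    detect_mouth: Callable[[object], bool],
) -> List[Box]:
    """Return the boxes of all faces on which a mouth is visible."""
    unmasked = []
    # Stage 1: run the face detector on the full frame.
    for box in detect_faces(frame):
        face = crop(frame, box)
        # Stage 2: run the mouth detector on the cropped face only.
        if detect_mouth(face):
            # A visible mouth is taken to mean "no mask".
            unmasked.append(box)
    return unmasked
```

If the returned list is non-empty, at least one person in the frame is assumed not to be wearing a mask, which is the condition for the sound alert.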

Note that you can find all the code in our Bitbucket repo. Let’s now take a deeper dive into how it was implemented.


Developing this app was done in two stages. First we needed to train our individual models. Then we needed to develop the actual app, which would stream a camera’s images, apply the two consecutive models to do the predictions and make a sound alert in case someone is not wearing a mouth mask.

1. Training

The good news for us was that pre-trained face detection models already exist. We used MTCNN, which is fast and performs well. Here is a quick code snippet showing how you can get started with it.

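import numpy as np
from mtcnn import MTCNN
from PIL import Image

# Load the pre-trained face detector
detector = MTCNN()

path = 'data/image.jpg'

# MTCNN expects an RGB image as a NumPy array
image = Image.open(path).convert('RGB')
image_np = np.array(image)
# Each detection is a dict with 'box', 'confidence' and 'keypoints'
faces = detector.detect_faces(image_np)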

The MTCNN face detection module also detects facial landmarks, including the coordinates of the mouth’s corners. In practice, however, this model always assumes that a mouth is present, and it generates the mouth keypoints even if a person’s mouth is covered. So we decided to train our own mouth detection model instead. We chose to detect mouths rather than masks because we could find a large data set of already labeled data for that task.
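
To pass each detected face to a second model, the detections first have to be turned into crop regions. The sketch below assumes the output format of the `mtcnn` package, where each detection is a dict with a `'box'` entry of the form `[x, y, width, height]` and a `'confidence'` score. The helper names and the `min_confidence` threshold of 0.9 are our own illustrative choices, not values from the project. Boxes are clamped to the image because a detected box can extend past the image border.

```python
def box_to_crop(box, image_width, image_height):
    """Convert an [x, y, width, height] box to clamped (x1, y1, x2, y2)."""
    x, y, w, h = box
    x1 = max(0, x)
    y1 = max(0, y)
    x2 = min(image_width, x + w)
    y2 = min(image_height, y + h)
    return x1, y1, x2, y2


def crop_regions(faces, image_width, image_height, min_confidence=0.9):
    """Return clamped crop regions for all sufficiently confident faces."""
    regions = []
    for face in faces:
        # Skip weak detections (threshold is a hypothetical choice).
        if face["confidence"] < min_confidence:
            continue
        x1, y1, x2, y2 = box_to_crop(face["box"], image_width, image_height)
        # Keep only regions that are non-empty after clamping.
        if x2 > x1 and y2 > y1:
            regions.append((x1, y1, x2, y2))
    return regions
```

With a NumPy image such as `image_np` from the snippet above, a face crop is then `image_np[y1:y2, x1:x2]`.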

The model we wanted to create should take the cropped faces and predict whether there is a mouth on them. We decided to use Tiny YOLO for that part, from the yolov3-tf2 repo. It is a well-performing and computationally efficient model, light enough to be hosted on a Raspberry Pi or a lightweight computer in production environments. We generated training records from the CelebA data set, a large data set of celebrity pictures available on Kaggle that includes annotations such as mouth coordinates. Note that we first ran the MTCNN model on it to crop every face it could detect. Then we adapted the mouth coordinates to the cropped faces and trained our Tiny YOLO on that, using a batch_size of 16 and 5 epochs. This is how we trained our model:

python \
--dataset ./data/train_celeba.tfrecord \
--val_dataset ./data/val_celeba.tfrecord \
--classes ./data/mouth.names \
--num_classes 1 \
--mode fit --transfer darknet \
--batch_size 16 \
--epochs 5 \
--weights ./checkpoints/ \
--weights_num_classes 80
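
The step of adapting the mouth coordinates to the cropped faces, mentioned before the training command, could look roughly like this. It is a sketch under assumptions: we assume the mouth is annotated as two corner points in full-image pixels and that the detector is trained on boxes normalized to the crop. The function name and the padding factors `pad_x` and `pad_y`, which grow the two points into a box, are hypothetical.

```python
def mouth_box_in_crop(left_mouth, right_mouth, face_box, pad_x=0.2, pad_y=0.5):
    """Map two mouth-corner points to a normalized box inside a face crop.

    left_mouth, right_mouth: (x, y) points in full-image pixels.
    face_box: (x, y, width, height) of the face crop in full-image pixels.
    Returns (xmin, ymin, xmax, ymax), each in [0, 1] relative to the crop.
    """
    fx, fy, fw, fh = face_box
    # Shift the landmarks from full-image to crop-relative pixels.
    lx, ly = left_mouth[0] - fx, left_mouth[1] - fy
    rx, ry = right_mouth[0] - fx, right_mouth[1] - fy

    # Grow the two points into a box, scaled by the mouth width.
    mouth_width = abs(rx - lx)
    xmin = min(lx, rx) - pad_x * mouth_width
    xmax = max(lx, rx) + pad_x * mouth_width
    ymin = min(ly, ry) - pad_y * mouth_width
    ymax = max(ly, ry) + pad_y * mouth_width

    def clamp01(value):
        return min(1.0, max(0.0, value))

    # Normalize by the crop size and keep the box inside the crop.
    return (
        clamp01(xmin / fw),
        clamp01(ymin / fh),
        clamp01(xmax / fw),
        clamp01(ymax / fh),
    )
```

The padding matters because two corner points alone describe a line, not a box; how much padding works best would have to be checked on the data.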

Here is an example showing green boxes for the detected faces wearing a mask and red boxes when no mask is detected.


2. App

Now let’s look at how we built the demo pipeline. The demo will stream a camera’s images, apply predictions on it and play a sound alert if somebody is not wearing a mask. Let’s see what’s under the hood of this pipeline.

There are two ways of running the app. For testing you can run it locally on your PC, while for the setup at our office we run it on a Raspberry Pi.

Once the app is started, it begins streaming the camera’s images. If you are running it on your computer, it will automatically use your webcam. On a Raspberry Pi it will connect to your Pi camera. A nice feature is that each new frame is only taken once all operations on the previous frame are done, which means that the application will not freeze if your models are a bit slow.
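
The frame-by-frame behaviour described above amounts to a simple synchronous loop. The sketch below is illustrative only: `run_loop`, `run_webcam` and the `process_frame` callback are hypothetical names, and the webcam variant assumes OpenCV's `cv2.VideoCapture`, whose `read()` method returns a success flag together with the frame.

```python
def run_loop(read_frame, process_frame, max_frames=None):
    """Read a frame, process it fully, then read the next one.

    read_frame: callable returning (ok, frame), like cv2.VideoCapture.read.
    process_frame: callable that runs the models and the alert on one frame.
    Returns the number of frames processed.
    """
    processed = 0
    while max_frames is None or processed < max_frames:
        ok, frame = read_frame()
        if not ok:
            # The camera stopped delivering frames.
            break
        # The next frame is only requested after this call returns,
        # so slow models lower the frame rate instead of piling up work.
        process_frame(frame)
        processed += 1
    return processed


def run_webcam(process_frame, camera_index=0):
    """Run the loop on the default webcam (assumes opencv-python is installed)."""
    import cv2  # imported here so run_loop stays usable without OpenCV

    capture = cv2.VideoCapture(camera_index)
    try:
        return run_loop(capture.read, process_frame)
    finally:
        capture.release()
```

Passing the frame source in as a callable keeps the loop independent of the camera, so the same loop could serve both the webcam and the Pi camera setup.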


It was challenging to come up with a model that detects face masks. Two key design choices helped us solve it:

  • Flipping the problem around. Detecting mouths instead of masks allowed us to use existing data sets without having to label anything ourselves.
  • The two-stage model combination allowed us to capitalize on an existing model and focus the custom part on a simpler task. The task would have been much harder if we had needed to detect mouths on the entire image.

That’s why one of the takeaways from this project is to be creative in your model architecture design, even though using two models brings more risk of one of them failing.

Note that this app does not store any data. It runs purely on the edge and is meant as an example of the possibilities of Computer Vision. When putting such an app in production, we strongly suggest carrying out a Data Protection Impact Assessment (Art. 35 GDPR) to identify, manage and mitigate the privacy risks related to this technology.