ML6 Internship: Pedestrian Tracking Over Multiple Non-Overlapping Camera Viewpoints

14 February 2020, 09:17

by Jules Talloen

During my 8-week summer internship at ML6 I worked on a proof of concept of pedestrian tracking over multiple non-overlapping camera viewpoints*. In this post I will describe the process that resulted in a demo web app deployed on Google Cloud Platform.

In the final demo it is possible to upload videos of different camera viewpoints and detect unique identities in each video. Together these unique identities form a gallery of each person seen in the whole network. It is then possible to query a certain identity within this gallery and retrieve results of the same or similar identities, seen from other cameras. The process is shown in the image below.

The pedestrian tracking over multiple non-overlapping camera viewpoints process. Source:


Pedestrian tracking over multiple non-overlapping camera viewpoints is a combination of 3 techniques and research areas: object detection, multiple object tracking (MOT) and re-identification (ReID). In the sections below each building block will be described in more detail.

The pedestrian tracking building blocks.
The research areas involved in pedestrian tracking.

Object detection

Object detection is required to detect each person in each video frame. The model receives an image as input and returns a bounding box and a confidence score for each detected pedestrian. Many accurate pre-trained models exist. Faster R-CNN and YOLOv3 are both decent choices: Faster R-CNN, a two-stage detector, is a bit slower but more accurate, whereas YOLOv3 is a single-shot detector that trades some accuracy for speed.
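To make the detector output concrete, here is a minimal sketch of how the detections of one frame could be represented and filtered. The `Detection` class, the `keep_pedestrians` helper and the 0.5 threshold are illustrative assumptions, not the data structures used in the demo.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One detector output: a box in pixel coordinates plus a confidence score."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    score: float  # detector confidence in [0, 1]
    label: str    # class name, e.g. "person"


def keep_pedestrians(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Keep only confident 'person' detections (the threshold is an assumed value)."""
    return [d for d in detections if d.label == "person" and d.score >= min_score]


frame_detections = [
    Detection(10, 20, 60, 180, 0.92, "person"),
    Detection(200, 40, 250, 190, 0.31, "person"),    # too uncertain, dropped
    Detection(300, 100, 420, 160, 0.88, "bicycle"),  # not a pedestrian, dropped
]
pedestrians = keep_pedestrians(frame_detections)
```

Only the first detection survives the filter. In practice the threshold is a trade-off: a lower value finds more pedestrians but also passes more false positives on to the tracker.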

Multiple object tracking

The second building block, multiple object tracking, is crucial to build an accurate gallery of unique identities. Object detection is applied separately on each frame and has no knowledge of temporal relations between those frames. It is the tracker that has to encode this temporal information by linking detections of each object throughout the whole video. An example for pedestrians is shown in the GIF below.

An example of pedestrian tracking in a single video. Each identity gets assigned a unique ID (and color) for the whole duration they are visible in the video.

A popular and accurate tracker is Deep SORT. The tracker is a combination of 2 techniques: a Kalman filter and a visual appearance encoder. The Kalman filter predicts the position of each pedestrian in the next frame based on the detected bounding boxes in the previous frames.

Kalman filtering, also known as linear quadratic estimation (LQE), is an algorithm that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe.

A Kalman filter predicts the position and velocity of a subject based on previous sensor readings. Source:

Almost every pedestrian will follow a regular path throughout the video. Based on this idea the Kalman filter is able to accurately predict the next position of each pedestrian, based on the previously detected bounding boxes. This next position is then used to select a detected bounding box in the next frame that most likely belongs to the path of the pedestrian.
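The prediction step can be sketched with a small constant-velocity Kalman filter. To keep it short, this sketch tracks a single coordinate (for example the x position of the box centre) instead of a full bounding box, and the noise values and initial covariance are assumptions chosen for illustration. It is not the filter configuration used by Deep SORT.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate with state [position, velocity]."""

    def __init__(self, first_position, process_noise=0.01, measurement_noise=0.1):
        self.p = first_position  # estimated position
        self.v = 0.0             # estimated velocity (unknown at the start)
        # 2x2 covariance; the large velocity variance encodes that uncertainty.
        self.P00, self.P01, self.P10, self.P11 = 1.0, 0.0, 0.0, 100.0
        self.q = process_noise
        self.r = measurement_noise

    def predict(self, dt=1.0):
        """Move the state one time step ahead and return the predicted position."""
        self.p += dt * self.v
        P00 = self.P00 + dt * (self.P01 + self.P10) + dt * dt * self.P11 + self.q
        P01 = self.P01 + dt * self.P11
        P10 = self.P10 + dt * self.P11
        P11 = self.P11 + self.q
        self.P00, self.P01, self.P10, self.P11 = P00, P01, P10, P11
        return self.p

    def update(self, measured_position):
        """Correct the predicted state with the position of the matched detection."""
        residual = measured_position - self.p
        s = self.P00 + self.r  # innovation variance
        k0 = self.P00 / s      # Kalman gain for the position
        k1 = self.P10 / s      # Kalman gain for the velocity
        self.p += k0 * residual
        self.v += k1 * residual
        P00 = (1 - k0) * self.P00
        P01 = (1 - k0) * self.P01
        P10 = self.P10 - k1 * self.P00
        P11 = self.P11 - k1 * self.P01
        self.P00, self.P01, self.P10, self.P11 = P00, P01, P10, P11


# A pedestrian moving one unit per frame, first seen at position 0.
track = ConstantVelocityKalman1D(first_position=0.0)
for observed_x in [1.0, 2.0, 3.0, 4.0]:
    track.predict()
    track.update(observed_x)

predicted_x = track.predict()  # roughly 5.0 for this regular path
candidate_detections = [4.9, 9.0, 1.5]
matched = min(candidate_detections, key=lambda x: abs(x - predicted_x))
```

The detection closest to the predicted position is assigned to the track. A real tracker solves this assignment jointly for all tracks and detections rather than greedily per track.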

Unfortunately a Kalman filter alone is not enough for accurate tracking. Issues such as irregular paths or occlusion cause techniques purely based on movement to fail. An example of such a case is shown below.

The Kalman filter fails in case of irregular paths and occlusion. The pedestrians don’t follow a straight path throughout the video and some pass behind a sign rendering them invisible to the object detector for a couple of frames.

By adding a visual appearance encoder we can overcome the previously mentioned issues. This encoder is a convolutional neural network that extracts a feature vector from an image. In this case the image is the bounding box image of the target pedestrian. In the latent space imposed by these feature vectors, two extracted features corresponding to the same identity are likely to be closer than features from different identities. Closeness is measured using a distance metric such as cosine distance. The desired structure of the latent space is achieved through re-parametrization of the conventional softmax classifier. This enforces a cosine similarity on the representation space when trained to identify the unique identities in the training set. Now, when a pedestrian follows an irregular path or is occluded for a couple of frames, the encoder is still able to track them based on their appearance. By calculating the distances between the currently tracked pedestrians it is possible to detect which pedestrians are the same.
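As a rough illustration of appearance-based matching, the sketch below compares the feature vector of a new detection with the stored features of existing tracks using cosine distance. The three-dimensional vectors, the `match_by_appearance` helper and the 0.3 threshold are made up for the example; real encoders produce vectors with many more dimensions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def match_by_appearance(track_features, detection_feature, max_distance=0.3):
    """Return the ID of the closest track, or None if no track is close enough."""
    best_id, best_distance = None, max_distance
    for track_id, feature in track_features.items():
        distance = cosine_distance(feature, detection_feature)
        if distance < best_distance:
            best_id, best_distance = track_id, distance
    return best_id


# Toy feature vectors for two tracked pedestrians.
track_features = {1: [0.9, 0.1, 0.0], 2: [0.0, 0.2, 0.9]}

reappeared = [0.8, 0.2, 0.1]  # looks like track 1 after an occlusion
stranger = [0.1, 0.9, 0.1]    # resembles neither track

same_person = match_by_appearance(track_features, reappeared)
new_person = match_by_appearance(track_features, stranger)
```

The reappearing pedestrian is matched to track 1, while the stranger matches no track and would start a new one.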

We are now left with one more issue: the feature extractor sometimes fails to separate the object from the background. As a consequence the feature vector contains background information causing similar, but not the same, identities with the same background to be matched. A simple fix is background subtraction. Using a KNN or MOG background subtractor we get a mask which can be fed to the feature extractor to only look at the foreground when extracting the feature vector.
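The masking step can be sketched as follows, assuming a binary foreground mask has already been produced by the background subtractor (OpenCV, for instance, offers `cv2.createBackgroundSubtractorKNN()` and `cv2.createBackgroundSubtractorMOG2()`, whose `apply(frame)` method returns such a mask). The plain nested lists and the `apply_foreground_mask` helper are simplifications for illustration; how exactly the mask was fed to the feature extractor in the demo is not shown here.

```python
def apply_foreground_mask(crop, mask):
    """Zero out background pixels; crop and mask are equally sized 2D lists."""
    return [
        [pixel if keep else 0 for pixel, keep in zip(pixel_row, mask_row)]
        for pixel_row, mask_row in zip(crop, mask)
    ]


# Toy grayscale crop of a bounding box: the bright pixels belong to the person.
crop = [
    [12, 200, 15],
    [11, 210, 220],
    [10, 190, 14],
]
# 1 marks foreground, 0 marks background, as a background subtractor would report.
mask = [
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]
foreground_only = apply_foreground_mask(crop, mask)
```

After masking, two people standing in front of the same wall no longer share any background pixels, so their feature vectors depend only on their own appearance.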

To conclude, the combination of the movement and visual appearance information of pedestrians allows the Kalman filter together with the background subtractor and feature extractor to accurately track unique identities. This results in a gallery with a set of bounding box images of each identity.


Re-identification

The third, and final, building block is re-identification (ReID).

The core issue of re-identification is to seek the occurrences of a query person (probe) from a set of person candidates (gallery), where probe and gallery are captured from different non-overlapping camera views.

It is important to note that only now we are combining multiple camera viewpoints. The previous 2 steps were performed in the context of a single camera viewpoint only.

Until now we have created a gallery of identities from multiple camera viewpoints. Since the tracking was performed on a per-video basis, some identities will occur multiple times in the gallery, each time as seen from another camera viewpoint. The goal is to, once again, link these identities together to form a cross-camera track. If the locations of the cameras and the times at which each person was seen by each camera are known, we can estimate each person’s location on the map over time.
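Once the per-camera identities of one person are linked, turning them into a path on the map is mostly bookkeeping. The sketch below assumes every sighting carries a camera ID and a timestamp and that camera coordinates are known; the `Sighting` class, the camera names and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sighting:
    camera_id: str
    timestamp: float  # seconds since a reference time shared by all cameras


def build_trajectory(sightings, camera_locations):
    """Order the sightings in time and replace each camera by its map location."""
    ordered = sorted(sightings, key=lambda s: s.timestamp)
    return [(s.timestamp, camera_locations[s.camera_id]) for s in ordered]


# Hypothetical camera positions as (latitude, longitude).
camera_locations = {"cam_a": (51.05, 3.72), "cam_b": (51.06, 3.73)}

# Sightings of one re-identified person, in arbitrary order.
sightings = [Sighting("cam_b", 120.0), Sighting("cam_a", 30.0)]
trajectory = build_trajectory(sightings, camera_locations)
```

The result says the person was near `cam_a` first and near `cam_b` 90 seconds later. Positions between cameras remain unknown because the viewpoints do not overlap.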

Similar to the visual appearance encoder for MOT, a feature vector is extracted for each identity. Again, this allows distance metrics to be used on the vector embeddings. The difference is that now we have more than one bounding box image for each identity. This makes it possible to filter the images to remove outliers or faulty detections, and also to use temporal information. With a high enough frame rate it is possible to perform gait analysis, but this is out of scope for this demo.
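One simple way to exploit having several images per identity is to drop feature vectors that lie far from the rest and average the remainder into a single embedding. This is only a sketch of that idea: the distance threshold and the mean-based outlier test are assumptions, not the filtering used in the demo.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equally long vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identity_embedding(features, max_distance):
    """Average the features after dropping those far from the initial mean."""
    centre = mean_vector(features)
    inliers = [f for f in features if euclidean(f, centre) <= max_distance]
    # Fall back to all features if the threshold rejected everything.
    return mean_vector(inliers or features)


# Three consistent crops of one person and one faulty detection.
features = [[1.0, 0.0], [1.2, 0.0], [0.8, 0.0], [5.0, 4.0]]
embedding = identity_embedding(features, max_distance=2.5)
```

The faulty detection at `[5.0, 4.0]` is discarded and the embedding ends up at roughly `[1.0, 0.0]`. Because the initial mean is itself pulled towards outliers, a median-based centre may be a more robust choice when faulty detections are frequent.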

Even though we have multiple images of each identity, re-identification is a much more challenging task due to severe appearance changes across different camera viewpoints. Differences in camera position, illumination, color balance, occlusion, resolution, body pose… make the same identities look very different.

The re-identification research area can be further divided according to various criteria: single-shot vs multi-shot, feature based vs metric based, hand crafted features vs deeply learned features, contextual vs non-contextual, end-to-end vs separate, open set vs closed set… Below are some of the milestones.

Some re-identification milestones.
The OSNet architecture.

The research area is very active and diverse, with dozens of papers released every year, each claiming to achieve state of the art performance on major datasets. In May 2019 Kaiyang Zhou et al. released their Omni-Scale Feature Learning for Person Re-Identification paper. They use multiple convolutional feature streams, each detecting features at a certain scale. These features are then combined using a unified aggregation gate that dynamically fuses the multi-scale features using input-dependent channel-wise weights. To efficiently learn spatial-channel correlations and avoid overfitting, the model uses both pointwise and depthwise convolutions. Because it is built by stacking these factorized convolutions layer by layer, OSNet is extremely lightweight. Despite its small size the model is able to achieve state of the art performance on six person ReID datasets.
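To get a feeling for why splitting a convolution into a pointwise and a depthwise part keeps the model small, the sketch below counts the weights of a standard 3x3 convolution and of its factorized counterpart. The 256 input and output channels are an illustrative assumption and biases are ignored.

```python
def standard_conv_params(kernel, channels_in, channels_out):
    """Weights of a regular k x k convolution (biases ignored)."""
    return kernel * kernel * channels_in * channels_out


def factorized_conv_params(kernel, channels_in, channels_out):
    """Weights of a 1x1 pointwise convolution followed by a k x k depthwise one."""
    pointwise = channels_in * channels_out      # 1x1 convolution mixes the channels
    depthwise = kernel * kernel * channels_out  # one k x k filter per channel
    return pointwise + depthwise


standard = standard_conv_params(3, 256, 256)
factorized = factorized_conv_params(3, 256, 256)
reduction = standard / factorized
```

For this layer size the factorized version needs 67,840 weights instead of 589,824, a reduction by a factor of about 8.7.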

Below is an image showing that the model is able to detect the subtle differences between visually similar identities. Both identities are wearing black shorts and a white t-shirt but the one on the left had something printed on the t-shirt. The activation maps show that the model is able to learn discriminative features to distinguish between these 2 identities. Because of the multi-scale architecture the local pattern on the t-shirt was captured but within the context of the white t-shirt as a whole. In contrast, single scale models tend to only focus on local regions and ignore the context around it.

Activation maps of the OSNet model. Each triplet contains, from left to right, the original image, the activation map of OSNet and the activation map of a single scale model.

Re-identification is now just a matter of computing the pairwise distances between all identities in latent space. In the case of OSNet, a simple Euclidean distance metric suffices. These distances can be pre-computed so that when a re-identification request is made, only a ranking of the top matches is required.
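The pre-computation and ranking can be sketched in a few lines. The gallery below uses made-up two-dimensional embeddings and hypothetical identity names; in a real gallery each identity would have a much longer feature vector.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def precompute_distances(gallery):
    """Map every ordered pair of distinct identities to their distance."""
    ids = list(gallery)
    return {
        (a, b): euclidean(gallery[a], gallery[b])
        for a in ids
        for b in ids
        if a != b
    }


def top_matches(query_id, distances, k=2):
    """Rank the other identities by their pre-computed distance to the query."""
    candidates = [(other, d) for (query, other), d in distances.items() if query == query_id]
    return sorted(candidates, key=lambda pair: pair[1])[:k]


# Toy gallery: identity name -> embedding.
gallery = {
    "cam1_person3": [0.0, 0.0],
    "cam2_person7": [0.1, 0.0],
    "cam2_person9": [3.0, 4.0],
    "cam3_person1": [1.0, 0.0],
}
distances = precompute_distances(gallery)
ranking = top_matches("cam1_person3", distances, k=2)
```

Here `cam2_person7` is ranked first, followed by `cam3_person1`. Note that a ranking always returns the closest identities, even when the queried person never appeared on another camera, so a distance threshold is needed to tell a true match from a merely similar-looking person.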

Next steps and other use cases

The approach described above is missing one important bit of information: pose information. By incorporating the pose of each person and feeding this information to the model, it is possible to get much better feature representations. The model would then be able to split its features over the different body parts regardless of differences in pose. This can reduce re-identification failures caused by variety in body poses.

Pose estimation.

On top of pose estimation there are many other interesting techniques to investigate and there is plenty of room for improvement over the coming years.


The architecture of the final web application deployed on Google Cloud Platform is shown below. Through a browser a user can visit the frontend, request data from Cloud Storage and retrieve metadata from Firestore. The frontend is built using React with authentication and authorization from Firebase. Through the frontend the microservice API, running on Kubernetes Engine exposed through Cloud Endpoints, is invoked. The microservice calls the TensorFlow Serving container to perform object detection.
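As an illustration of the last hop in this architecture, the sketch below builds a prediction request for TensorFlow Serving's REST API, which accepts a JSON body with an `instances` list at `/v1/models/<model name>:predict`. Whether the demo used REST or gRPC is not stated here, and the host name, port and model name are hypothetical.

```python
import json
import urllib.request


def build_predict_request(host, model_name, image_pixels):
    """Create a POST request for the TensorFlow Serving REST predict endpoint."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": [image_pixels]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A 1x1 black RGB image stands in for a real video frame.
request = build_predict_request("tf-serving:8501", "pedestrian_detector", [[[0, 0, 0]]])

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     predictions = json.loads(response.read())["predictions"]
```

Keeping the model behind its own serving container means the microservice only deals with HTTP and JSON, and the detector can be swapped or scaled independently.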

Below are some screenshots of the web frontend.

The page listing all identities together with their most representative images. It includes their assigned ID, by which camera they were seen and when they were seen.
Selecting an identity will bring you to its detail page. Here you can perform re-identification within a certain area, within a certain time period. The top matches are shown below. In this case the person was not seen by any other cameras so similar looking people were returned (dark clothing).
In this case the person was seen by another camera so it is listed below after clicking the ‘Re-identify’ button.
The cameras listing page shows a list of all cameras together with their location and status.


With the support of many helpful people at ML6 and the ease of use of Google Cloud Platform, I managed to deploy a fully working demo in only 8 weeks. I learned a lot about pedestrian tracking through extensive research and existing expertise within ML6. Furthermore I got my first hands-on experience with Kubernetes and lots of Google Cloud services. On top of acquiring all this new knowledge I also had a great time during those 8 weeks. Everyone was very friendly and always ready to help out and there was also plenty of room for amusement and jokes.

*Note on ethics

At ML6, we firmly believe in the application of AI to benefit society. With “do good” as one of our core values, we are committed to building reliable, safe technology and protecting it in a sustainable way. That’s why we invest in, and are committed to, our trustworthy AI principles. This internship is highly linked to Ensuring Technical Robustness. It’s an educational project (not deployed on real data) that gives us more insight into the possible risks, but also the potential benefits, of certain technologies. These kinds of educational projects are essential to help us better understand and measure the pros and cons and assess the net societal impact if similar solutions were productionized. One of the potential benefits of this solution is that it allows for pseudonymization of video content: we can blur all people in a video and replace them with unique identifiers that can potentially be linked to identifiers in other videos (rather than storing the full non-pseudonymized data, as is currently the case).

Thanks to Juta Staes.