This summer I did a six-week internship at ML6 on temporal action localization (TAL) for tennis videos. The main goal was to learn more about TAL and evaluate its current state with respect to industry applicability. It has definitely been a stimulating experience and I will gladly share some of my most interesting learnings. In the first part, I will briefly explain the application domain. Next, I will discuss the three main strategies used in research to perform TAL. Lastly, I will take a closer look at the Movinet model family and finetune a Movinet stream model.
It is well known that big tennis tournaments use a system called Hawk-Eye to perform ball tracking. What most people don't know is that this technology is also used for player analysis. It is an extremely expensive system and it is definitely not feasible to install it on every amateur tennis court. It would therefore be useful to be able to do this analysis on a single-camera RGB feed. A first step is to temporally locate actions in an unclipped tennis stream, preferably in real time.
Tennis action recognition is an interesting domain as it consists of very short actions. In comparison, benchmark data sets such as AVA and ActivityNet contain clips that sometimes last minutes. I am using the data from the paper by Hayden Faulkner et al.¹ It consists of five full-length tennis matches in which every frame is labeled with one of 11 classes, distinguishing between hits and serves and between which player performed the action.
The following GIF shows what the end result should look like. The GIF is again from the paper by Hayden Faulkner et al.¹ I will be comparing my model against their implementation. They report an F1 score of 55.7 for their CNN-RNN model. Additionally, their model is compute-intensive as it employs a temporal pooling sliding window. This is the baseline that I am trying to improve.
The first technique is temporal pooling, often used in research papers on video classification models to get benchmark scores on spatio-temporal labeling data sets such as AVA. I make a distinction between two definitions. The first definition uses a sliding window to label a video frame by frame. Essentially, for each frame to be labeled, it takes a window of size N around that frame and puts that sequence of frames through a video classification model. The output is a single label that is assigned to only that center frame. Figure 1 shows a visualization of this process. This is what the paper by Hayden Faulkner et al.¹ uses, which explains why their model is quite slow and not able to process frames in real time, something that would be highly preferable for our use case.
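To make this definition concrete, here is a minimal sketch in Python; `classify_clip` stands in for an arbitrary video classification model and is purely an assumption for illustration.

```python
def sliding_window_labels(frames, classify_clip, window_size):
    """Temporal pooling, definition 1: label each frame with the prediction
    for a window of `window_size` frames centred on it."""
    half = window_size // 2
    labels = []
    for i in range(len(frames)):
        # Clamp the window to the video boundaries.
        start = max(0, i - half)
        end = min(len(frames), i + half + 1)
        labels.append(classify_clip(frames[start:end]))
    return labels
```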
The second definition splits up an unclipped video into equal-length clips. Each clip is then put through a video classification model and its result is assigned to each of the frames in that input clip. A visualization of the process can be found in Figure 2.
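The same hypothetical `classify_clip` model illustrates the second definition:

```python
def clip_pooling_labels(frames, classify_clip, clip_length):
    """Temporal pooling, definition 2: split the video into equal-length
    clips and assign each clip's prediction to all frames in that clip."""
    labels = []
    for start in range(0, len(frames), clip_length):
        clip = frames[start:start + clip_length]
        labels.extend([classify_clip(clip)] * len(clip))
    return labels
```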
Interestingly, this second definition is used in the paper on SlowFast², a SOTA video classification model. It is used as a 'hack' to be able to use the model on the AVA data set and it works as follows. First, they use temporal pooling and take clips of one second from the unclipped video. Then, for each clip, they use Detectron2, FAIR's library for SOTA detection and segmentation algorithms, to find bounding boxes for all of the people in the frame. Lastly, the areas inside these bounding boxes are put through the SlowFast model, which results in a set of class probabilities for each bounding box. A visualization of the result can be seen in the following GIF. This is obviously not a good strategy for our use case. Tennis actions do not have a standard length, as serves take more frames than hits. Moreover, actions can be really short, and start and end timestamps need to be extremely accurate because actions happen in quick succession. That is not possible using temporal pooling, since the start and end timestamps of an action can only fall on the boundaries of these equal-length clips.
The second TAL technique uses an action proposal generator followed by a video classification model. An action proposal generator takes in an unclipped video and returns the temporal regions with a high likelihood of containing an action. This allows us to use a video classifier on these temporal regions to get a label for them. A visualization can be found in Figure 3. This strategy borrows its structure from two-stage object detectors, which first propose spatial regions of interest in an image and then classify each region in a second step. Unfortunately, pretrained action proposal generators are extremely scarce. There are a lot of papers on these kinds of models and their benchmark scores are promising, but the code is generally lacking.
The third and last common strategy is the end-to-end model. These models are fairly new, and even though there are some interesting papers and good reported results, the code and documentation needed to replicate those results are often incomplete and buggy.
Knowing this, I can draw some conclusions. Firstly, at the time of writing it is very unlikely that I will find a pretrained end-to-end model that I can finetune on our data. Secondly, temporal pooling is out of the picture: depending on the definition used, it is either too slow or gives too inaccurate start and end timestamps for an action. This leads to the conclusion that an action proposal generator followed by a video classifier is the most viable route at this point in time. For the video classification part of this strategy, I will take a closer look at the Movinet model family in the next section.
The Movinet³ model family was published in March 2021 by Google Research. It tries to solve a recurring problem with video classification models: how resource-hungry they are. Figure 4 shows how Movinet compares in resource usage to other SOTA video classification models such as X3D.
This is very impressive and would enable us to do real-time action recognition. What is even more interesting is that they offer a 'stream' variant of the Movinet models. With this variant, I am able to process a clip frame by frame, as can be seen in Figure 5. The model takes an input state and a frame, or sequence of frames, and outputs a label and an updated state. This yields a label for every frame, although the intended use is still to give just one label for the entire clip by taking the last returned prediction. In the next parts of this article I will investigate whether it is also possible to use this model on unclipped videos and thus return multiple labels per video.
I will be finetuning a Movinet a0 stream model pretrained on the Kinetics 600 data set. As TensorFlow does not provide any documentation on finetuning the stream variant, this article can be read as a tutorial. To provide data for finetuning I use the Keras VideoFrameGenerator class from the keras-video-generators package. My training and validation generators are called train and valid respectively.
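A possible generator setup could look as follows; the glob pattern, frame count, and batch size are placeholders, and the keyword arguments follow the keras-video-generators README, so they may need adjusting to the installed version of the package.

```python
from keras_video import VideoFrameGenerator

NBFRAME = 20          # number of frames per sequence (placeholder)
SIZE = (172, 172)     # a0's default input resolution
BS = 8                # batch size (placeholder)

# The glob pattern assumes one folder per class containing clipped videos;
# classes are deduced from the folder names.
train = VideoFrameGenerator(
    glob_pattern='data/{classname}/*.mp4',
    nb_frames=NBFRAME,
    target_shape=SIZE,
    batch_size=BS,
    split_val=0.2,
    shuffle=True,
    use_frame_cache=False)
valid = train.get_validation_generator()
```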
The first thing that I need to do is load the model architecture from the Movinet source code⁴ as can be seen in the following code block.
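A minimal sketch of that step, assuming the TensorFlow Model Garden is installed and importable under the path from [4]; the backbone is built in its causal, external-state configuration so it can later be used as a stream model, and it is first wrapped with the original 600 Kinetics classes so the pretrained checkpoint can be restored one-to-one.

```python
from official.vision.beta.projects.movinet.modeling import movinet
from official.vision.beta.projects.movinet.modeling import movinet_model

# Build the a0 backbone in its causal (stream) configuration; external states
# let us pass the stream state in and out explicitly.
backbone = movinet.Movinet(
    model_id='a0',
    causal=True,
    use_external_states=True)

# Wrap it in a classifier with the original 600 Kinetics classes.
model = movinet_model.MovinetClassifier(
    backbone, num_classes=600, output_states=True)
model.build([1, 1, 172, 172, 3])
```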
Then, I need to load the weights from the model pretrained on Kinetics 600. You can find those weights on TensorFlow Hub.
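A sketch of restoring the checkpoint, assuming it has been downloaded and extracted into a local movinet_a0_stream directory (the path is a placeholder):

```python
import tensorflow as tf

# Restore the pretrained Kinetics 600 weights into the architecture above.
checkpoint_path = tf.train.latest_checkpoint('movinet_a0_stream')
checkpoint = tf.train.Checkpoint(model=model)
status = checkpoint.restore(checkpoint_path)
status.assert_existing_objects_matched()
```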
Next, I need to wrap these pretrained weights into a new model with a new classifier head so that finetuning on tennis data becomes possible. For this, I create a new Movinet classifier that reuses the backbone of the previous model and adds a new classifier head sized to the number of classes in my training data generator.
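A possible implementation, assuming the training generator exposes its class list as `train.classes` and NBFRAME is the sequence length defined earlier:

```python
def build_classifier(backbone, num_classes):
    """Reuses the (pretrained) backbone and adds a freshly initialised head."""
    classifier = movinet_model.MovinetClassifier(
        backbone=backbone,
        num_classes=num_classes,
        output_states=True)
    classifier.build([1, NBFRAME, 172, 172, 3])
    return classifier

# 11 tennis classes in this data set.
classifier = build_classifier(backbone, num_classes=len(train.classes))
```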
Next, I need to create the input layers for the model: one input layer for the frame or frame sequence and one input layer per input state. These layers are then provided as input to the Movinet classifier.
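A sketch of those input layers, reading the state signature off the classifier's initial states (the pattern follows the Movinet export code; any non-positive dimension is treated as dynamic):

```python
# One Keras input for the frame sequence.
image_input = tf.keras.layers.Input(
    shape=[None, None, None, 3], dtype=tf.float32, name='image')

# One Keras input per internal state tensor, with shapes taken from the
# classifier's initial-state dictionary.
state_shapes = {
    name: ([s if s > 0 else None for s in state.shape], state.dtype)
    for name, state in classifier.init_states(
        [1, NBFRAME, 172, 172, 3]).items()
}
state_inputs = {
    name: tf.keras.layers.Input(shape=shape[1:], dtype=dtype, name=name)
    for name, (shape, dtype) in state_shapes.items()
}

# Feed everything to the classifier; the output is (logits, updated states).
inputs = {**state_inputs, 'image': image_input}
outputs = classifier(inputs)
```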
To be able to finetune this classifier using the Keras API, I need to wrap it in a TensorFlow model. However, our data generators only provide a label as target value, so I need to discard the state output during training using a custom training step. Finally, I wrap the classifier in this custom model class.
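A minimal sketch of such a wrapper; a matching test step is included so that Keras validation works the same way (the class name is my own):

```python
class MovinetFinetuneModel(tf.keras.Model):
    """Keras model that drops the state output in its train and test steps."""

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            logits, _ = self(x, training=True)  # discard the updated states
            loss = self.compiled_loss(
                y, logits, regularization_losses=self.losses)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.compiled_metrics.update_state(y, logits)
        return {m.name: m.result() for m in self.metrics}

    def test_step(self, data):
        x, y = data
        logits, _ = self(x, training=False)  # discard the updated states
        self.compiled_loss(y, logits, regularization_losses=self.losses)
        self.compiled_metrics.update_state(y, logits)
        return {m.name: m.result() for m in self.metrics}


# Wrap the functional graph built above in the custom model class.
model = MovinetFinetuneModel(inputs, outputs)
```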
Since I only want to finetune the classifier head, I freeze all other layers.
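One way to do this, given that the new head is the only part of the classifier outside the backbone, is to freeze the backbone as a whole:

```python
# Freeze the pretrained backbone; only the new classifier head stays trainable.
backbone.trainable = False
```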
This model takes as input a dictionary that contains both a frame, or sequence of frames, and the input states. These input states can be created as initial states using the Movinet source code. The target value should be a single label that is the same for every frame in the sequence. You can create the initial states using the following code, in which NBFRAME stands for the number of frames in the frame sequence.
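A sketch of that call, using a0's default 172x172 resolution:

```python
# Initial states for a batch of one sequence of NBFRAME frames at 172x172.
init_states = classifier.init_states([1, NBFRAME, 172, 172, 3])
```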
The data to finetune this model should then be provided in the following format.
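One way to get the generators into that shape is a thin wrapper that tiles the batch-size-1 initial states to the actual batch size and yields the input dictionary next to the one-hot labels; the compile and fit settings below (loss, optimizer, learning rate, epoch count) are illustrative assumptions rather than the exact values used.

```python
def with_states(sequence, init_states):
    """Yields ({'image': frames, **states}, labels) batches indefinitely."""
    while True:
        for i in range(len(sequence)):
            frames, labels = sequence[i]
            # Tile the batch-size-1 initial states to the actual batch size.
            states = {
                name: tf.tile(
                    state, [frames.shape[0]] + [1] * (len(state.shape) - 1))
                for name, state in init_states.items()
            }
            yield {**states, 'image': frames}, labels


model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    metrics=['accuracy'])

model.fit(
    with_states(train, init_states),
    validation_data=with_states(valid, init_states),
    steps_per_epoch=len(train),
    validation_steps=len(valid),
    epochs=10)
```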
After finetuning this model, I can show some results using it as a video classifier. The following table represents a confusion matrix for the different classes.
The accuracy on the test set is 82.69%, which is quite good but can definitely be improved with some more hyperparameter tweaking. There are some obvious mistakes that the model makes. The first is that it has difficulty distinguishing between a right (HFR) and left (HFL) hit for the far player. Presumably, this has to do with the low input resolution of this model variant, since the problem does not occur for the near player. Using a larger Movinet variant such as a4 or a5 would likely resolve this. Secondly, it has difficulty distinguishing between faulty and correct serves, most likely because faulty serves are underrepresented in the data set. The model already uses class weights, and this does not seem to improve the result by a large margin. The only solution that I see here is to gather more labeled data.
I also tested the Movinet model on unclipped videos by passing the video frame by frame through the model while updating the state. Unfortunately, the results are not great. It can distinguish whether a rally is happening, but the precise action classification is mostly off. Only on rare occasions is it able to produce accurate labels. This makes it clear that the Movinet stream model in its original form cannot yet be used on unclipped videos. This is to be expected, as the paper also only discusses results on clipped videos.
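For reference, the frame-by-frame loop looks roughly as follows; `video_frames` is an assumed iterable of preprocessed 172x172 RGB frames.

```python
# Start from fresh initial states and carry the state across frames.
states = classifier.init_states([1, 1, 172, 172, 3])
predictions = []
for frame in video_frames:  # each frame: a (172, 172, 3) float tensor
    inputs = {**states, 'image': frame[tf.newaxis, tf.newaxis, ...]}
    logits, states = classifier(inputs)
    predictions.append(int(tf.argmax(logits, axis=-1)[0]))
```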
With a good action proposal generator, however, this Movinet model becomes useful, as it can classify clips of actions quite accurately within a very reasonable time frame. This would make real-time labeling achievable, and in the worst case I can still rely on a small buffer so that the action proposal generator can more easily define the start and end points of actions.
The result of this project is clear: there is a gap between research and industry in the field of temporal action localization. As explained in the previous paragraphs, research papers report impressive accuracy on TAL data sets, but pretrained models and well-structured codebases are lacking. Additionally, the models that can be applied are often quite slow and compute-intensive, which makes it almost impossible to use them in actual use cases. This makes it difficult for machine learning engineers to transfer the knowledge gained through academic research to industry. In this article I provide a short overview of the different TAL strategies used in academic research. From this, I conclude that an action proposal generator followed by a video classifier seems to be the most feasible strategy at this moment in time. With the purpose of finding an implementation of this strategy for tennis action recognition, I show how to finetune a Movinet stream model and how it performs on simple classification tasks. The accuracy is acceptable, and finding a proper action proposal generator is the next step in this quest for valuable tennis temporal action localization.
This blog post was supervised by ML6 engineer Jules Talloen and written by ML6 intern Timo Martens.
[1]: Hayden Faulkner et al. (2017). TenniSet: A Dataset for Dense Fine-Grained Event Recognition, Localisation and Description. https://github.com/HaydenFaulkner/Tennis
[2]: Christoph Feichtenhofer et al. (2019). SlowFast Networks for Video Recognition. https://arxiv.org/pdf/1812.03982v3.pdf
[3]: Google Research. (2021). MoViNets: Mobile Video Networks for Efficient Video Recognition. https://arxiv.org/pdf/2103.11511.pdf
[4]: https://github.com/tensorflow/models/tree/master/official/vision/beta/projects/movinet