Jules Talloen
Machine Learning Engineer
Tennis is played all over the world, and single-camera video streams of matches are broadly available. All these streams contain information that could be turned into structured stats about games and players. Watching a tennis game, you could automatically enrich the view with the number of serves through the middle or into the corners, how deep the ball is played, or a player's preference for left or right depending on where the opponent is positioned... The list of insights you can extract by converting these video streams into structured data is huge.
Before extracting any stats, the analysis of these video streams needs to be broken down into several tasks:
This idea formed the base of my internship at ML6. Many thanks to the team for their help and especially to Jules for being a great mentor. For my internship, we decided to start with the detection of the court lines.
Every tennis fan knows that the Hawk-Eye system is capable of doing line (and ball) detection with very high precision, using six or more high-speed cameras filming the court from different angles.
We, however, want to perform the task with the standard video stream of one fixed camera and be able to make near real-time predictions.
It is not the first time somebody has tried to detect a sports field. Before deciding on our approach, we had a look at the following related work.
In the earlier days, the task was mostly tackled with a symbolic AI approach following a similar pattern to detect the boundaries of the tennis court. First, some technique is used to extract the lines from the frame; Hough line transformations are a great fit for this purpose. A detailed explanation of Hough transformations and their use in OpenCV can be found here. The next step is to find out which of these lines are the outer court lines. From the intersections of 4 of the found lines, you can determine the transformation matrix (also called the homography) that projects a reference tennis court onto those 4 line intersections in the frame. Using this matrix, you can project the reference court onto the frame and count the overlaps between the line pixels in the frame and the projected reference court lines. By repeating this process for all candidate sets of line intersections, we can find the best overlap, which theoretically corresponds to the outer court lines.
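The core geometry of that loop can be sketched in a few lines of NumPy (an illustrative sketch, not the original implementation; function names are our own):

```python
import numpy as np

def line_intersection(seg1, seg2):
    """Intersect two lines given as (x1, y1, x2, y2) segments, using
    homogeneous coordinates: the line through two points is their cross
    product, and so is the intersection point of two lines."""
    def as_line(x1, y1, x2, y2):
        return np.cross([x1, y1, 1.0], [x2, y2, 1.0])
    p = np.cross(as_line(*seg1), as_line(*seg2))
    if abs(p[2]) < 1e-9:          # (near-)parallel lines: no intersection
        return None
    return p[:2] / p[2]

def homography_from_4_points(src, dst):
    """Direct Linear Transform for exactly 4 correspondences: maps
    reference-court corners (src) onto 4 line intersections in the
    frame (dst). src, dst: sequences of 4 (x, y) pairs."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)
```

In the real pipeline, the candidate segments come from a Hough transform (e.g. `cv2.HoughLinesP`), and every candidate homography is scored by projecting the reference court into the frame and counting overlapping line pixels.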
One can also combine a lightweight CNN with a Hough layer, which is in detail described in this paper. Similar work can be found in LSDNet where a classical Line Segment Detector is combined with a lightweight CNN.
Other sports often have moving cameras, only showing part of the field. Defining a key point grid on the field has proven to give good results for this particular use case. This paper details this approach.
Let’s first see how close the symbolic approach can get to the goal of near real-time line detection on different courts. Secondly, let’s investigate whether transfer learning on a model pretrained for classification can learn to predict the key points at the court line intersections. To conclude, we see if performance can be further improved by combining the two approaches.
Before diving into transfer learning, we had a look at how far we could get with a symbolic AI approach.
As described above, the high level steps are:
The preprocessing step is very important, as it determines how well the Hough transformation can extract the lines.
Starting from a well-preprocessed image, the results are very accurate:
While giving good results for some courts, it is difficult to generalize this approach for all types of courts. Another downside is that the iterative process of determining the homography, creating the warp perspective and calculating the hits and misses is slow and resource-intensive.
The basic idea is to take a relatively simple existing Computer Vision backbone for classification and put a custom head on it to determine 16 key points at the line intersections of the tennis court.
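A minimal Keras sketch of such a setup (the exact head layers and hyperparameters are our own assumptions, not the ones from the experiments; pass `weights="imagenet"` for actual transfer learning):

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_KEYPOINTS = 16  # court line intersections

def build_keypoint_model(input_shape=(224, 224, 3), weights=None):
    """ResNet50V2 backbone with a small regression head that outputs
    (x, y) coordinates, normalized to [0, 1], for each key point."""
    backbone = tf.keras.applications.ResNet50V2(
        include_top=False, weights=weights, input_shape=input_shape)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(backbone.output)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(NUM_KEYPOINTS * 2, activation="sigmoid")(x)
    model = tf.keras.Model(backbone.input, out)
    model.compile(optimizer="adam", loss="mae")  # MAE gave the best results in our tests
    return model
```

Swapping the backbone for another Keras application (e.g. MobileNetV3Small, as discussed below) only changes the first line of the function body.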
We start with a ResNet50v2 backbone (Keras application) and experiment with a full CNN head and a Fully Connected Network head. The following parameters provide the best results with a ResNet50v2 backbone:
The predictions (blue dots) are still quite off but we learned that
To increase our dataset, we use the OpenCV Annotation Tool, which has a great interface to label data and export it to, among others, COCO key point format. A bonus is that it also supports interpolation, so you only have to label the key frames.
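For reference, COCO stores each key point as an (x, y, visibility) triplet in one flat list per annotation; a small helper to unpack them might look like this (a sketch; the function name is our own):

```python
import numpy as np

def unpack_coco_keypoints(annotation):
    """Convert a COCO-style 'keypoints' list [x1, y1, v1, x2, y2, v2, ...]
    into an (N, 2) array of coordinates and an (N,) visibility array
    (v = 0: not labeled, 1: labeled but occluded, 2: visible)."""
    kp = np.asarray(annotation["keypoints"], dtype=float).reshape(-1, 3)
    return kp[:, :2], kp[:, 2].astype(int)
```

Points with `v == 0` can then be masked out of the regression loss, e.g. by training only on `xy[vis > 0]`.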
We continued training and tuning with different backbones, keeping near real-time inference speed in mind, and the results improved considerably.
We still see that the predictions (blue dots) are a bit off target. In order to look for patterns in the learned model, we project the reference court (black lines) with the homography of the outer predicted key points.
This shows that the trained model finds the relations between the line intersections with very high accuracy (the predicted blue dots are exactly on the intersections of the black lines). Are we maybe putting too much emphasis on the relationship between the points in our model?
So we try to predict only the 4 outer key points of the court. To cut a long story short, the models trained on 4 key points produced very similar results to those trained on 16.
During training and experiments, a few other backbones were tried as well, and MobileNetv3Small in particular performed much better than the others.
So, switching back to the 16-key-point output, similar improvements were to be expected with the new MobileNetv3Small backbone, and indeed the Mean Pixel Error dropped by more than 50% compared to the previously best results with EfficientNetV2Small.
This is a pretty good result, but we wanted to improve further and decided to post-process the model's predictions using a variation of the symbolic approach from the previous section.
The idea is to take a rectangular area around every predicted key point of the original image and determine in every rectangle the lines and their intersection point.
If we crop a standard area around the predicted key points (red crosses), the zoomed in starting point for the post-processing looks like this:
You can see that determining the line intersection can be complicated by players occluding the point (15) or by unclear lines (5). It will also be hard to work with the key points at the net (8, 9).
The line intersection detection algorithm is as described in the symbolic AI section:
This results in the following improvements:
Or in the complete image:
Looking at the before and after post-processing images, the performance is clearly better. Sometimes, however, post-processing can be a bit off because the lines are not accurately extracted. Comparing the mean absolute error now shows very similar results with or without post-processing. This can probably be improved by an additional verification step that only takes post-processing into account when the calculated intersection lies on a white pixel or matches a small key-point-specific filter (for example, an L-shaped filter for the lower-left baseline corner). This last verification step has not been tested yet.
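That untested verification step could be sketched as follows (our own guess at how it might look, assuming a binary line mask; the L-shaped filter variant is omitted):

```python
import numpy as np

def verified_keypoint(pred, refined, line_mask):
    """Keep the post-processed point only if it lands on a court-line
    (white) pixel in the binary mask; otherwise fall back to the
    model's original prediction."""
    if refined is None:
        return pred
    x, y = int(round(refined[0])), int(round(refined[1]))
    h, w = line_mask.shape
    if 0 <= x < w and 0 <= y < h and line_mask[y, x]:
        return refined
    return pred
```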
Performance was good, what about inference speed?
On an 8-core Intel i9 2.3 GHz processor with an NVIDIA T2000 4 GB GPU, the AI model predicted roughly 100 frames/sec. When including the post-processing, the inference speed dropped to 50 frames/sec. These figures are just indicative, without any effort spent on speed optimization.
These are promising results in both performance and inference speed.
We can conclude that our ML model is more robust and faster than our symbolic model.
For the ML model, it is vital to find a backbone with a good balance between performance and inference speed; in our case, MobileNetv3Small. Across all tested backbones, full-CNN variants were the best heads, and MAE as the loss function produced the best performance. Data augmentation was crucial, but make sure that the majority of transformations still make sense for a tennis court. Post-processing can further increase the performance, as seen in some samples, but you need to make sure it generalizes well.
With a little extra investment in post-processing, some temporal smoothing (there will only be slight camera movement between frames) and some love for inference speed optimization, this approach is a good candidate for near real-time court line detection... A first step in automating the stats of a tennis game.
This article is written by Bart Timmermans as part of his internship at ML6 and supervised by Jules Talloen.