When a video is broadcast or streamed, we don't send the actual frames but encoded data describing how to synthesise each frame on the receiving device. For efficiency, we want to represent the video using as little information as possible, which requires complex algorithms. These algorithms break each image into small blocks. The size and number of these blocks need to be optimal: not too big, so that critical detail is retained; not too small, so that redundant information is avoided.
Deciding how to split up each image involves testing all the possible combinations and choosing the optimal one. In some modern video codecs, this can mean testing up to 42 million combinations for a single HD frame! As you can imagine, this is very time-consuming and would not be viable for encoding long high-definition TV shows.
Fast and efficient video compression is vital for the BBC, so here at Research & Development, we are working on optimising the process. One approach to tackling this problem is to use ideas from the field of 'machine learning' (ML). This term covers many statistical algorithms that can 'learn' by detecting patterns in data. Much as a child learns by example: show the algorithm an apple and tell it 'this is an apple', and the next time it encounters one it is more likely to recognise it.
We can use ML for video compression to spot correlations in the image data and find criteria that determine whether a block of pixels should be split up or not. It does so by working out patterns and rules, for example, 'if a block contains lots of detail, consider splitting it up into smaller blocks for encoding'. This saves time by skipping redundant calculations for blocks with less detail.
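A rule of that kind can be sketched in a few lines. This is a toy illustration, not the codec's actual criterion: here 'detail' is stood in for by pixel variance, and the threshold is an arbitrary assumed value.

```python
import numpy as np

def should_split(block: np.ndarray, detail_threshold: float = 100.0) -> bool:
    """Toy rule: a block with high pixel variance (lots of detail)
    is a candidate for splitting into smaller blocks."""
    return float(np.var(block)) > detail_threshold

flat = np.full((16, 16), 128.0)               # uniform block: no detail
detailed = np.arange(256.0).reshape(16, 16)   # strong gradient: high variance
```

Here `should_split(flat)` is false while `should_split(detailed)` is true; the point of the ML approach is to learn such thresholds from data rather than guess them.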
To train these algorithms, we need to collect large amounts of data first. Luckily in video coding, we have plenty of data. In our research, we encoded a range of video footage with differing resolutions and different types of content. We extracted as much information about block splitting as possible.
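Gathering that data amounts to pairing features of each block with the split decision the full encoder actually made. The feature choices below (variance and mean gradient magnitudes) and the labels are purely illustrative assumptions, not the features used in our research:

```python
import numpy as np

def block_features(block: np.ndarray) -> list[float]:
    # Illustrative features an encoder might already have to hand:
    # pixel variance and mean horizontal/vertical gradient magnitude.
    var = float(np.var(block))
    gx = float(np.mean(np.abs(np.diff(block, axis=1))))
    gy = float(np.mean(np.abs(np.diff(block, axis=0))))
    return [var, gx, gy]

# Build a (features, label) dataset from blocks whose split decision
# is known from a full encoder run (labels here are made up).
rng = np.random.default_rng(0)
blocks = [rng.integers(0, 256, (16, 16)).astype(float) for _ in range(8)]
labels = [1, 0, 1, 0, 1, 0, 1, 0]   # 1 = encoder chose to split
X = np.array([block_features(b) for b in blocks])
y = np.array(labels)
```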
When using ML algorithms, it is essential to keep the algorithm as simple as possible. We needed the ML optimisation to be fast, simple, and not more complex than the video compression process we were trying to avoid in the first place! Consequently, we used simple 'decision tree' (DT) algorithms. These are far easier to interpret than many 'deep learning' approaches, and the trained models are easy to implement inside the video codec. Instead of adding complex convolutions and other neural network feature extractors, we use several parameters that are already computed within a video codec (for and around a given block of pixels).
The DTs were given lots of examples of coding units and told whether each was split up or not. Using this information, the algorithm forms a tree of binary decisions, sorting the coding units into categories. Once the trees were 'trained' on known data, the algorithm could then estimate whether a new block of pixels that it had not seen before was likely to be split up or not, depending on its characteristics.
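The core of that training step can be sketched with a single-level tree (a 'decision stump'): for every feature and threshold, count misclassifications and keep the best binary test. A full decision tree simply repeats this recursively on each side of the split. The data below is invented for illustration:

```python
import numpy as np

def train_stump(X, y):
    """Learn one binary decision ('if feature f > t, predict split')
    by minimising misclassifications over every feature/threshold pair.
    A real decision tree applies this recursively to grow the tree."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            pred = (X[:, f] > t).astype(int)
            for cand in (pred, 1 - pred):
                err = int(np.sum(cand != y))
                if best is None or err < best[0]:
                    best = (err, f, t, cand is not pred)
    return best[1], best[2], best[3]   # feature index, threshold, flipped?

def predict_stump(stump, x):
    f, t, flipped = stump
    p = int(x[f] > t)
    return 1 - p if flipped else p

# Toy data: one feature (block variance); high variance -> split (label 1).
X = np.array([[5.0], [8.0], [300.0], [450.0]])
y = np.array([0, 0, 1, 1])
stump = train_stump(X, y)
```

On this data the stump learns 'if variance > 8, split', so `predict_stump(stump, np.array([400.0]))` returns 1 and `predict_stump(stump, np.array([6.0]))` returns 0.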
After training the models, criteria can be extracted from them in the form of very simple 'if' statements, for example, 'if X then do Y'. We wrote these criteria into our open-source HEVC Turing codec: the encoder checks the ML criteria before performing the long testing process, which can therefore sometimes be skipped, saving time and energy.
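The shape of that integration looks roughly like the sketch below. The function names, the threshold value, and the stand-in rate-distortion search are all hypothetical; it only shows how an extracted rule gates the expensive test:

```python
# Illustrative value "learned" by a decision tree (not a real Turing
# codec parameter): low-variance blocks are confidently left unsplit.
VARIANCE_THRESHOLD = 8.0

def rd_search_split(variance: float) -> str:
    # Stand-in for the full combinatorial split search the encoder
    # would otherwise always run.
    return "split" if variance > 200.0 else "no-split"

def encode_block(variance: float) -> str:
    # Extracted DT criterion as a plain 'if': when it fires, the
    # exhaustive rate-distortion search is skipped entirely.
    if variance <= VARIANCE_THRESHOLD:
        return "no-split (fast path, RD search skipped)"
    return rd_search_split(variance)
```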
We defined two novel metrics for a trade-off between accuracy and speed, allowing the DT tool to be configurable and applicable to more problems in the video coding field. Putting these rules 'learned' by the decision tree into the codec sped up the encoding process by over 40% on average with minimal difference to the video quality! More information about the method and results is in our paper presented at the IEEE International Conference on Image Processing in September 2019.
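The paper's actual metrics aren't reproduced here, but the general idea of a configurable trade-off can be sketched: a confidence threshold controls how often the fast path fires (speed) against how often skipping was the right call (accuracy). All numbers below are invented:

```python
def trade_off(confidences, correct, threshold):
    """Toy trade-off: raising the threshold makes the fast path fire
    less often (less speed-up) but more reliably (higher accuracy)."""
    fired = [c >= threshold for c in confidences]
    speedup = sum(fired) / len(fired)                  # fraction skipped
    acc = (sum(f and ok for f, ok in zip(fired, correct))
           / max(1, sum(fired)))                       # accuracy when fired
    return speedup, acc

conf = [0.95, 0.9, 0.6, 0.55, 0.3]   # model confidence per block (made up)
ok   = [True, True, False, True, False]  # was skipping actually correct?
```

With a strict threshold of 0.8, only 40% of blocks take the fast path but every skip is correct; loosening it to 0.5 skips 80% of blocks at 75% accuracy.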
Following these results, we have also demonstrated the potential to apply this algorithm within AV1.
As we have seen, by combining video coding and machine learning, the encoding process can be carried out much faster while maintaining the same visual quality and data efficiency. Our proposed DT-based training algorithm can be reused for various encoder types and applications. It will adapt models according to carefully selected training data and enable quick optimisation choices for any given use - supporting our ultimate goal of bringing the audience higher quality and more immersive experiences.