Video compression has become an essential component of multimedia streaming. The convergence of digital entertainment onto streaming platforms has prompted the development of advanced video coding technologies capable of meeting the growing demand for higher-quality video content.
At BBC Research & Development, we are exploring how to apply machine learning (ML) to improve video compression techniques and how to interpret Convolutional Neural Networks (CNNs) to derive simplified and efficient implementations.
The perception of colour is important in many different circumstances. For example, our recent work on automatic colourisation using artificial intelligence (AI) has direct applications in the restoration of archived content. Video coding can also benefit from colour prediction (estimation): better compression rates can be achieved by exploiting the correlations between the luma (brightness) and chroma (colour information) components of video frames.
What we are doing
Our objective is to improve chroma intra-prediction using ML. Intra-prediction exploits redundancies within a video frame by predicting the content of specific areas from the neighbouring pixels. The size of the bitstream (stream of data) can be reduced, and better compression rates achieved, by transmitting the differences between the prediction and the original frame rather than the original frames themselves. A colour frame is usually represented by three components: the luma and two chroma channels. Typically, video coding schemes first process the luma component and then use the compressed information, plus the neighbouring chroma pixels, to compress the chroma components of the desired area.
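The idea of transmitting only the differences can be sketched in a few lines of Python. The blocks and the predictor here are toy values, not the codec's actual prediction modes:

```python
# Toy illustration of intra-prediction residual coding.
# 'original' is the 4x4 block to compress; 'prediction' is what the
# decoder can already build from neighbouring, previously decoded pixels.
original   = [[52, 55, 61, 66],
              [63, 59, 55, 90],
              [62, 59, 68, 113],
              [63, 58, 71, 122]]
prediction = [[50, 54, 60, 68],
              [60, 58, 56, 88],
              [60, 58, 66, 110],
              [62, 58, 70, 120]]

# Encoder: transmit only the (mostly small) residual.
residual = [[o - p for o, p in zip(orow, prow)]
            for orow, prow in zip(original, prediction)]

# Decoder: add the residual back onto its own prediction.
reconstructed = [[p + r for p, r in zip(prow, rrow)]
                 for prow, rrow in zip(prediction, residual)]

assert reconstructed == original  # lossless round trip
```

The better the prediction, the smaller the residual values, and the fewer bits they cost to transmit.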
Recently, researchers introduced the Cross-Component Linear Model (CCLM), which applies linear regression to predict the chroma from the luma. However, better predictions can be obtained with more sophisticated ML techniques. Existing models based on CNNs have provided significant improvements, but with two main drawbacks: increased system complexity and a lack of control over which neighbouring pixels are used to predict a single chroma sample. We improved the existing approaches by introducing a novel methodology based on attention models and by simplifying the most complex parts of the prediction network. These mechanisms are trained to decide which neighbouring samples contribute best to the prediction of each chroma pixel. So for each prediction position, our model learns (from all the possible surrounding pixels) to attend, or focus on, the most informative ones. For example, as shown in the video below, the grey/blue samples on the boundary have more weight in the bottom-left area, whilst the brown samples contribute more to the top-right area.
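A cross-component linear model of this kind can be sketched as a least-squares fit over the boundary samples (real codecs derive the two parameters more cheaply, so treat this as an illustration of the idea rather than the standardised derivation):

```python
def cclm_predict(boundary_luma, boundary_chroma, block_luma):
    """Sketch of a CCLM-style predictor: fit chroma ~ alpha * luma + beta
    on the reconstructed boundary samples, then apply that linear model
    to the block's luma samples to predict its chroma."""
    n = len(boundary_luma)
    mean_l = sum(boundary_luma) / n
    mean_c = sum(boundary_chroma) / n
    cov = sum((l - mean_l) * (c - mean_c)
              for l, c in zip(boundary_luma, boundary_chroma))
    var = sum((l - mean_l) ** 2 for l in boundary_luma)
    alpha = cov / var if var else 0.0
    beta = mean_c - alpha * mean_l
    return [alpha * l + beta for l in block_luma]

# Boundary pixels whose chroma happens to be exactly linear in luma:
luma_ref   = [100, 120, 140, 160]
chroma_ref = [60, 70, 80, 90]  # chroma = 0.5 * luma + 10
print(cclm_predict(luma_ref, chroma_ref, [110, 150]))  # → [65.0, 85.0]
```

When the luma–chroma relationship inside the block is not this cleanly linear, a single (alpha, beta) pair fits poorly, which is exactly where the learned approaches below gain their advantage.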
Although attention models like these have helped us to gain control over the reference samples and better understand the prediction process, the underlying neural networks are still very complex. So our work also focussed on the simplification of the network architecture to obtain a more compact and explainable model which requires less computational resources (fewer operations) to arrive at the predictions.
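A minimal sketch of how such an attention mechanism weights the reference samples for one prediction position: the scores below are hand-picked for illustration, whereas in the real network they are produced by trained layers.

```python
import math

def attention_mix(scores, boundary_values):
    """For one chroma position: turn per-sample scores over the boundary
    into softmax weights and return the weighted combination."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * v for w, v in zip(weights, boundary_values))

boundary = [30.0, 32.0, 90.0, 95.0]  # e.g. grey/blue then brown samples
# A position near the bottom-left attends mostly to the first samples:
scores_bottom_left = [4.0, 4.0, 0.0, 0.0]
# A position near the top-right attends mostly to the last samples:
scores_top_right = [0.0, 0.0, 4.0, 4.0]

print(attention_mix(scores_bottom_left, boundary))  # dominated by 30, 32
print(attention_mix(scores_top_right, boundary))    # dominated by 90, 95
```

Because the weights are explicit, we can inspect exactly which boundary samples drove each predicted pixel, which is what gives us the interpretability described above.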
Going deeper, our attention model is integrated into a hybrid neural network with three processing branches, as shown in the video below. The first branch (cross-component boundary branch) is a fully-connected network (FCN) that processes and encodes the colours on the boundary. We reduce the size of the FCN by using an autoencoder, a well-known deep learning technique that allows efficient data coding and compacts the input information.
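The compaction performed by such an autoencoder can be illustrated with a linear toy version in plain Python. The weights below are hand-picked orthonormal vectors standing in for trained ones, and the sizes (4 boundary samples down to a 2-value code) are illustrative only:

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Hand-picked orthonormal rows stand in for trained encoder weights;
# in the real branch these are learned from data.
encoder = [[0.5, 0.5, 0.5, 0.5],
           [0.5, 0.5, -0.5, -0.5]]
decoder = [[0.5, 0.5], [0.5, 0.5], [0.5, -0.5], [0.5, -0.5]]  # transpose

boundary = [1.5, 1.5, 0.5, 0.5]   # 4 boundary samples
code = matvec(encoder, boundary)  # compact 2-value representation
restored = matvec(decoder, code)

print(code)      # → [2.0, 1.0]
print(restored)  # → [1.5, 1.5, 0.5, 0.5]
```

Inputs that fit the patterns the encoder has captured survive the round trip almost unchanged, so the downstream branches can work with the much smaller code instead of the raw boundary.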
In parallel, the second branch (luma convolutional branch) analyses the spatial patterns in the luma component, aiming to recognise portions of objects contained within the area we wish to compress. Then, the attention model fuses the information from both branches, transferring the encoded boundary colours processed by the first branch to the luma patterns extracted by the second. Finally, the combined features are transformed into actual colours by a third convolutional branch (the prediction head). As in our approach to interpreting CNNs for video coding, the CNNs in the second and third branches can be simplified by removing the non-linear elements (which transform the outputs of each network layer). This allows us to derive how to compute the output of both branches without performing the numerous convolutions defined by the CNN layers, which significantly reduces the number of parameters of the original network and accelerates the prediction process by reducing the number of operations.
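The layer-merging idea can be illustrated with 1-D convolutions in plain Python (toy kernels, not the network's trained weights): once the non-linear activations are removed, two successive layers compose into a single equivalent kernel that can be precomputed offline.

```python
def correlate(x, k):
    """'Valid' 1-D cross-correlation: the basic CNN-layer operation
    (no padding, stride 1, and here no non-linear activation)."""
    m = len(k)
    return [sum(x[i + j] * k[j] for j in range(m))
            for i in range(len(x) - m + 1)]

def compose(k1, k2):
    """Merge two linear layers into one: the combined kernel is the
    full convolution of the two kernels."""
    out = [0.0] * (len(k1) + len(k2) - 1)
    for l, a in enumerate(k1):
        for j, b in enumerate(k2):
            out[l + j] += a * b
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
k1 = [1.0, -1.0]        # first linear layer
k2 = [0.5, 0.5, 0.5]    # second linear layer

two_passes = correlate(correlate(x, k1), k2)
one_pass = correlate(x, compose(k1, k2))
assert two_passes == one_pass  # identical output, fewer run-time operations
print(one_pass)
```

The merged kernel is computed once, ahead of time, so the encoder and decoder pay for a single pass over the data instead of one pass per layer.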
Our research aims to explain what ML is doing so that we can deploy it more reliably and reduce complexity. However, we also need to ensure that the simplifications do not impact the efficiency of the compression, so we evaluated this along with the encoding and decoding time of both the original hybrid CNN and our simplified approach. Our tests reveal that our attention model improves compression performance, and that the simplification significantly reduces the processing time while retaining the coding benefits of the original attention mechanism. While our work addressed complexity reduction by modifying the network architecture, further simplifications can be obtained during the deployment process. We aim to look at hardware-aware implementations to integrate our system into future video codec solutions.
This post is part of the Distribution Core Technologies section