Posted by Max Leonard

For our vision of an object-based future to become reality, we will need new tools, languages and techniques to allow us to author, edit and share the content we create. We have already made significant inroads into what the infrastructure and lower-level identity models will look like. As these specifications make their way into standards, products and, eventually, business as usual, we are exploring new ways to create content and describe compositions in the emerging world of object-based IP production.

One of the powerful arguments for delivering object-based (as opposed to linear) media is the potential for having content adapt to the environment in which it is being shown. This allows us to give the viewer or listener the best possible experience for their physical surroundings, device capability and personal needs. This has been standard practice on the web for many years, with most modern websites able to adapt to the wide variety of devices that may be used to view them by varying layouts, font sizes and levels of user-interface complexity. With such a wide array of well-used and established web technology to choose from, we expect a sizeable portion of both the craft and consumer applications of the future to be based on HTML, CSS and JavaScript, which holds obvious benefits for everyone as support for ever more powerful in-browser video and audio processing makes its way into the web standards.

BBC R&D - Object-Based Media

While these now-ubiquitous technologies are becoming ever more powerful, their massive flexibility can be a source of great frustration to the developers charged with maintaining the applications and experiences built with them, because there is a plethora of different approaches available to accomplish any one task. This flexibility makes repeatability and consistency of approach among production teams extremely difficult to maintain, especially when combined with the rapid evolution of these technologies and the sheer number of avenues one can take when creating object-based media compositions. This has been the case in many of our web-based object-based media experiments (such as Forecaster, Responsive Radio and CAKE), which have relied on the HTML/CSS/JS holy trinity to make the experience possible, but have taken different approaches to accessing, describing and combining the media. As a result, content from one experience is fundamentally incompatible with another at any level more meaningful than loading and playing back the raw assets.
The only way we can practise an object-based approach to broadcasting in a sustainable and scalable way, at the same level of quality expected of us in our linear programming, is to create a standard mechanism to describe these object-based compositions, including the sequences of media and the rendering pipelines that process these sequences on the client devices. The crux of the problem, as with any standard, is finding the sweet spot between being well-defined enough to be useful and free enough to allow for creative innovation.
There are as many media composition formats in existence as there are video editors, but a common shortcoming of many of these formats is the assumption that the media is located in files on a disk. Real-time collaborative editing, which is becoming standard practice for the creation of text documents, is also the exception rather than the norm in media production contexts.


Over the past 6-12 months, we have been implementing some of our ideas for a new media composition protocol to try to get a handle on this problem.
The result is UMCP (Universal Media Composition Protocol - don't get too attached, it's a working title), which allows for the description of media sequence timelines, processing pipelines and control parameters.
By using UMCP in conjunction with our VideoContext JavaScript library, whole production processing pipelines and their control parameters can be described, controlled and recreated in the playback clients, whether that be in the studio or on the devices of our viewers and listeners.
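To give a flavour of what such a description might contain, here is a minimal sketch in TypeScript. The real UMCP schema is not reproduced here, so every name below (`Clip`, `ProcessingNode`, `Composition`, `activeSources`) is a hypothetical stand-in, and the sketch does not use the VideoContext API at all. It only shows the general idea of a timeline, a processing pipeline and its control parameters living together as plain data that any client can interpret.

```typescript
// Hypothetical shapes for a composition description; not the actual UMCP schema.
interface Clip {
  sourceId: string; // identifier of the media being referenced
  start: number;    // composition time at which the clip begins, in seconds
  end: number;      // composition time at which the clip ends, in seconds
}

interface ProcessingNode {
  type: string;                   // illustrative label, e.g. "crossfade"
  params: Record<string, number>; // control parameters for this node
}

interface Composition {
  timeline: Clip[];
  pipeline: ProcessingNode[];
}

// Return the identifiers of every clip that should be playing at time t.
// The interval is half-open: a clip is active from start up to, but not including, end.
function activeSources(composition: Composition, t: number): string[] {
  return composition.timeline
    .filter((clip) => clip.start <= t && t < clip.end)
    .map((clip) => clip.sourceId);
}

// An illustrative composition: two cameras that overlap during a two-second crossfade.
const exampleComposition: Composition = {
  timeline: [
    { sourceId: "camera-1", start: 0, end: 12 },
    { sourceId: "camera-2", start: 10, end: 30 },
  ],
  pipeline: [{ type: "crossfade", params: { duration: 2 } }],
};
```

Because the description is only data, a studio tool and a viewer's device could both call something like `activeSources(exampleComposition, 11)` and arrive at the same answer, each then rendering it in whatever way its hardware allows.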
We have taken a cue from Operational Transforms, a technique for supporting multiple users working on a single task, which powers popular applications such as Google Docs and Etherpad and which, with a bit of domain-specific adaptation, can be put to work in the arena of media production.
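For readers who have not met the idea before, the following is a minimal sketch of an operational transform applied to a timeline, assuming the timeline is nothing more than an ordered list of clip identifiers. It is a generic textbook-style list transform, not the adaptation used in UMCP, and the names `Op`, `applyOp` and `transformOp` are invented for illustration. The `site` number is an assumed per-user identifier used only to break ties when two users insert at the same position.

```typescript
// Hypothetical edit operations on a timeline modelled as an ordered list of clip IDs.
type Op =
  | { kind: "insert"; index: number; clip: string; site: number }
  | { kind: "remove"; index: number; site: number }
  | { kind: "noop"; site: number };

// Apply an operation to a timeline, returning a new timeline.
function applyOp(timeline: string[], op: Op): string[] {
  const next = [...timeline];
  if (op.kind === "insert") next.splice(op.index, 0, op.clip);
  if (op.kind === "remove") next.splice(op.index, 1);
  return next;
}

// Rewrite operation `a` so that it can be applied after the concurrent operation `b`.
function transformOp(a: Op, b: Op): Op {
  if (a.kind === "noop" || b.kind === "noop") return a;
  if (b.kind === "insert") {
    // `a` is unaffected if it targets an earlier position, or if both are
    // inserts at the same position and `a` wins the tie-break on site number.
    const aFirst =
      a.index < b.index ||
      (a.kind === "insert" && a.index === b.index && a.site < b.site);
    return aFirst ? a : { ...a, index: a.index + 1 };
  }
  // From here on, `b` is a remove.
  if (a.kind === "insert") {
    return a.index <= b.index ? a : { ...a, index: a.index - 1 };
  }
  // Both users removed the same clip: the second removal has nothing left to do.
  if (a.index === b.index) return { kind: "noop", site: a.site };
  return a.index < b.index ? a : { ...a, index: a.index - 1 };
}

// Two users edit the same timeline at the same moment.
const sharedTimeline = ["intro", "interview", "outro"];
const directorOp: Op = { kind: "insert", index: 1, clip: "titles", site: 1 };
const assistantOp: Op = { kind: "remove", index: 1, site: 2 };

// Each side applies its own edit first, then the transformed edit from the other side.
const directorView = applyOp(applyOp(sharedTimeline, directorOp), transformOp(assistantOp, directorOp));
const assistantView = applyOp(applyOp(sharedTimeline, assistantOp), transformOp(directorOp, assistantOp));
// Both views end up as ["intro", "titles", "outro"].
```

The property that matters is convergence: whichever order the two edits arrive in, every participant ends up looking at the same timeline without anyone having to lock it.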

R&D at the Edinburgh Festivals 

To validate some of these ideas, we took PRIMER, a prototype Nearly Live production application built by the R&D UX team to run on an NMOS cluster, and used UMCP to record the PRIMER session and the edit decisions of the director, communicating these back to a central API and store. These edit decisions were then reflected out to a lightweight external viewer client for monitoring purposes, and also to another rendering server to produce a high-quality baseband HD video version of the edit, which could serve as the input to a traditional broadcast pipeline.
The key point here is that the exact same session description metadata is sent to every device, regardless of its capabilities, and each device can in turn render the experience in a way that suits it, either live, as the director makes the cuts, or at an arbitrary time later on.
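One way to picture why the same metadata serves both live and later playback is to treat the session as a log of timestamped edit decisions. The sketch below makes that assumption; the `Cut` shape and the `sourceOnAir` function are hypothetical and are not taken from PRIMER or UMCP.

```typescript
// Hypothetical record of a single edit decision made by the director.
interface Cut {
  at: number;       // session time of the cut, in seconds
  sourceId: string; // identifier of the source that goes on air at that moment
}

// Work out which source is on air at session time t.
// Returns undefined if t falls before the first cut.
function sourceOnAir(cuts: Cut[], t: number): string | undefined {
  let current: string | undefined;
  // Sort a copy so that decisions arriving out of order are still replayed correctly.
  for (const cut of [...cuts].sort((x, y) => x.at - y.at)) {
    if (cut.at > t) break;
    current = cut.sourceId;
  }
  return current;
}
```

A live monitoring client would call this with the current session time as each decision arrives, while a renderer producing a version after the event could step through any range of times it likes, using exactly the same list of cuts.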
At the capture side, the raw UHD video is captured, encoded and saved to the media store, from which it is delivered to clients on request in the format relevant to their needs. The live production edit systems are able to grab the low-bandwidth proxies for thumbnail video views, enabling the low-latency in-browser display required for this application, and the high-quality renderer is able to pull the full-quality feeds for its high-bandwidth uncompressed output.
The key enabler here is the NMOS content model, which allows us to easily refer to media by a single identifier, irrespective of its actual resolution, bit-rate or encoding scheme.
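As a rough illustration of what that separation buys us, the sketch below resolves one identifier to whichever rendition best suits the client asking for it. It does not use the NMOS APIs; the `Rendition` and `DeviceProfile` shapes, the catalogue and the selection rule (highest bit-rate that fits the device) are all assumptions made for the sake of the example.

```typescript
// Hypothetical description of one encoded version of a piece of media.
interface Rendition {
  label: string;
  height: number;      // vertical resolution in pixels
  bitrateKbps: number; // approximate bit-rate in kilobits per second
}

// Hypothetical limits of the client asking for the media.
interface DeviceProfile {
  maxHeight: number;
  maxBitrateKbps: number;
}

// Pick the highest bit-rate rendition of a source that the device can handle.
// Returns undefined if the source is unknown or nothing fits the device.
function chooseRendition(
  catalogue: Map<string, Rendition[]>,
  sourceId: string,
  device: DeviceProfile,
): Rendition | undefined {
  const candidates = (catalogue.get(sourceId) ?? []).filter(
    (r) => r.height <= device.maxHeight && r.bitrateKbps <= device.maxBitrateKbps,
  );
  return candidates.reduce<Rendition | undefined>(
    (best, r) => (best === undefined || r.bitrateKbps > best.bitrateKbps ? r : best),
    undefined,
  );
}
```

The composition only ever mentions the identifier, so an in-browser edit client with a modest profile would be handed a proxy while a rendering server with generous limits would be handed the full-quality version, with no change to the composition itself.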
Although this particular trial was focused on the production suite, it doesn't have to stop there. In fact, one of the substantial benefits of working in this way would be to allow us to author experiences once, for all devices, and deliver the composition session data to all platforms, allowing the devices themselves to choose which raw assets they need to create the experience for themselves: low bit-rate for mobile, high resolution for desktop, 360 for VR headsets.
This would allow the production team to serve potentially hundreds of different types of devices, regardless of connection or hardware capability, without having to do the laborious work of rendering a separate version for each one. It is a very tantalising prospect indeed.