
Implementing startOffsetTime for HTML5


Sean O'Halpin | 14:45 UK time, Tuesday, 31 January 2012

In this first of two technical blog posts on our recent work on P2P Next, we explain why and how we implemented the HTML5 media element attribute startOffsetTime in Firefox to enable accurate synchronisation of out-of-band timestamped metadata with media streamed live to a browser over the internet.

First we'll explain our rationale for why we need the functionality this attribute makes possible. Then, we'll go into some technical detail as to how Firefox and Chrome currently interpret the specification. We'll explain how we built a proof of concept implementation in Firefox. Finally we'll state our position on the interpretation of the specification and highlight some challenges in getting this implemented more generally.

The reason for setting this out is that we'd like to see consistent support for startOffsetTime across all commonly used codecs and for browser vendors to bring their implementations into line with the published HTML5 media elements specification. There are ambiguities in the specification itself, such as the interpretation of 'earliest seekable position', which could be clarified, especially with respect to continuous live streaming media. Browser vendors need to agree on a common interpretation of attributes such as currentTime so others can experiment with the exciting possibilities this new technology is opening up.


One of the recurring themes of our work over the past few years has been synchronising web-based media with audio/video.

We've tried various techniques, each with their own pros and cons. Visualising radio, for example, relied on the fact that audio streamed live on the web is delayed relative to live FM by about 30 seconds. The solution we implemented was quite rough and ready and didn't take variable network latency into account - not everyone will experience the same amount of delay and buffering introduces even more unpredictability. We compensated by designing the experience not to depend on accurate synchronisation. Still, we knew we could do better.

For our Autumnwatch trial, we had a person manually trigger the synchronised events against a live broadcast. While this was pretty accurate, it's obviously costly in terms of people and doesn't address the fact that DVB and live streamed video are delayed with respect to live TV.

The Secret Fortune trial carried out last year used audio fingerprinting. This can be a reasonably accurate way of synchronising with live TV but has issues. In particular, it's costly to implement and does not handle network latency and buffering well.

Analogue broadcasts and DVB/DAB are reasonably predictable, though receiver decoding and onboard digital image processing introduce their own delays. A trickier problem to solve is variable network latency over the internet.

What do we mean by 'out-of-band metadata'?

In the context of media streams, 'out-of-band' data means any data not sent in the same data stream as the audio and video. For example, the Matroska container format allows you to embed subtitles in-band along with the audio and video data so you don't need any other files to view subtitles when you play the video. At the same time, it means the media file must contain subtitles for all the possible languages you might want to use and that adding or changing subtitles means re-encoding the file.

An example of out-of-band data would be the .SRT subtitle files you can load for a film in a media player like VLC. This data is not contained in the same file as the video. This makes it easier to change subtitles or add new languages - you just need to distribute the updated .SRT file on its own.

Why out-of-band timed metadata?

So why do we want to enable out-of-band timed metadata for live streaming? The primary reason is to support 'second screen' applications for live broadcasts viewed in a web browser.

One important potential application for broadcasters is for live events e.g. sports events such as the Olympic Games. Live video is also becoming more widely used by non-broadcasters with sites such as Ustream allowing anyone with a webcam to broadcast their own live video stream over the internet.

While there are other standards for interactivity alongside video (e.g. Popcorn on the web and Hybrid Broadcast Broadband TV for internet connected TVs) these haven't yet tackled the problem of true live synchronisation.

Some broadcasters have developed games that can be played on second screens alongside their more popular brands (e.g. Channel4's Million Pound Drop, and BBC R&D's trial with Secret Fortune) but these rely on the interactivity being triggered from a central source with all devices remaining in sync rather than the devices being synchronised to the video itself.

The BBC has a particular interest in live events, which are likely to be an area where national broadcasters maintain their unique role for some time to come. However, the BBC also has a keen interest in on-demand media, leading the UK market with the iPlayer, so it seems natural to look for ways to move seamlessly from one to the other. No such solutions currently exist.

What do we need to synchronise with live streams?

To synchronise with a live stream, we need to share a reference clock between the server-side and the client (browser).

Consider how audio and video are synchronised with each other when playing back a media stream. To simplify a little, each audio and video frame has a presentation timestamp in relation to a shared reference clock. On playback, a clock master, usually the sound card which provides a high resolution timer signal (e.g. 44.1 or 48 kHz), is used to calibrate the reference clock and drive the audio and video pipelines. The audio frames are synchronised to this clock master to provide continuous audio while the video frames are served up on a best effort basis (as dropping video frames is less jarring than choppy sound).
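As a rough illustration of that best-effort policy, here is a minimal C++ sketch. It is not taken from any real player: the `VideoFrame` type, the `decide` function and the 40 ms tolerance (one frame at 25 fps) are all assumptions made for the example.

```cpp
#include <cstdint>

// Hypothetical types for illustration; not taken from any real player.
enum class Action { Wait, Present, Drop };

struct VideoFrame {
  int64_t pts_ms;  // presentation timestamp on the shared reference clock
};

// Decide what to do with the next video frame, given the position of the
// clock master (here, the audio playback position in milliseconds).
// tolerance_ms is an assumed threshold, roughly one frame at 25 fps.
Action decide(const VideoFrame& frame, int64_t audio_clock_ms,
              int64_t tolerance_ms = 40) {
  const int64_t lateness = audio_clock_ms - frame.pts_ms;
  if (lateness < 0) return Action::Wait;                 // frame is early
  if (lateness <= tolerance_ms) return Action::Present;  // close enough
  return Action::Drop;  // too late: skip it rather than stall the audio
}
```

The audio is never held back to wait for video; only the video side waits or drops frames.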

When metadata such as subtitles are embedded in an AV container like Matroska or Ogg (using libkate), these metadata are treated in much the same way. They too have presentation timestamps, usually keyed to specific video frames. The main difference with AV frames is that subtitles are discrete events with durations that do not form a continuous timeline. However, they share the same reference clock and are driven by the same clock master.

Why can't we use NTP?

NTP (Network Time Protocol) is a widely used protocol to synchronise a computer's system clock with a remote clock. At first glance, it might seem that this is all we need: make sure the client clock is synchronised to the same reference clock as the server clock and Bob's your uncle, everything is synchronised. Unfortunately, this isn't the case for a number of reasons.

The essential problem is that the server clock is not the same as the stream clock. Even if the server and the stream are synchronised on the server-side (an issue in itself), network latency and buffering will cause unpredictable delays so that by the time the stream reaches the client, it will no longer be in sync with the remote clock. This problem exists for any remote external master clock. Another more subtle problem is that NTP does not guarantee monotonic time - it will occasionally adjust the clock backwards, which would result in stuttering video playback.

Issues with live synchronisation

For on-demand video it's relatively easy to synchronise interactivity as the current playback time is defined in terms of an offset from the start of the media and the client can receive the whole package of timed events in one go.

For live streaming video there are a number of challenges to address:

  • A user can join the stream at any time so they won't have received the history of events that have already taken place which may be critical to the display of the interactive media at that point
  • They may have connected to one of many stream servers which has started streaming at any time in the past
  • Most of the events you want to synchronise with the live media stream may not have happened by the time the user joins the stream so you cannot provide them up front (though you may want to provide expected events such as programme changes in advance)
  • In the majority of cases users will join part way through a stream, as there will nearly always be a back history of events of some kind (e.g. social media commentary on a live programme may begin long before the broadcast itself)
  • For live streaming there is a similar issue when a user hits pause: a client device can be configured to record all events during the time a media stream is paused, but unless it has a way to resynchronise to the same clock used by that media stream it will have no way to resynchronise those events
  • Furthermore, in a production environment, media streams will usually be served by multiple streaming servers which will have been started at different times. So each stream will need its own clock reference

HTML5 timeline

Current implementations of HTML5 media elements on browsers like Firefox and Chrome expose the media timer (via the timeupdate event) so you can attach an event handler and use that to synchronise external timed metadata. The origin of this timeline, in all existing implementations, is time zero. This is fine for discrete fixed size or on-demand media as such media have a definite origin and duration, i.e. they start at zero time and all subsequent times are relative to that start time. This makes it fairly straightforward to synchronise external timed metadata. We can simply stamp the metadata with a time relative to the start of the media, then fire the event at the right time in the browser.

With live streaming media, things are not so obvious. What do we consider to be the zero time of the media? Is it the time we started streaming the media (this is Chrome's current interpretation)? Or is it the time the browser joined the stream (this is Firefox's)? Note that these are both essentially arbitrary times as a stream can be started at any time (due to failover or restarting, etc.) and a browser can join at any time. In either case, how do we convert the relative time into the stream into a time we can stamp on external metadata?

We need to know the time corresponding to the first frame served on that specific stream. Note that there can be more than one streaming server serving the same media for failover or load balancing purposes. So we need a separate time origin for each (they won't all start at exactly the same time).
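To see why each stream needs its own origin, consider this small C++ sketch. The `Stream` type, the `to_media_time_ms` function and the sample timestamps are hypothetical and chosen only for illustration. The same absolutely-timestamped event lands at a different media-relative position on each stream.

```cpp
#include <cstdint>

// Hypothetical description of one streaming server's output.
struct Stream {
  int64_t origin_ms;  // wallclock time (Unix ms) of the first frame served
};

// Position of an absolutely-timestamped event on this stream's own timeline,
// in milliseconds from its first frame. A negative result means the event
// predates the start of this particular stream.
int64_t to_media_time_ms(const Stream& s, int64_t event_utc_ms) {
  return event_utc_ms - s.origin_ms;
}
```

For example, if server A started at 1327934700000 and server B started 60 seconds later, an event stamped 1327934790000 falls 90 seconds into A's stream but only 30 seconds into B's.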

If we use the system's wallclock time as the reference clock, we can propagate the server time to the client and synchronise events based on the server's clock. This works where we want to synchronise with the server clock but that is a limited use case. More generally, we want to derive the clock reference from the input media (such as a DVB programme reference clock) and share that with the output stream and the metadata. This enables us to synchronise pre-prepared timed events along with timed metadata generated at transmission time (for example in the studio gallery) with the live stream.

The HTML5 media elements specification specifically addresses our use case in the shape of the startOffsetTime attribute.

The startOffsetTime attribute must return a new Date object representing the current timeline offset.

which is defined as:

Some video files also have an explicit date and time corresponding to the zero time in the media timeline, known as the timeline offset.

This sounds great - just what we need. Unfortunately, there's just one snag: of the open source browsers, none of Firefox, Chrome or WebKit has implemented it yet.
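With a timeline offset available, a client could work out its absolute playback position as the offset plus the current playback time, and deliver any events stamped at or before that position. The following C++ sketch shows only that logic and is not browser code. The `Event` and `Dispatcher` names are hypothetical, and the two arguments to `due` stand in for the values a browser would expose through startOffsetTime and currentTime.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical timed-metadata event, stamped with an absolute time.
struct Event {
  int64_t utc_ms;       // Unix time in milliseconds
  std::string payload;
};

// Delivers events as playback reaches them. Assumes the events are sorted
// by utc_ms.
class Dispatcher {
 public:
  explicit Dispatcher(std::vector<Event> events) : events_(std::move(events)) {}

  // Return every not-yet-delivered event whose timestamp is at or before
  // the absolute playback position (timeline offset + current time).
  std::vector<Event> due(int64_t timeline_offset_ms, double current_time_s) {
    const int64_t now_ms =
        timeline_offset_ms + static_cast<int64_t>(current_time_s * 1000.0);
    std::vector<Event> out;
    while (next_ < events_.size() && events_[next_].utc_ms <= now_ms) {
      out.push_back(events_[next_++]);
    }
    return out;
  }

 private:
  std::vector<Event> events_;
  std::size_t next_ = 0;  // index of the first undelivered event
};
```

Because the position is derived from the media clock rather than the system clock, pausing simply stops it advancing, and events that arrive in the meantime are delivered once playback reaches their timestamps. A late joiner's first call likewise returns the whole back history up to the current position.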

What we did to implement startOffsetTime

Now, to make this work we need to do two things: 1) provide a timeline offset in the stream on the server side and 2) interpret that offset in the browser codec and make it available to JavaScript.

Due to our interest in HTML5, we wanted to use an open container and codecs, which for streaming meant either WebM or Ogg Theora + Vorbis.

WebM is an open video format that marries VP8 video with Vorbis audio in a container based on Matroska. As a version of the Matroska container format, WebM supports setting an origin for the timeline in the form of the DateUTC field in the header.

Ogg also supports setting a timeline origin in the UTC header field of an Ogg Skeleton bitstream, which is a "logical bitstream within an Ogg stream that contains information about the other encapsulated logical bitstreams".

In practice, neither Firefox nor Chrome actually uses this field in either WebM or Ogg. To demonstrate our use case we decided to implement the DateUTC field in WebM because it was simpler than learning how to create and decode a separate logical bitstream in an Ogg container.

WebM DateUTC

To find out how to implement this, we needed to dig into the specifications. According to the WebM specification, DateUTC is 'Supported' so we are on safe ground implementing it.

The Matroska specification describes the DateUTC header field as "Date of the origin of timecode (value 0), i.e. production date." The date type used as defined in the Matroska technical specs is:

Signed, 64-bit (8 byte) integer describing the distance in nanoseconds to the beginning of the millennium (2001-01-01 00:00:00 UTC).

As the HTML5 specification states that startOffsetTime is a JavaScript Date object based on the Unix epoch (1 Jan 1970) with a precision of milliseconds, we had to convert between the two standards.
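The arithmetic is a change of epoch plus a change of unit. The two epochs are 978307200 seconds apart (31 years including 8 leap days). A self-contained C++ sketch of the conversion in both directions follows. The function and constant names are our own for this example and are not part of any library.

```cpp
#include <cstdint>

// Seconds between 1970-01-01 and 2001-01-01 (both 00:00:00 UTC).
constexpr int64_t kMatroskaEpochUnixSec = 978307200LL;
constexpr int64_t kNsecPerSec = 1000000000LL;
constexpr int64_t kNsecPerMsec = 1000000LL;

// Matroska DateUTC (nanoseconds since 2001-01-01) to Unix time in
// milliseconds. Sub-millisecond precision is discarded.
constexpr int64_t matroska_ns_to_unix_ms(int64_t date_utc_ns) {
  return (date_utc_ns + kMatroskaEpochUnixSec * kNsecPerSec) / kNsecPerMsec;
}

// Unix time in milliseconds to Matroska DateUTC (nanoseconds since
// 2001-01-01), as a muxer would need when writing the header field.
constexpr int64_t unix_ms_to_matroska_ns(int64_t unix_ms) {
  return unix_ms * kNsecPerMsec - kMatroskaEpochUnixSec * kNsecPerSec;
}
```

A DateUTC of 0 therefore corresponds to a Unix time of 978307200000 ms, and dates in the present day fit comfortably within the signed 64-bit range in either representation.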

Patching gstreamer's matroskamux

To serve the live streams, we used flumotion, which wraps web-streaming around the gstreamer multimedia framework.

As we are using gstreamer with flumotion to encode into the WebM format, we needed to check the implementation of matroskamux to see how it handled the DateUTC field.

matroskamux is hard-wired to write the current time (i.e. the time at which the component is instantiated) into the DateUTC field. This isn't really what we needed so we modified this component to add a date-utc property which we could set when we started up the encoding pipeline. We also patched it to broadcast each buffer's timestamp via UDP to be picked up by our event server to synchronise the event stream.

While the first modification is generally useful, the latter is really only appropriate to our experimental set up. Ideally, we want to set the timeline origin from a variety of sources, the most useful being the input media stream. A specific example would be propagating the programme reference clock from an MPEG-TS input stream. We would like to investigate this in future work.

Firefox patches

On the Firefox side, we needed to implement both the DOM interface to startOffsetTime and the decoder. Implementing the DOM interface was quite straightforward. We were able to read through the implementation of currentTime to see how the DOM connected up with the decoder. For example, the code that reads the startOffsetTime attribute looks like this:

/* readonly attribute double startOffsetTime; */
NS_IMETHODIMP nsHTMLMediaElement::GetStartOffsetTime(double *aStartOffsetTime)
{
  // Scale the decoder's value by 1/1000; report 0 if no decoder is attached.
  *aStartOffsetTime = mDecoder ? (mDecoder->GetStartOffsetTime() / 1000.0) : 0;
  return NS_OK;
}

Changes to libnestegg

The code for reading the DateUTC field is only slightly more complicated due to the difference in epochs between Matroska and HTML5:

int64_t date_utc = 0;
r = nestegg_date_utc(mContext, &date_utc);

if (r == 0) {
  // convert from matroska epoch to unix epoch
  // and nanoseconds to milliseconds
  const int64_t NSEC_PER_SEC     = 1000000000LL;
  const int64_t EBML_DATE_OFFSET = 978307200LL * NSEC_PER_SEC;
  const int64_t NSEC_PER_MSEC    = 1000000LL;

  date_utc += EBML_DATE_OFFSET;
  date_utc /= NSEC_PER_MSEC;

  ReentrantMonitorAutoEnter mon(mDecoder->GetReentrantMonitor());
  // ...
}

where nestegg_date_utc is a function we added to read the DateUTC field out of the Matroska header.

Today - proof of concept. Tomorrow...?

It turned out to be quite straightforward to implement startOffsetTime in Firefox and matroskamux. The existing specifications for WebM, Ogg and HTML5 provide the necessary data definitions - we just needed to hook them up together. Being able to set our own origin to the media timeline greatly simplifies the task of synchronising to live media over the web.

Our proof of concept is just a start. We'd like to see the startOffsetTime attribute implemented in all HTML5 compliant browsers and major encoders. However, it's not just a matter of copying a field from a stream into a data structure. We all need to agree on what these bits of data mean.

In the next post, we'll look at how Firefox and Chrome differ in their interpretations of the currentTime attribute and why we think Firefox is right.


  • Comment number 1.

    Thanks for the insight into the browser-based approach for synchronising Internet media with audio/video. I have been involved in similar research, though aimed more at hybrid broadcast-broadband delivery. We synchronise second screen content to broadcast video by means of a timeline component added to the broadcast MPEG2-TS; rather like subtitle insertion. This timeline expresses the progress of time in the ongoing program. By interrogating the rendering device (e.g. set-top box), a second screen application is able to determine the temporal position of the displayed broadcast video and to synchronise its web-sourced content accordingly. Our work has targeted use cases with exacting synchronisation requirements, such as an alternative view on a tablet accompanying a broadcast sport or music event on the TV. More detail may be found in an article that is available here: https://dx.doi.org/10.1109/ICCE-Berlin.2011.6031815.

    Whilst our work has focused on hybrid delivery, it could also be employed for your Internet streaming use case, as the timeline may be transported over IP protocols. What I find appealing about the HTML5 solution, though, is compatibility with the ubiquitous web browser, once startOffsetTime is implemented.

    Here’s looking forward to your next post.

  • Comment number 2.

    Oh, great stuff!

    You might recall my name from the different bugs you linked. I have since stopped my small HTML5 live video streaming business, and came to work for Opera Software instead.

    However, I'm still very interested in startOffsetTime, and a big part of why I started in a browser company was because I found web standards very interesting to follow. ;-)

  • Comment number 3.

    startOffsetTime has been renamed to the much clearer startDate. And it's not needed for most basic live streaming + synchronization now (not involving CDN's and whatnot), because currentTime will give you how long it was since the stream was started.

