BBC R&D

Posted by Libby Miller

ViSTA-TV is an EU-funded research project which is applying real-time data mining techniques to anonymised IPTV and Radio audience and programme data.

One of the applications developed in the project is "Infinite Trailers", a very simple interface to iPlayer that behaves like a continuously playing channel of BBC TV videos.

The user can choose to watch whatever is suggested, or skip the current programme to get better suggestions. The concept is like flicking between channels but with extra intelligence built in, to help the user find something interesting to watch quickly.

We recently ran a user test on this application which produced conclusions that in retrospect are obvious, but weren't to us at the start of the test:

People don't like to be shown (or suggested) the same programme over and over again

With a limited number of programmes available at any one time and a degree of personalisation, that's what tends to happen.

In this post, I summarise the results of the test, and why we think this result is important. More details are in a forthcoming white paper.

The Prototype

In the section I work in within BBC R&D (Internet Research and Future Services), we work by prototyping ideas with real data in order to demonstrate them to internal and external stakeholders and test and evaluate them with end users. Our philosophy is to prototype and iterate, so the prototypes people see may not be complete or fully functional, but should help us learn whether the idea is successful and what we need to do to build more stable versions.

The version of Infinite Trailers we tested was the second version of the application. The first was built in a week, reusing infrastructure developed in the MyMedia and Snippets projects, and showed the power of the idea but inevitably could not be supported for long. The second version had to be robust enough to be tested with end users, and so we used the BBC's 'Standard Media Player', which finds the most appropriate format for a client's capability.

The result is a web application, which plays a succession of available iPlayer programmes in full. The initial list is what is most popular at the time.

The user interacts with one button - 'next' - which can be pressed at any time. Clicking on 'next' causes a switch to a different programme based on the user's past skipping and watching behaviour. The user interface is deliberately very simple and televisual.

It records 'like' after 30s of watching. It records 'dislike' if 'next' is pressed. This data is fed to a recommendations algorithm to generate the next item shown to the end user.
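As a rough illustration of the feedback rule described above, here is a minimal sketch in Python. The function name and the event labels are hypothetical, and the post does not say what happens when someone watches for more than 30 seconds and then presses 'next'; this sketch simply records both signals in that case.

```python
# From the post: a 'like' is recorded after 30 seconds of watching.
LIKE_THRESHOLD_SECONDS = 30


def feedback_events(seconds_watched, pressed_next):
    """Return the implicit feedback events for one programme viewing.

    Assumption: the two rules are applied independently, so a viewing
    can produce both a 'like' and a 'dislike'.
    """
    events = []
    if seconds_watched >= LIKE_THRESHOLD_SECONDS:
        events.append("like")
    if pressed_next:
        events.append("dislike")
    return events


# Skipped after 10 seconds: only a 'dislike' is recorded.
quick_skip = feedback_events(10, pressed_next=True)
# Watched for 45 seconds without skipping: only a 'like' is recorded.
long_watch = feedback_events(45, pressed_next=False)
```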

Personalised TV Channels

The intention was to make the experience like linear TV, where people click between channels until they find something interesting to watch immediately.


Conceptually, there is a branching tree of possible programmes. In the first instance a programme plays from near the start. If the user does nothing, the programme plays in full and the next programme plays after it, and so on, travelling along the top line of the triangle in the sketch above. In our case, this initial top path consists of the most popular on-demand shows right now, ordered by popularity.

If the user clicks the 'next' button, the tree branches and they are placed on a different path through it. Each path acts like a personalised channel, and the more the user interacts with it, the more personalised the channel becomes, based on whatever algorithm is used to cluster programmes.

Drop-in, Privacy-preserving Recommendations

Infinite Trailers has the feature that any measure of similarity between pairs of programmes within a fixed set can be used to produce recommendations by modifying a JavaScript file. This feature was originally used in Sibyl, a recommender developed by Chris Newell, and it makes it trivial to test different recommendations algorithms with an A/B test.
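To make the "drop-in" idea concrete, here is a minimal sketch of how a table of pairwise similarities could drive the choice of the next programme. It is written in Python for illustration only (the prototype itself uses a JavaScript file), and the scoring rule, the function name and the programme identifiers are all assumptions rather than a description of how Sibyl or Infinite Trailers actually works.

```python
def recommend_next(similarity, likes, dislikes, already_shown=()):
    """Pick the unseen programme closest to the likes and furthest from the dislikes.

    `similarity` maps (programme_a, programme_b) pairs to a score; pairs are
    treated as symmetric and missing pairs count as 0.0.
    """
    programmes = set()
    for a, b in similarity:
        programmes.update((a, b))

    def sim(a, b):
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    candidates = programmes - set(likes) - set(dislikes) - set(already_shown)
    if not candidates:
        return None  # nothing left to suggest from the fixed set

    def score(candidate):
        liked = sum(sim(candidate, p) for p in likes)
        disliked = sum(sim(candidate, p) for p in dislikes)
        return liked - disliked

    # Sorting first makes ties resolve deterministically (alphabetical order).
    return max(sorted(candidates), key=score)


# Hypothetical similarity table for four programmes.
similarity = {
    ("crime_drama", "period_drama"): 0.8,
    ("crime_drama", "quiz_show"): 0.1,
    ("period_drama", "quiz_show"): 0.2,
    ("crime_drama", "news"): 0.3,
    ("period_drama", "news"): 0.1,
    ("quiz_show", "news"): 0.4,
}
suggestion = recommend_next(similarity, likes=["crime_drama"], dislikes=["quiz_show"])
```

Because the function only needs the similarity table, swapping in a different table is enough to compare two algorithms, which is the property that makes A/B testing straightforward.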

Two recommendations algorithms were tested: the Basic Metadata algorithm and the Enriched Metadata algorithm.

The Basic Metadata algorithm used attributes obtained from the BBC's programme scheduling database. The attributes consisted of one or more hierarchical genres (e.g. "drama/crime") and a programme format (e.g. "documentary") for each programme. For some programmes there were also people, location and subject tags, but the distribution and availability of these was not consistent between the different programmes.

The Enriched Metadata algorithm used all the attributes used by the Basic Metadata algorithm with additional attributes obtained from external sources.
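The post does not say how either algorithm turns attributes into a similarity score. As one plausible stand-in, the sketch below expands each hierarchical genre into its levels and compares two programmes with Jaccard similarity (shared attributes divided by all attributes). The function names, the attribute prefixes and the example genre values are illustrative assumptions.

```python
def attribute_set(genres, programme_format):
    """Flatten hierarchical genres and a format into a set of attribute strings.

    A genre such as "drama/crime" contributes both "genre:drama" and
    "genre:drama/crime", so programmes sharing only the top level still overlap.
    """
    attrs = {"format:" + programme_format}
    for genre in genres:
        parts = genre.split("/")
        for depth in range(1, len(parts) + 1):
            attrs.add("genre:" + "/".join(parts[:depth]))
    return attrs


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Illustrative values only: two documentaries in different factual sub-genres.
history_doc = attribute_set(["factual/history"], "documentary")
science_doc = attribute_set(["factual/science"], "documentary")
score = jaccard(history_doc, science_doc)  # 2 shared attributes out of 4
```

Under this reading, an enriched algorithm would simply add more attribute strings to each set before the comparison.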

The User Test

We identified two goals for the test:

  • To test a recommendations algorithm produced within the ViSTA-TV project against a neutral alternative in a realistic context (the hypothesis is that the Enriched Recommender will perform better than the Basic Recommender, because similarity will be based on more data points)
  • To test whether remembering user choices over time results in more satisfaction with the application (the hypothesis that "Warm start" is better than "Cold start")

The first of these was a test of one of the outputs of the ViSTA-TV project. The second addresses an important additional aspect of recommendations: if we have some information about a person's past behaviour, is it always a good idea to use that information for recommendations? Can we get evidence one way or another?

The test was held for 6 days, from March 25th to March 30th 2014, with 46 participants. Half were women, half men; the group had a range of different ages, and were from a range of backgrounds and locations within the UK. They were asked to use the system for at least 5 minutes per day and try to find something to watch. Their behaviour was tracked (with their consent) and they also filled in questionnaires.

Qualitative Results

The main conclusion from the questionnaires was that although people like their preferences to be recognised, variety is also important.

As preferences are narrowed down, given the fixed quantity of programmes the BBC has available in any given period of time, the choices presented can become repetitive. There was some indication that the Enriched metadata-based recommendations strategy produced more repetitive results, as did saving preferences over time (more on this below).

Overall, a large majority of participants would recommend the application to others, and found programmes they would not otherwise have watched with it.

Some immediate interface improvements were suggested, particularly a 'back' button.

Quantitative Results

The quantitative results back up the conclusions from the qualitative results, but in a slightly unexpected way. Notably, people's perceived (qualitative) experience improved under the Enriched recommendations algorithm, and they pressed "next" fewer times before they found something to watch - but in fact they watched around 40% fewer minutes than with the Basic recommendations algorithm. Similarly, there was a pronounced drop in minutes watched under warm start compared with cold start. In terms of what they actually did, then, Basic recommendations and Cold start performed better.

Previous research suggests that this may be connected to the diversity of programmes presented to the user.

This makes intuitive sense: with a fixed set of programmes, improving the accuracy of recommendations reduces the number of options available, so that although it might be quicker to find something to watch, the options presented to the end user are likely to be less diverse - there are simply fewer available programmes to suggest.

One simple measure of diversity is the number of repeats - what proportion of the programmes presented to the test subjects had already been presented to them before. For the Basic recommender, around a third of programmes presented to the user had been presented to them before; for the Enriched recommender, around half had. This suggests that diversity of programmes may provide at least some of the explanation for the difference in the number of minutes watched under the two recommendations conditions.
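One way to compute such a repeat measure is sketched below. It assumes we have, for each user, the ordered list of programme identifiers that were presented, and it counts a presentation as a repeat if that programme had already been shown to the same user. This is one reading of the measure rather than the exact calculation used in the analysis, and the function name and identifiers are hypothetical.

```python
def repeat_proportion(presented):
    """Fraction of presentations that showed a programme the user had already seen."""
    seen = set()
    repeats = 0
    for programme_id in presented:
        if programme_id in seen:
            repeats += 1
        else:
            seen.add(programme_id)
    # An empty history has no repeats by definition.
    return repeats / len(presented) if presented else 0.0


# Six presentations, three of which (the 3rd, 5th and 6th) are repeats.
history = ["a", "b", "a", "c", "a", "b"]
proportion = repeat_proportion(history)
```

A higher value means a less diverse experience, so on this reading the Basic recommender would score roughly a third and the Enriched recommender roughly a half.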

We can do a similar analysis for the cold and warm start conditions. However, in this case there are no clear-cut results - in fact, for one group, warm start resulted in fewer repeats. This is counter-intuitive because, over time, we might expect the number of available programmes to reduce as the system learned the users' preferences.

A possible explanation for this is an accidental feature of the design of the experiment. Every day, the experiment started with what was most popular. As a result, in the cold start condition, certain brands and programmes were presented as a starting point on consecutive days, whereas when preferences were persisted (warm start) the starting point on each day was more diverse.

These results suggest further research questions:

  • Is there a particular point at which users become frustrated with particular algorithms because of repetition?
  • Are there better measures of diversity / variety we could use?
  • Is there an optimum choosing period where more dislikes mean that the programme selected is a better choice for the user? Would a back button make a difference to this?
  • Does spending more time choosing result in a better result for the user?
  • Is it better to find something quickly or spend more time watching overall? What counts as success?

They also suggest that any future test should choose a different starting point for measuring the usefulness of persisting preference data over time, such as a random selection of programmes.

Summary

This experiment placed 46 people into four different groups, in order to test:

  • The success of a recommendations algorithm developed in ViSTA-TV against a neutral alternative in a realistic context.
  • Whether remembering user choices over time results in more satisfaction with the application (the hypothesis that "Warm start" is better than "Cold start").

We also wanted to discover the testers' views on the usefulness and usability of the application.

We found, firstly, that a large majority of users enjoyed using the application and would recommend it to others; and also that a large majority of users watched new programmes they wouldn't otherwise have watched.

Secondly, that a less diverse recommendations algorithm led to substantially fewer minutes watched on average. In this case, the algorithm that produced better results by one measure (a greater proportion of likes to dislikes) was associated with fewer minutes watched overall, programmes watched for a shorter period, and less diverse programmes.

Thirdly, that with respect to the cold start and warm start part of the experiment, the design of the test was partly flawed, because the cold start starting condition was what was popular, and popular programmes have a tendency to be persistent over time and not diverse.

This was a small, short test, but what's interesting and important about it is that it gives us some evidence that personalisation alone may not be the best strategy to get people to watch programmes, and that diversity of programmes presented to people might be key to their enjoyment of recommendations.

Acknowledgements

Thanks to Lianne Kerlin, who helped a huge amount with the analysis; to Andrew Nicolaou and Chris Newell, who provided the elegant and reliable technical infrastructure, together with Valentina Maccatrozzo (Vrije Universiteit Amsterdam); to Joanne Moore for reading over an earlier version of this and providing very helpful comments; and to Dan Nuttall, Ant Onumonu and Andrew Wood, who worked on the initial version of the Infinite Trailers prototype.

We'd also like to thank all the participants in the various workshops we've held as part of the project, and all our user testers.