Hi, I'm Richard Cooper, the BBC's Controller of Digital Distribution for BBC Future Media.

As many of you will have noticed, we suffered a serious incident over the weekend which affected BBC iPlayer, BBC iPlayer Radio, and audio and video playback on other parts of bbc.co.uk. We also had to use our emergency homepage for prolonged periods of time.

Here’s what happened.

We have a system comprising 58 application servers and 10 database servers that provides programme and clip metadata. This data powers various BBC iPlayer applications for the devices that we support (over 1,200 and counting) as well as modules of programme information and clips on many sites across BBC Online. This system is split across two data centres in a "hot-hot" configuration (both running at the same time), with the expectation that we can run at any time from either one of those data centres.
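
To illustrate the expectation that either data centre can serve requests on its own, here is a minimal Python sketch. It is not the BBC's implementation: the function name `fetch_with_failover` and the idea of passing each data centre in as a callable are assumptions made purely for illustration.

```python
from typing import Callable, Optional, Sequence


def fetch_with_failover(key: str, sites: Sequence[Callable[[str], dict]]) -> dict:
    """Try each data centre in turn and return the first successful response.

    `sites` is a hypothetical list of callables, one per data centre, each of
    which either returns metadata for `key` or raises an exception.
    """
    last_error: Optional[Exception] = None
    for fetch in sites:
        try:
            return fetch(key)
        except Exception as error:
            # Remember the failure and move on to the next data centre.
            last_error = error
    raise RuntimeError("all data centres failed") from last_error
```

A sketch like this only helps when the failure is confined to one site; it does nothing if both data centres are affected at the same time.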

At 9.30 on Saturday morning (19th July 2014) the load on the database went through the roof, meaning that many requests for metadata to the application servers started to fail.

The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail.
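
The difference between the two kinds of product can be sketched in a few lines of Python. This is an illustrative example only, not the code any BBC product runs: the class name `StaleOnErrorCache`, the `fetch` callable and the time-to-live parameter are all assumptions. It shows a product-level cache that keeps serving the last good copy of the metadata when revalidation against the metadata service fails.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serve fresh entries when possible; fall back to stale ones if the upstream fails."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical call to the metadata service
        self._ttl = ttl_seconds      # how long an entry counts as fresh
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh, no upstream request needed
        try:
            value = self._fetch(key)  # attempt to revalidate
        except Exception:
            if entry is not None:
                return entry[1]      # upstream failed: serve the stale copy
            raise                    # nothing cached, so the failure surfaces
        self._entries[key] = (now, value)
        return value
```

A product that calls the metadata service directly behaves like the final `raise` branch on every request, which is why those products failed first.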

At almost the same time we had a second problem. We use a caching layer in front of most of the products on BBC Online, and one of the pools failed. The products managed by that pool include BBC iPlayer and the BBC homepage, and the failure made all of those products inaccessible. That opened up a major incident at the same time on a second front.

Our first priority was to restore the caching layer. The failure was a complex one (we’re still doing the forensics on it), and it has repeated a number of times. It was this failure that resulted in us switching the homepage to its emergency mode (“Due to technical problems, we are displaying a simplified version of the BBC Homepage”). We used the emergency page a number of times during the weekend, eventually leaving it up until we were confident that we had completely stabilised the cache.

Restoring the metadata service was complex. Isolating the source of the additional load proved to be far from straightforward, and restoring the service itself was not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems). Performance of the system remained sufficiently poor that in the end we decided to do some significant remedial work on Saturday afternoon, which ran on until the evening. During that period, BBC iPlayer was effectively unusable.

After that work was complete we were in a walking wounded state that allowed close to normal operation for much of the site, though BBC iPlayer remained down on a number of devices. We chose to run it in this mode throughout the rest of the weekend while planning a full restoration of the service. By the time we were ready to do that we were entering the peak period on Sunday evening, so rather than risk the service further, we chose instead to do it on Monday morning.

We recognise that during this incident, with BBC iPlayer unavailable for some periods for some users, you may not have been able to watch or listen to the programmes you wanted. I’m afraid we can’t simply turn back the clock, and as such the time available to watch some programmes within the normal seven-day catch-up window was reduced. Essentially, programmes aired on Saturday 12th July and Sunday 13th July were not available this last weekend for some users. It's small consolation, but that was the weekend of the World Cup Final, Scottish Open, Women's Open and other live sporting events, which are less likely to be viewed on catch-up. I should also stress that programmes aired this weekend - when the problems occurred - are available now on BBC iPlayer.

BBC iPlayer is an incredibly popular product: last year alone we had 3 billion requests, and instances like this are incredibly rare.

We will now be completing the forensics to make sure that we’ve fully understood the root causes, and to put in place the measures necessary to minimise the chances of such interruptions in the future.

We're sorry for the inconvenience.

Richard Cooper is Controller of Digital Distribution, BBC Future Media

Comments

This entry is now closed for comments.

  • Comment number 134. Posted by AlexeiF

    on 9 Aug 2014 12:41

    @133.Graeme Hewson
    I only noticed the absence of Human Planet Ep.2 today when I wanted to watch it. Bunged them an email, but these sort of problems usually take several days for a reply & (sometimes, if at all) solution, so I doubt it'll be viewable by this evening :-(
    Same happened with a recent Saturday morning episode of David Attenborough's The Living Planet, which was "available shortly" for a whole week, & wasn't fixed before the 7 days were up, although they were sent a fault report on the Monday.

    As for The Stuarts Ep.2 in HD, according to
    http://iplayerhelp.external.bbc.co.uk/tv/programme_latest_issues
    "The Stuarts - Episode 2 broadcast on BBC Two on 06/08/2014 does not contain HD version."

    HD/non-HD versions often seem arbitrary - some repeat airings of series (eg The Living Planet again) are a mixture of both. I've also noticed that HD versions are often (though arbitrarily not always) only available for 7 days after initial broadcast, rather than the full availability period of the programme after repeats or with Series Catch-up. In this instance HD was probably available until 2045 on Wed 6th, then reverted to non-HD until 0005 Wed 13th early morning when it expires. Best not to rely on the full stated 'available until' period if you want to guarantee HD-viewing.

    Although I understand the reason for the 30->7day rollback on radio iPlayer, I wish there was more notice on the site; missed Wed-Fri last week as was going to catch-up this wknd :-(

  • Comment number 133. Posted by Graeme Hewson

    on 8 Aug 2014 13:17

    So, how's the root cause analysis coming along?

    I wonder if the BBC could answer these questions (I haven't had replies yet to the questions I submitted through the iPlayer Web form).

    iPlayer says about Human Planet part 2: "This programme will be available shortly after broadcast". - http://www.bbc.co.uk/programmes/b00llpvp - It's been six days now since part 2 was broadcast, and three days since the repeat on Tuesday. Will it be available on iPlayer before part 3 is broadcast tomorrow?

    Also, The Stuarts part 2 - http://www.bbc.co.uk/iplayer/episode/b03tv7f2/the-stuarts-2-a-king-without-a-crown - isn't available in HD, although part 1 was available in HD. Why is this? In fact, I see part 1 isn't available in HD now, either. Why is this, please?

  • Comment number 132. Posted by Brian Wall

    on 7 Aug 2014 08:45

    This may not be relevant to the outage but when going to

    http://www.bbc.com/news/correspondents/davidshukman/

    all the various audio players began at once! As I hadn't yet scrolled down, I couldn't see where the cacophony of multiple Davids was coming from! Revisiting the site it was OK. Using Chrome latest.

  • Comment number 131. Posted by Wemb

    on 7 Aug 2014 00:50

    Oh no, all my favourites have disappeared as of 1am Thur 7th August. Anyone else seen this? Or have any idea as to what to do - as they're now stored on the beeb's cloud - how do I recover them???

  • Comment number 130. Posted by doubtful

    on 6 Aug 2014 22:08

    My listening experience was already destroyed by the 're-vamp', so the crash had far less impact than it should. Rather it was expected. Damage should be anticipated when a monster is released. I'm unsure whether it is an optimistic naivete or just blind stupidity that allowed the release of this particular monster without thorough testing, but the explanations are wearing very thin.

    From the non-techie user's point of view, we had something that was efficient and worked and now we don't. My sincere wish is that someone who can make a difference is listening.

  • Comment number 129. Posted by Karin

    on 6 Aug 2014 17:09

    @Guy (#121) and @cmradtech (#128), as for Radio restarting from zero, the new restart information is probably stored as cookies, which you may be clearing. I don't clear my BBC and other media cookies, as some other media sites store Favourites that way. My programs always remember where they left off, even if that was frustratingly only a minute or so into a (then) badly buffering playback I was forced to abandon. I do wish there were a re-start option, in addition to where I left off.

    As for Favourites on TV, mine worked only on the first week of release! As for Radio and TV Favourites, contrary to misinformed iPlayer Customer Support advice (automated paragraphs from junior 'call centre' staff), iPlayer stores Favourites in a database, as long as you sign in. Only if you do NOT sign in, will it then use cookies, which may get inadvertently cleared or even expire, in theory. Given that you're both posting with IDs, your Favourites should be there, unless those records were lost in the great crash of July 19!

    I did find the non-popping pop-out player bemusing, but I'm not a regular 'popper.' I was only curious as to where I would find this HD thing. It used to pop-out, even on the new Radio Player. Alas, my DAB radio does live at 320kbps (Radio 3), while my ISP seldom does.

    Lastly, following up on the Radio Player home page display issues (me, #113), after more testing, it seems it works (on a PC), only if your browser window is set at about 1040 pixels wide or more, e.g. zoomed to full screen. Otherwise, the expanders fail to 'bounce' into picture mode. In the olden days, we used to cater for 1024x768 to ensure all PCs and laptops could display properly, not just wide-screens. I have mine set at 1033x770 (leaving me just enough room to access other windows of desktop icons on a 17-inch screen). The home page should work on most laptops/PCs. It should have been tested, but doesn't and clearly wasn't. Please fix it. It's embarrassing (or should be).

    @cmradtech (#128), isn't the point of thrash metal at 7 a.m. to shock the system?... Unless you're just getting in, that is (sigh). ;-)

  • Comment number 128. Posted by cmradtech

    on 6 Aug 2014 15:36

    Thanks Guy and Karin for your comments. I hope we can get an official comment on what the length of availability will be. In the absence of information from BBC, Karin's "speculation" seems like a pretty good explanation of what happened. Bad planning and lack of testing? I've lost interest in an explanation from BBC about what caused the "Outage"; I'd just like them to let us know if there will be any changes like the roll back to 7 days availability before it happens.

    I've been annoyed that the iPlayer doesn't restart from where the program stopped, just restarts from the beginning. Also that the "pop-out" player doesn't pop-out. I thought that these annoyances were a problem with my PCs, but thanks Guy for confirming that it's a problem from the iPlayer. I can put up with these annoyances; having my favourites spread across six pages and even changing back to 7 days availability. I would just like the player to work consistently and for the BBC to let us know if they are going to change something like the length of availability ahead of time. Almost a month after the "Outage" I'm still not sure what will happen when I click on iPlayer!

    And Karin, I have quite a wide musical taste, but I would like the type of music I pick to show up when I ask for it. Expecting Richard Allison and getting Thrash Metal at 7 am is too much of a shock to my system at my age :)

  • Comment number 127. Posted by Karin

    on 6 Aug 2014 12:35

    As I have seen no official explanation as to what the IT forensics have managed to dig out, I offer some further experience-based speculation of my own. An official (and frank) update such as Richard Cooper's, both technical (posted in this blog) and less/non-technical elsewhere on the BBC sites (e.g. BBC Home Page, iPlayer Help, Points of View message board) and discussed in programmes such as Radio 4's Feedback would be most welcome. This blog shouldn't just be about promoting new BBC Web Wonder-toys.

    Unfortunately, there is still fallout from the weekend outage of July 19-21+ (e.g. 'overwritten' content, 7 vs 30 days...)

    First, as to the publishing of Radio 1Xtra content over other Radio programmes, I again very much doubt it is due to digital attack from outside. Rather more likely is that the database and application that manage the ID keys (b0nnxxxxx, p0nnxxxxx, etc.) that are used to tag every 'asset' (image, episode, series, etc.) assembled to build the Web pages fell over. However it was restored, records for expectant content (hasn't aired yet) that had been pre-assigned keys may have been lost. In other words, forgetting the keys had already been assigned elsewhere, when newly completed content was ready to go, the system gave out the same keys again. You might click links in a pre-prepared Radio 3 programme page and get 1Xtra content! Think of it as inadvertently bringing the generations (and musical tastes) together. Or not. ;-)

    Second, a LOT was released that one weekend, virtually concurrently, which may have contributed to the database and application systems collapse:
    (a) Commonwealth Games
    (b) BBC Proms (many Web pages were inexplicably late in appearing, although the Proms schedule is published in book form as early as March/April)
    (c) application-based changes due to 30-day extension for all Radio content, plus ensuing cache-size (iPlayer not ours) growth.
    If these changes were not tested (including interactions) and their release carefully scheduled (and duly tested), then releasing them all at once could easily cause problems, both in terms of conflict for resources and expecting something to be 'as it was.'

    When major events such as the Games, the Proms, Glastonbury, etc. are launched, they come with 'mini-sites,' often new (not fully tested?) applications, and even new technologies (recall that Live Streaming debuted with the 2012 Olympics, taking almost exclusive IT priority that summer, regular maintenance went to seed; this year we have the Proms in Surround Sound and even HD [320 kbps?] on the Live Radio Player). As every Web page contains dozens of ID keys, the keys management system and database would be hit tremendously hard with massive publication. I would not be surprised if the ensuing load triggered the (initial) database fallover.

    Of course, once the iPlayer cache went, the databases would have been slaughtered by the millions of content requests directly hammering them! Think of the cache pools as the front line guard. We see slow responses, the DBAs see spikes and possibly fallover. The ability to wall off access to the back-end is paramount, both during releases or substantial cache failure.

    It would be better in future to schedule these releases more carefully, not all at once. Indeed, as the back-end databases are or should be built for smaller interspersed (as each show finishes airing) updates, with the cache isolating the load of millions of user read requests from them, it would be better if the back-end were taken offline for a scheduled downtime, while such bulk updates were done. We might not be able to access new content for a while, but presumably the cache (if so architected) could satisfy viewer/listener needs. Personally, I'd rather wait 6 hours than lose 3-4 days.

    Also, if the two data centres are tightly linked, problematic releases merely infect both at the same time. In the absence of a robust test centre, with proper user/update load simulations, could the link not be severed and the new release applied to only one? That's probably a major architectural re-think.

    Third, the 30-day extension would have made things even worse (hard to imagine?), in terms of data-load, disk-space, and cache-strain, if it had continued, so I do understand the precaution of rolling back MOST new content to 7 days. I just wish it were consistent within a series to manage our expectations (and listening plans) better. However, some series, like BBC Proms (and even Commonwealth Games on TV), made a big promotion of the 30-day extension, so all of their episodes (thus far) were given the 30 days, presumably to minimise any backlash.

    To the BBC Trust, other bosses, and the lawyers, when such massive (or indeed smaller) failures occur, could there not be appropriate paragraphs in the contracts to ensure that we can still watch/listen to content for an extended period? For example, does your FIFA contract really give a specific date? If so, why does ITV offer us 30 days?


    Now, hopefully the BBC iPlayer folks will have already tested such 'theories' on their own and instituted sensible design, development, testing, and release management going forward. Ideally, someone will post some comforting words to that effect. Or maybe, somewhere in my speculation lie some hints for the forensics folks. Happy hunting!

    Sorry, no Gremlins, space aliens, licence-fee opposing hackers, or other exotic attack, just the need for old-fashioned proper IT. ;-)

  • Comment number 126. Posted by Guy

    on 5 Aug 2014 12:39

    @cmradtech #125
    The 30 day roll-out of programmes started before the outage. The reason they weren't shown in categories was due to a BBC 'oversight', see #53 @David Morland. As to the return to 7 days, though Richard does not say anything here, when on Feedback two weeks ago (25 July) he did say that because of the outage problems they would be rolling-back the 30 days to 7 again. @Karin also mentioned this #112.

    It was unclear on Feedback if this roll-back is only until they have sorted out the outage problems or the 30-day roll-out was one of the causes of it (@Karin #66, point {1} speculated on this). I hope the 30 day roll-out will be restored and I'm sure I'm not the only one, but we have still not been told what the reasons for the outage were. Hopefully they have got to the bottom of it by now and we will be told. There were indications both here and on Feedback we would.

    Incidentally on the website when the outage happened in the UK the BBC homepage was replaced with an emergency homepage but outside the UK the 'normal' homepage (not the same as the UK one) continued. Now this has changed to a very basic and less useful one. I hope this is only temporary.

  • Comment number 125. Posted by cmradtech

    on 2 Aug 2014 16:30

    So now some of my favourites are only available for 7 days. Did BBC only make old programmes available for one month because they were affected by the "outage?" Or has BBC just realised that it can't cope with one month availability and has gone back to 7 days to avoid another "outage?" What is the policy on length of time programmes are made available on iPlayer Radio?
