Archiving the BBC’s website and social media output

Carl Davies

Service Development and Delivery Manager

BBC Archives has a remit to archive output with historical, cultural and production re-use value. Carl Davies is part of the Digital Archives Services team, and gives an insight into how such a broad range of online content is captured and saved in the archive.

In 2014 I posted a blog, and my colleague Elliot Gibson also wrote an article, detailing how BBC Archives was archiving and preserving the BBC’s web and social media output. Since then the website has changed a great deal, and the BBC has found new ways of interacting with audiences, so we felt it was a good time for an update.

Web Archiving

The BBC tries to keep older pages online for the public to access for as long as possible (dependent on technology, editorial or copyright reasons). It’s BBC Archives’ responsibility to archive copies offline and preserve them. We archived the whole website in 2014, and as most of these pages are static, we don’t need to archive them all again. We now take a targeted and selective approach each year, focused on newly created pages.

To capture pages for our web archive collections we currently have a three-point approach:

  1. WARC Web Crawling: This form of web archiving downloads and preserves the pages in the international standard “WARC” format. These WARC files can then be replayed in software, allowing users and researchers to view and interact with the website as if it were ‘live’ (clicking links and browsing). Web crawling captures a point in time, and currently we aim to do a high quality crawl of selected parts of the BBC website once a year.
  2. PDF Web Crawling: In addition to the WARC files, we ensure each page captured also has a PDF, so we are not solely dependent on WARC technology. We can also share the PDFs for internal research, and because PDF is a universal way of viewing documents, future preservation is more straightforward.
  3. Screencasting: Lastly, we take a screencast of some of our websites, especially when they have been redesigned, to capture the look and feel of the site in a video (a bit like the software tutorials or computer game walkthroughs you often find on YouTube). Someone records their screen while browsing a part of the BBC website, and this walkthrough is then archived alongside our other AV archive collections. Essentially it's a historical record of how the site behaved.
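
As an illustration of what a WARC file contains, the sketch below builds a single WARC "resource" record by hand using only the Python standard library. It is not the tooling BBC Archives uses, which the article does not name; the function name and the example URL are hypothetical. It only shows the general record layout: a version line, named header fields, a blank line, the captured content, and a two-line terminator.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri: str, payload: bytes,
                               content_type: str = "text/html") -> bytes:
    """Return one WARC/1.0 'resource' record wrapping the given payload.

    Hypothetical helper for illustration only; a production crawl would
    rely on dedicated crawling and WARC-writing software.
    """
    # Timestamp in UTC, in the ISO 8601 form used by WARC-Date.
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",                                    # version line
        "WARC-Type: resource",                         # kind of record
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # unique record id
        f"WARC-Date: {warc_date}",                     # capture time
        f"WARC-Target-URI: {target_uri}",              # what was captured
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",             # size of the block
    ]
    # Headers are CRLF-separated and end with an empty line.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs after the content block.
    return head + payload + b"\r\n\r\n"


# Usage: wrap a tiny HTML page (example.org is a placeholder address).
record = build_warc_resource_record("https://example.org/page",
                                    b"<html>hi</html>")
```

Because every record carries its own target address and capture date, replay software can present the page as it stood at the moment of the crawl, which is what makes the "point in time" browsing described above possible.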

AV, Audio & Images published online

From the content published on the BBC website and on social media platforms, we select almost all unique video and audio, for example from BBC3, Radio 1 and iPlayer Exclusives. We archive AV, audio and images to a high standard so that this content can be associated with our other Television and Radio archive collections and re-used within the BBC. We also archive new innovations such as 360 degree video and Surround Sound online audio, along with thousands of images published online.

Social Media Archiving

We aim to archive unique content published on social media platforms. We also archive a selection of the BBC’s Twitter accounts - the actual ‘Tweets’ - as a record of how the BBC communicates with the audience who use Twitter as a social media tool. Many of these accounts complement the traditional TV & Radio output holdings, but also tell the story of the BBC’s corporate communications. Other institutions and organisations, such as The National Archives, have begun archiving Twitter output as a way of archiving cultural memory and current communication methods. Only Tweets (including retweets and replies) from selected official BBC Twitter accounts are captured. Tweets from non-BBC accounts, including those from the general public, aren’t captured. Alongside the BBC generated Tweets, any images tweeted by the BBC are also captured. We also archive many of the videos that appear on Facebook & YouTube.
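
The selection rule for Tweets described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Tweet` structure, the function names and the account names are all hypothetical, and the article does not say how the real capture is implemented.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Tweet:
    """Hypothetical, simplified view of a tweet for this sketch."""
    account: str                      # account that posted the tweet
    text: str
    image_urls: list[str] = field(default_factory=list)


def select_for_archive(tweets: list[Tweet],
                       selected_accounts: set[str]) -> list[Tweet]:
    """Keep tweets (including retweets and replies) posted by the
    selected official accounts; everything else is left out."""
    wanted = {name.lower() for name in selected_accounts}
    return [t for t in tweets if t.account.lower() in wanted]


def images_to_capture(tweets: list[Tweet]) -> list[str]:
    """Collect the image addresses attached to the selected tweets."""
    return [url for t in tweets for url in t.image_urls]


# Usage with made-up accounts and placeholder addresses.
timeline = [
    Tweet("ExampleOfficial", "New programme tonight",
          ["https://example.org/a.jpg"]),
    Tweet("member_of_public", "Looking forward to it"),
    Tweet("exampleofficial", "Reply: thanks for watching"),
]
kept = select_for_archive(timeline, {"ExampleOfficial"})
images = images_to_capture(kept)
```

Filtering by the posting account, rather than by the content of the tweet, is what keeps replies and retweets from the selected accounts in scope while leaving out anything posted by other accounts.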

Our Web & Social media archives are maintained and preserved to the very best industry standards. Much of the content is still live online. If unique material is taken down from public view for editorial or copyright reasons, we still ensure we have captured and archived this material for internal re-use or research purposes.
