Sports Refresh: Dynamic Semantic Publishing
Hi, I'm Jem Rayfield, and I work as Lead Technical Architect for the News and Knowledge Core Engineering department.
This blog post describes the technology strategy the BBC Future Media department is using to evolve from a relational content model and static publishing framework towards a fully dynamic semantic publishing (DSP) architecture. The DSP architectural approach underpins the recently re-launched and refreshed BBC Sports site and indeed the BBC's Olympics 2012 online content.
DSP uses linked data technology to automate the aggregation, publishing and re-purposing of interrelated content objects according to an ontological domain-modelled information architecture, providing a greatly improved user experience and high levels of user engagement.
The DSP architecture curates and publishes HTML and RDF aggregations based on embedded Linked Data identifiers, ontologies and associated inference.
(RDF - Resource Description Framework - is based upon the idea of making statements about concepts/resources in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology. The subject denotes the resource; the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, to represent the notion "Frank Lampard plays for England" in RDF as a triple, the subject is "Frank Lampard", the predicate is "plays for" and the object is "England Squad".)
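The triple shape can be sketched as plain data. This is a deliberately minimal illustration - a real triple store uses full URIs for each term, and the identifiers below are invented for the example, not the BBC's actual ones:

```python
# RDF-style statements as (subject, predicate, object) tuples.
# Identifiers are illustrative, not real ontology URIs.
triples = [
    ("Frank_Lampard", "playsFor", "England_Squad"),
    ("England_Squad", "competesIn", "Group_C"),
    ("Group_C", "partOf", "FIFA_World_Cup_2010"),
]

def objects(subject, predicate):
    """Match a (subject, predicate, ?) pattern - the essence of a triple query."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("Frank_Lampard", "playsFor"))  # ['England_Squad']
```

The point of the shape is that every statement is uniform, so new kinds of relationship need no schema change - only new predicates.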
RDF semantics improve navigation, content re-use, re-purposing, search engine rankings and journalist-determined levels of automation ("edited by exception"), and will in future support semantic advertisement placement for audiences outside of the UK. The DSP approach facilitates multi-dimensional entry points and richer navigation.
BBC News, BBC Sport and a large number of other web sites across the BBC are authored and published using an in-house bespoke content management/production system ("CPS") with an associated static publishing delivery chain. Journalists are able to author stories, manage indices and edit audio/video assets in the CPS and then publish them pre-baked as static assets to the BBC's Apache web server farm. In addition, journalists can edit and manage content in the CPS for distribution to the BBC Mobile and Interactive TV services, and IPConnected TV services. The CPS has been constantly evolving since it was developed to publish the BBC News website, which launched in November 1997, and the latest version (v6) underpins the summer 2010 redesign of the BBC News site that won the .net "Redesign of the Year".
The first significant move away from the CPS static publishing model by the BBC's Future Media department was through the creation of the BBC Sport World Cup 2010 website.
From first using the site, the most striking changes are the horizontal navigation and the larger format high-quality video. As you navigate through the site it becomes apparent that the rich ontological domain model provides a far deeper way of exposing BBC content than can be achieved through a traditional content management system with its associated relational model and static publishing solution.
Previously, BBC Sport would never have considered creating this number of indices in the CPS, as each index would need an editor to keep it up to date with the latest stories, even where automation rules had been set up. To put this scale of task into perspective, the World Cup site had more index pages than the rest of the BBC Sport site in its entirety.
The DSP architectural approach enables the BBC to support greater breadth and scale, which was previously impossible using a static CMS and associated static publishing chain. DSP allows the BBC to support and underpin the scale and ambition of the recently refreshed BBC Sports site and indeed the Olympics 2012 pages.
The entire football section of the refreshed sports site is orchestrated by automated annotation-powered aggregations. The DSP architecture automatically authors a page for every football team and football competition within the UK in addition to a page for every Olympic athlete (10000+), team (200+), discipline (400-500) and dozens of venue pages.
The number of automated pages managed by the DSP architecture is now well in excess of ten thousand. This number of pages is simply impossible to manage using a static CMS driven publishing stack.
Since the World Cup the DSP architecture has been augmented with a Big Data scale content store (MarkLogic) for managing rapidly changing statistics, navigation and in the future all content objects, thus evolving the architecture completely away from its static publishing roots.
DSP enables the publication of automated metadata and content state driven web pages that require minimal journalist management, as they automatically aggregate and render links to relevant stories and assets.
(Metadata is data about data. In this instance, it provides information about the content of a digital asset. For example, a World Cup story might include metadata that describes which football players are mentioned within the text of the story. The metadata may also describe the associated team, group, or organization associated to the story.)
The published metadata describes the BBC Sport content at a fairly low level of granularity, enabling rich content relationships and semantic navigation. Querying the published metadata enables the creation of dynamic page aggregations such as Football Team pages or Athlete pages. Published sports stats and navigation are also mapped to the ontology, allowing statistics and navigation to be published dynamically against automated indices.
The BBC is evolving its publishing architecture towards a model which will allow all content objects and aggregation to be served and rendered on a dynamic request-by-request basis to support rich navigation, state changes such as event or time and, potentially, personalisation; with the information architecture and page layout reacting to underlying semantics and meta model.
The remainder of this post will describe how the BBC intends to evolve the static publishing CPS and the semantic annotation and dynamic metadata publication used for BBC Sport site towards its eventual goal of a fully dynamic semantic publishing architecture.
Static publishing and CPS content management
The CPS has been designed and developed in-house, and so its workflow and process model has evolved to its current form (v6) through continuous iteration and feedback from the BBC journalists who use it to author and publish content, and from the product development teams who build the BBC News and Sport websites. When looking at the requirements for the recently redesigned and refreshed News site, the FM department did consider evaluating proprietary and open-source solutions from the CMS market for their shiny new features.
However the wonderful and interesting thing about the CPS is that most BBC journalists who use it value it very highly. Compared to my experience with many organisations and their content management systems it does a pretty decent job.
The CPS client is built using Microsoft .Net 3.5 and takes full advantage of Windows Presentation Foundation (WPF). The following screen shots of the CPS user interface illustrate some of its features.
Fig 1a: Screen shot of the CPS story-editing window
Fig 1b: BBC CPS, showing the index editor
Figure 1 depicts a screen shot of its story-editing window. The CPS has a number of tools supporting its story-editing functions, such as managing site navigation and associating stories to indices, as well as search.
As you can see there is a component-based structure to the story content - figure 1a shows a video, an introduction and a quote.
These components are pre-defined, allowing a journalist to drag and drop as desired. It is clear that the UI is not a WYSIWYG editor. The current incarnation of the CPS focuses on content structure rather than presentation or content metadata.
Although the editor is not WYSIWYG, CPS content is available for preview and indeed publication to a number of audience-facing outputs and associated devices. On publication, CPS assets are statically rendered for audience-facing output - flavours include RSS, Atom, High-Web XHTML, JSON, Low-Web XHTML and mobile outputs.
Fig 2: BBC News CPS static publishing
The static CPS delivery architecture (depicted in Fig 2 above) provides a highly scalable and high performance static content object-publishing framework.
The CPS UI utilises a Windows Communication Foundation data layer API abstraction which proxies the underlying persistence mechanism (an Oracle relational database). The abstracted relational data model captures and persists stories and media assets as well as site structure and associated page layout.
The CPS UI allows the journalist to author stories, media and site structure for preview, eventual publication and re-publication.
A daemon process, the CPS publisher, subscribes to publication events for processing and delivery.
The CPS publisher contextualises content objects in order that they are appropriate for required audience/platform output. Filtered, contextualised assets are rendered by the CPS publisher as a static file per output type. The CPS publisher uses a Model-View-Controller (MVC) architectural pattern to separate the presentation logic from the domain logic.
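The publish-time MVC split can be sketched as one content model rendered through several views, one static output per flavour. This is a hypothetical miniature, not the CPS publisher's actual code; the field names and output flavours are illustrative:

```python
import json

# One content model (the "M") shared by every output flavour.
story = {"headline": "Goal re-ignites technology row", "body": "Story text..."}

# Two views (the "V"): each renders the same model for a different output.
def render_html(model):
    return f"<article><h1>{model['headline']}</h1><p>{model['body']}</p></article>"

def render_json(model):
    return json.dumps(model)

# The "controller" maps output types to views and produces one static
# rendition per flavour, as the CPS publisher does per platform/audience.
views = {"high-web.html": render_html, "feed.json": render_json}
outputs = {name: view(story) for name, view in views.items()}
```

Separating the model from its views is what lets a new output flavour (say, a mobile variant) be added without touching the content model itself.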
Each output representation is made persistent onto a Storage Area Network (SAN). The BBC's home-grown content delivery chain subscribes to SAN changes and publishes each output from a secure content creation network onto a set of head Apache HTTPd servers accessible to the audience.
Although the CPS relational content model and static publishing mechanism scales and performs well, it has a number of functional limitations. CPS-authored content has a fixed association to manually administered indices, and outputs are fixed in time without any consideration of asset semantics, state changes or semantic metadata. Re-using and re-purposing CPS-authored content to react to different scenarios is very difficult due to the static nature of its output representations. Re-purposing content within a semantic context driven by metadata is impossible without manual journalist management and re-publishing. Complex manual data management inevitably leads to time, expense and data administration headaches.
The CPS relational data model currently has a very simple metadata model, capturing basic items such as author, publish date and site section. Extending the CPS relational content model to support a rich metadata model becomes complex. When designing a knowledge domain annotation schema using a relational approach, one can start by trying to create a flat controlled vocabulary, which can be associated to content objects. However, this quickly breaks down, as the semantics are very unclear. Evolving this further, a flat controlled vocabulary can be grouped into vocabulary categories; nevertheless, a restrictive and hierarchical taxonomic annotation schema soon evolves. As concepts need to be shared, this gives rise to vocabulary repetition and ambiguity. A taxonomic hierarchy therefore evolves further into a graph, allowing concepts to be shared and re-used to ensure that semantics are unambiguous and knowledge is concise.
Implementing a categorised controlled vocabulary within a relational database introduces complexity; creating a hierarchy introduces further complexity; and implementing graph theory within a relational model takes things past the usable limits of a relational model. If you then add requirements for reasoning based on metadata semantics, then relational databases, SQL and fixed schemas are no longer applicable solutions in this problem space.
Dynamic Semantic Annotation Driven Publishing
The primary goals of the BBC World Cup 2010 web site were to promote the quality of the original, authored in-house BBC content in context and to increase its visibility and longevity by improving the breadth and depth of navigational functionality.
Increasing user journeys through the range of content while keeping the audience engaged for longer browser session durations meant that a larger more complex information architecture was required than that traditionally managed by BBC journalists.
Creating a website navigation for 700+ Player, Team, Group and Match pages posed a problem as the traditional CPS manual content administration processes would not scale. An automated solution was required in order that a small number of journalists could author and surface the content with as light a touch as possible; and automatically aggregate content onto the 700+ pages based on the concepts and semantics contained within the body of the story documents.
Fig 3: Dynamic RDF automated England index
The information architecture gave rise to a domain model which included concepts and relationships such as time and location; events and competitions; groups, leagues and divisions; stages and rounds; matches; teams, squads and players; players within squads, teams playing in groups, groups within stages, etc.
Clearly, the sport domain soon gives rise to a fairly complex metadata model. When you then include a model that describes the assets that need to be aggregated with a semantic association to the sport domain, it is quickly apparent that using a relational database is not an appropriate solution. The BBC needed to evolve beyond a relational CPS static architecture.
The DSP architecture and its underlying publishing framework do not author content directly; rather it publishes data about the content - metadata. For the World Cup, the published metadata described the content at a fairly low-level of granularity, providing rich content relationships and semantic navigation. By querying this published metadata we were able to create automatic dynamic page aggregations for Teams, Groups and Players.
The foundation of these dynamic aggregations was a rich ontological domain model. The ontology described entity existence, groups and relationships between the things/concepts that describe the World Cup. For example, "Frank Lampard" was part of the "England Squad" and the "England Squad" competed in "Group C" of the "FIFA World Cup 2010".
The ontology model also described journalist-authored assets - stories, blogs, profiles, images, video and statistics - and enabled them to be associated to concepts within the domain model. Thus a story with an "England Squad" concept relationship provides the basis for a dynamic query aggregation for the England Squad page: "All stories tagged with England Squad" (Figure 3). The required domain ontology was broken down into three basic areas - asset, tag and domain ontologies (Figure 4) - forming a triple, thus allowing a journalist to apply a triple-set to a static asset, such as associating the concept "Frank Lampard" with a story "Goal re-ignites technology row".
The tagging ontology was kept deliberately simple in order to protect the journalist from the complexities of the underlying domain model. A simple set of asset/domain joining predicates, such as "about" and "mentions", drive the annotation tool UI and workflow, keeping the annotation simple and efficient, without losing any of the power of the associated knowledge model.
Fig 4: The Asset (left), Tag (middle) and Domain (right) Ontologies used in the World Cup 2010, simplified for brevity
In addition to a manual selective tagging process, journalist-authored content is automatically analysed against the domain ontology. A natural language determiner process automatically extracts concepts embedded within a textual representation of a story. The concepts are moderated and, again, selectively applied before publication. Moderated, automated concept analysis improves the depth, breadth and quality of metadata publishing.
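At its simplest, the suggestion step amounts to matching story text against the labels of known ontology concepts and proposing the hits for moderation. The sketch below is a toy stand-in for that idea - the real determiner process is far more sophisticated, and the labels and identifiers here are invented:

```python
# Hypothetical concept labels mapped to linked-data-style identifiers.
known_concepts = {
    "Frank Lampard": "dbpedia:Frank_Lampard",
    "Gareth Barry": "dbpedia:Gareth_Barry",
    "England": "dbpedia:England_national_football_team",
}

def suggest_concepts(story_text):
    """Propose concepts whose labels appear in the story, for a journalist
    to moderate and selectively apply before publication."""
    return {uri for label, uri in known_concepts.items() if label in story_text}
```

Crucially, the output is a *suggestion* set: nothing is published until a journalist approves it, which is what keeps the automated annotation trustworthy.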
The following screen shots describe the process of content annotation.
Fig 5a: A journalist, using the Graffiti tool, applies the sport concept "Gareth Barry" to a story about the footballer
Fig 5b: Annotating a story with the location Milton Keynes in the Graffiti tool
The journalist applies suggested annotations as well as searching for triplestore-indexed concepts.
As you can see, all ontology concepts are linked to linked open data (LOD) identifiers (DBpedia, Geonames etc.). ("Linked open data" describes a method of exposing, sharing, and connecting data via dereferenceable URIs.) This allows a journalist to correctly disambiguate concepts such as football players or geographical locations.
Journalist-published metadata is captured and made persistent for querying using the resource description framework (RDF) metadata representation and triple store (BigOWLIM) technology.
Fig 6: Semantic World Cup 2010 publishing, powered by a triplestore
Figure 6 depicts the dynamic semantic architecture built to publish metadata-driven static asset aggregations. A triple-store (RDF metadata database) and SPARQL (RDF query language) approach was chosen over traditional relational database technologies due to the requirement to interpret metadata with respect to an ontological domain model.
The high-level goal is that the domain ontology allows for intelligent mapping of journalist assets to concepts and queries.
The chosen triple-store provides reasoning following the forward-chaining model and thus implicitly inferred statements are automatically derived from the explicitly applied journalist metadata concepts.
For example, if a journalist selects and applies the single concept "Frank Lampard", then the framework infers and applies concepts such as "England Squad", "Group C" and "FIFA World Cup 2010" (as generated triples within the triple store). Thus the semantics of the ontologies, the factual data, and the content metadata are taken into account during query evaluation. The triple-store was configured so that it performed reasoning over the semantics of all this data in real time - handling hundreds of updates per minute while millions of requests a day were served against the same database.
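The expansion step can be sketched as a toy forward-chaining pass: explicit journalist tags plus domain membership facts yield the implicit tags. This is an illustrative miniature of the idea, not BigOWLIM's rule engine, and the facts are invented:

```python
# Hypothetical "member of" facts from the domain ontology.
member_of = {
    "Frank_Lampard": "England_Squad",
    "England_Squad": "Group_C",
    "Group_C": "FIFA_World_Cup_2010",
}

def infer(explicit_tags):
    """Forward-chain: add everything reachable from the explicit tags,
    so one journalist annotation implies its whole chain of containers."""
    inferred = set(explicit_tags)
    frontier = list(explicit_tags)
    while frontier:
        parent = member_of.get(frontier.pop())
        if parent and parent not in inferred:
            inferred.add(parent)
            frontier.append(parent)
    return inferred
```

With materialised inference like this, the England Squad page query stays a simple lookup - it never has to walk the chain itself at request time.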
This inference capability makes both the journalist tagging and the triplestore powered SPARQL queries simpler and indeed quicker than a traditional SQL approach. Dynamic aggregations based on inferred statements increase the quality and breadth of content across the site. The RDF triple approach also facilitates agile modelling, whereas traditional relational schema modelling is less flexible and also increases query complexity.
The BBC triple store is deployed multi-data centre in a resilient, clustered, performant and horizontally scalable fashion, allowing future expansion for additional domain ontologies and, if required, linked open data sets.
The REST API is accessible via HTTPS with an appropriate certificate.
The API is designed as a generic façade onto the triple-store, allowing RDF data to be re-purposed and re-used across the BBC. This service orchestrates SPARQL queries and ensures that results are dynamically cached cross-data-centre, using memcached, with a low, one-minute 'time-to-live' (TTL) expiry.
All RDF metadata transactions sent to the API for CRUD operations are validated against associated ontologies before any persistence operations are invoked. This validation process ensures that RDF conforms to underlying ontologies and ensures data consistency. The validation libraries used include Jena Eyeball. The API also performs content transformations between the various flavours of RDF such as N3 or XML RDF.
Automated XML sports stats feeds from various sources are delivered and processed by the BBC. These feeds are now also transformed into an RDF representation. The transformation process maps feed-supplier IDs onto corresponding ontology concepts, and thus aligns external provider data with the RDF ontology representation within the triple store. Sports stats for Matches, Teams and Players are aggregated inline and served dynamically from the persistent triple store.
The dynamic aggregation and publishing page-rendering layer is built using a Zend PHP virtual machine and memcached stack.
The PHP layer requests an RDF representation of a particular concept or concepts from the REST service layer based on the audience's URL request. So if an "England Squad" page request is received by the PHP code, several RDF queries will be invoked over HTTPS to the REST service layer below.
The render layer will then dynamically aggregate several asset types (stories, blogs, feeds, images, profiles and statistics) for a particular concept such as "England Squad". The resultant view and RDF is cached with a low TTL (one minute) at the render layer for subsequent requests from the audience. The PHP layer dynamically renders views based on HTTP headers providing content negotiated HTML and/or RDF for each and every page.
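The content negotiation step can be sketched as a simple dispatch on the HTTP Accept header: the same concept aggregation comes back as HTML for browsers or as RDF for data consumers. The renderers below are hypothetical stand-ins for the PHP view layer:

```python
def negotiate(accept_header):
    """Choose a representation from the client's Accept header."""
    if "application/rdf+xml" in accept_header or "text/turtle" in accept_header:
        return "rdf"
    return "html"  # sensible default for ordinary browser requests

def render(concept, accept_header):
    """Serve the same aggregation as HTML or RDF, per the negotiation."""
    if negotiate(accept_header) == "rdf":
        return f'<rdf:Description rdf:about="{concept}"/>'
    return f"<h1>{concept}</h1>"
```

Because negotiation keys off headers rather than separate URLs, every page is simultaneously a human-readable document and a machine-readable data resource at the same address.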
The World Cup site made use of existing infrastructure, utilising the significant amount of existing static news kit (Apache servers, HTTP load balancers and gateway architecture). All HTTP responses are annotated with appropriate low (one-minute) cache-expiry headers. This HTTP caching increases the scalability of the platform and also allows Content Delivery Network (CDN) caching if demand requires.
The DSP architecture served millions of page requests a day throughout the World Cup with continually changing OWL reasoned semantic RDF data. It served an average of a million SPARQL queries per day for the duration of the tournament, with a peak RDF transaction rate of hundreds of player statistics per minute. Cache expiry at all layers within the framework is one minute enabling a dynamic, rapidly changing domain and statistic-driven user experience.
Sport Refresh and Olympics Dynamic Publishing
The refreshed BBC Sports site is currently served to the audience using a combination of the two architectural approaches previously described: static publishing and DSP. Which parts of the Sports site are published via DSP and which via static publication is visible to the audience - the flavour of URL shows which system publishes the page.
The refreshed BBC Sports site mashes static and dynamic published assets onto statically published pages via a server side include mechanism. This enables the BBC to migrate a proportion of its content onto the DSP architecture in a gradual phased manner. The end goal is that the static publication chain can be retired.
Assets which are published via the static publication chain are exposed to the audience via URLs which are prefixed with http://www.bbc.co.uk/sport/0/. For example:
- Sport Home Page: http://www.bbc.co.uk/sport/0/
- Football Index: http://www.bbc.co.uk/sport/0/football/
- Golf Index: http://www.bbc.co.uk/sport/0/golf/
- Football Story: http://www.bbc.co.uk/sport/0/football/17088995
Fig 7: The statically published BBC Sport home page (including dynamic navigation and dynamic sport statistics)
The CPS powered static publishing mechanism is currently used to curate, author, manage and publish BBC sports stories and editorially curated indices such as the main sports index and football index.
These assets are hand crafted, content managed, orchestrated and published by journalists.
When these Sports site pages are statically published they include and combine references to dynamic content. These references, known as server side includes (SSI), are resolved at render time at the Apache web server farm. (SSIs are part of a simple interpreted scripting language which allows content from one or more sources to be combined into a static web page.)
The mainly static pages then combine dynamic content such as statistics and navigation into a single page output for consumption by the audience. A static story combined with dynamic navigation and dynamic statistics would be a good example of this mixed publication chain approach. The cacheable proxied SSI mechanism mashes together the content from the static platform and dynamic platform allowing a phased migration towards a fully dynamic BBC sports site.
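The stitching step can be sketched as a directive-for-fragment substitution. The directive syntax below mirrors Apache's SSI include form, but the fragment lookup is a hypothetical stand-in for the dynamic platform (in production, resolution happens in the web server, not application code):

```python
import re

# Illustrative dynamic fragments, as the dynamic platform might serve them.
fragments = {
    "/sport/nav": "<nav>Football | Golf | Olympics</nav>",
    "/sport/stats/league-table": "<table>League table...</table>",
}

SSI_DIRECTIVE = re.compile(r'<!--#include virtual="([^"]+)" -->')

def resolve_ssi(page):
    """Replace each SSI include directive with the fragment it references,
    producing a single combined page for the audience."""
    return SSI_DIRECTIVE.sub(lambda m: fragments.get(m.group(1), ""), page)

static_page = (
    '<html><!--#include virtual="/sport/nav" -->'
    "<p>Statically published story text</p></html>"
)
```

Because the static and dynamic halves only meet at these include points, either platform can be swapped or scaled independently - which is exactly what enables the phased migration.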
Automated annotation-driven aggregation pages such as Football Team, Olympic Athlete, Olympics Discipline and Olympics Venue are powered using the DSP approach. These pages are fully automated, requiring no journalist content management overhead. They do not contain any static content; they are fully dynamic, containing only references to static content objects such as stories or videos.
Journalists annotate BBC content objects such as a sports story or a video with concepts such as an athlete or a football team. Content objects are then automatically aggregated onto pages published using the newer DSP stack. For example:
- Chelsea Football Club: All the content objects associated to the concept "Chelsea"
- Tom Daley: All the content objects associated to the concept "Tom Daley"
- Team GB: All the content objects associated to the concept "Team GB"
Fig 8: The Chelsea FC team dynamic BBC Sport page including automated metadata aggregations, dynamic sports stats and dynamic Sport navigation.
The navigation and sports statistics contained on this page are rendered on a request-by-request basis from the underlying XML content Store (MarkLogic).
The story, video, comment and analysis assets contained on this page are rendered on a request-by-request basis from the underlying RDF store (BigOWLIM).
Fig 9: The BBC Sport ontology as applied to Olympics 2012 Track Cycling
As you can see, the model defines a simple yet generic sport ontology, which is capable of modelling sports from Football to the Men's Cycle Sprint within the Olympics 2012.
All the DSP-powered pages on the sport site use this ontology model as their foundation. A simple asset model, describing assets such as stories and videos linked to the Sport domain representation, allows very rich dynamic content object aggregation.
The DSP's natural language processing and concept suggestion tool, which powers the Graffiti annotation tool, is now ontology aware. When additional concepts are added into the triple store (for example a new athlete), these concepts are immediately suggested to the journalist as concepts for annotation. This feedback loop ensures that changes in the ontology instance data are reflected in all components of the DSP architecture.
Fig 10: Ontology aware natural language processing and annotation suggestion
The refreshed BBC Sport site's horizontal navigation is powered by a content model which links ontology concepts to navigation entries.
This allows navigation entries to be tied to metadata concepts, so that content can be navigated to and aggregated automatically against each entry.
The underlying navigation data and associated content model are stored within a new addition to the DSP architecture - a highly scaled, high-performance, fault-tolerant Big Data store, namely MarkLogic.
Sports statistics provided by third-party suppliers are also now stored as XML content within this query-able Content Store. The BBC sports site queries these XML fragments, adds value and re-formats the statistics into a form consumable on the sports site.
The Content Store which currently powers all of the statistics and navigation on the sports site has been scaled to handle ingesting many thousands of content objects per second whilst concurrently supporting many millions of dynamic page renditions and impressions a day. This high performance content store will allow the BBC Sports site to ingest and render sport statistics including live football scores, live football tables, live Olympics event statistics and results in near real-time whilst rendering this content dynamically using the DSP approach.
The DSP's triple store will be used in a purer sense and will now only be concerned with domain and asset metadata - it will not persist or manage content object data.
This clear separation of concerns makes the DSP persistence mechanism scalable.
Metadata is stored within a persistent RDF store suitable for modelling rich graphs. Content objects are stored within a document store suitable for live ingest and rendering.
A clean domain model, which only contains references to unique content objects, allows the content model to evolve and also allows the content to be stored in a de-coupled fashion. As long as the content has a unique, addressable identifier, the asset->tag->domain RDF model allows the triple store to model extendable real-world concepts and lets the content store model raw referenced assets.
The Sport RDF currently maps third-party statistic identifiers from sport ontology concepts onto sport content objects. This allows querying across the triple-store and content store for sports statistics related to a sport concept, e.g. "The league table for the English Premiership".
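The cross-store pattern can be sketched as a two-step lookup: a domain query against the triple store resolves the concept to a supplier identifier, and a content query against the content store fetches the XML document carrying that identifier. Both stores are faked as dicts here, and the identifiers and XML are invented for illustration:

```python
# Step 1 stand-in: triple-store facts mapping a concept to a supplier stat ID
# (in production this is a SPARQL query over the Sport RDF).
concept_to_stat_id = {"English_Premiership": "supplier-stat-123"}

# Step 2 stand-in: content-store documents keyed by that ID
# (in production this is an XQuery over the MarkLogic content store).
content_store = {
    "supplier-stat-123": "<leagueTable competition='English_Premiership'>...</leagueTable>",
}

def league_table(concept):
    """Resolve a domain concept to its statistics document across both stores."""
    stat_id = concept_to_stat_id[concept]   # domain lookup (triple store)
    return content_store[stat_id]           # document fetch (content store)
```

The join key - the supplier identifier - is the only thing the two stores share, which is what keeps the metadata graph and the raw content cleanly separated.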
Fig 11: Dynamic Content Store powered sports statistics
Content objects and sports statistics can then be cut up and arranged on a personalised, metadata driven, request-by-request basis.
The Olympics 2012 sports statistics are to be ingested and delivered to the audience using the same content store and dynamic render architecture. Statistics will be supplied from every event and venue within the Olympics. These statistics will be ingested in near-real time for inclusion on metadata-driven pages and video feeds. This gives the BBC's online Olympics output a very real sense of 'live'.
The triple-store and content store are abstracted and orchestrated by a REST API. The API will continue to support SPARQL and RDF validation but it will now support XQuery and XML persistence across both the triple-store and the content store.
This allows a content aggregation to be generated using a combination of SPARQL for domain querying and XQuery for asset selection. All content objects and metadata are made persistent in a transactional manner across both data sources.
The content API "TRiPOD" (Figure 12) makes use of a multi-data centre memcached cluster to store content aggregations and protect the triple-store and content store from query storms. The API cache is split into a live cache with a typically low, circa one-minute TTL, and a second, longer stale cache with an expiry TTL of 72 hours.
Memcache is also used to control SPARQL/XQuery invocation using a memcache-based locking strategy.
If the live cache has expired, a lock is created and a single query invocation thread per data centre is invoked. Subsequent requests are served from the stale cache until the query responds, refreshing both the stale and live caches. This caching and locking strategy enables the DSP platform to scale to many millions of page requests, and associated backend queries, a day.
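The live/stale/lock strategy can be sketched in a few lines. This is a single-process toy with dict-and-set stand-ins for memcached (a real deployment would use memcached's atomic `add` for the lock, and the `now` parameter exists only to make the sketch testable):

```python
import time

LIVE_TTL, STALE_TTL = 60, 72 * 3600   # one minute live, 72 hours stale

cache = {}     # key -> (value, written_at); stand-in for memcached
locks = set()  # keys with a refresh in flight; stand-in for memcached add()

def get(key, query, now=None):
    """Serve live if fresh; while one caller refreshes an expired entry,
    serve everyone else the stale copy instead of stampeding the stores."""
    now = time.time() if now is None else now
    entry = cache.get(key)
    if entry is not None:
        value, written = entry
        if now - written < LIVE_TTL:
            return value                              # live hit
        if key in locks and now - written < STALE_TTL:
            return value                              # refresh in flight: stale
    locks.add(key)                                    # single refresher per key
    try:
        value = query()                               # SPARQL/XQuery stand-in
        cache[key] = (value, now)
    finally:
        locks.discard(key)
    return value
```

The design choice worth noting is that expiry does not mean unavailability: a popular aggregation is never queried more than once per TTL per data centre, however many audience requests arrive while it refreshes.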
Fig 12: DSP architecture combining SPARQL/XQuery, RDF store, and XML Store
The Future: Fully Dynamic Publishing
Although the BBC Sport architecture enables static asset content aggregation and re-purposing based on dynamic triple-store RDF metadata, it currently does not support dynamic rendering of editorially authored assets.
Assets such as stories are currently statically published rendering them fixed and immutable.
The refreshed BBC Sports site will eventually require content objects to be cut-up, arranged and rendered with respect to state changes and persona.
The ability to render all content object fragments by state and indeed metadata concepts will enable the BBC Sport website to facilitate personalised, event-driven pages with greater flexibility than is currently achievable. A re-usable content API which contextualises content objects for device and platform will enable the BBC to create new outputs and open the BBC archive to the public.
The DSP architecture (Figure 6) will now take a final evolution - deprecating the static, fixed asset publication in preference for dynamic content object renditions.
Content objects will be dynamically rendered on a request-by-request basis rather than 'fixed-in-time' static publication.
Textual content objects such as stories, and editorially authored indexes such as the football home page, will be made persistent within the schema-independent content store.
The content store supports fine-grained XQuery, enabling search, versioning, and access control.
All editorially authored content objects such as stories and manually managed indices will also be stored within the content store.
The content store is horizontally scalable and allows content to be handled in discrete chunks, supporting the cutting up and re-purposing of fine-grained content. Each content object within the content store will be modelled as a discrete document with no interrelationships.
Discrete content objects are to be modelled and referenced via the asset ontology RDF within the triple-store.
Triple-store SPARQL is used to locate, query and search for documents by concept, providing all the aggregation and inference functionality required.
The content store is used for fast, scalable queryable and searchable access to the raw content object data while the triple-store continues to provide access to asset references and associated domain models.
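The division of labour above amounts to a two-phase lookup: the triple-store answers "which assets are about this concept?", and the content store then returns the documents themselves. A minimal sketch, with hypothetical in-memory stand-ins for both stores and invented asset/concept identifiers:

```python
# Triple-store stand-in: (subject, predicate, object) statements linking
# asset identifiers to the concepts they are "about".
triples = [
    ("/asset/1", "about", "/concept/frank-lampard"),
    ("/asset/1", "about", "/concept/england-squad"),
    ("/asset/2", "about", "/concept/england-squad"),
    ("/asset/3", "about", "/concept/rugby-union"),
]

# Content-store stand-in: asset id -> raw content object (document).
content_store = {
    "/asset/1": "<story>Lampard stars for England</story>",
    "/asset/2": "<story>England squad announced</story>",
    "/asset/3": "<story>Six Nations preview</story>",
}

def assets_about(concept):
    """Phase 1: locate asset references by concept, standing in for a
    SPARQL query such as  SELECT ?asset WHERE { ?asset :about <concept> }."""
    return [s for (s, p, o) in triples if p == "about" and o == concept]

def aggregate(concept):
    """Phase 2: fetch the referenced content objects from the content store."""
    return [content_store[a] for a in assets_about(concept)]
```

The point of the split is that each store is queried only for what it is good at: graph traversal and inference stay in the triple-store, while document retrieval, search and versioning stay in the content store.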
The Graffiti annotation tool UI currently only makes it possible for a journalist to annotate static content objects post-publication; it does not integrate with the CPS UI.
Using the Graffiti API within the CPS UI will soon unify and rationalise the journalist's toolset. Merging the Graffiti UI into the CPS UI will provide a single UI for the journalist, supporting the creation and annotation of documents within a single view.
Real-time concept extraction and suggestion will occur as the journalist authors and then publishes content.
Caching is fundamental to making the DSP platform scalable and performant. The API memcache strategy is augmented with HTTP caching between the PHP render layer and the API. The PHP layer also makes use of memcache for page module caching; all page fragments are cached at a Varnish ESI page assembly layer with corresponding HTTP caching. The site as a whole is also edge-cached for further scalability and resilience during very large traffic spikes.
A technical architecture that combines a document/content store with a triple-store proves an excellent data and metadata persistence layer for the BBC Sport site and indeed future builds including BBC News mobile.
- A triple-store provides a concise, accurate and clean implementation methodology for describing domain knowledge models.
- An RDF graph approach provides ultimate modelling expressivity, with the added advantage of deductive reasoning.
- SPARQL simplifies domain queries, with the associated underlying RDF schema being more flexible than a corresponding SQL/RDBMS approach.
- A document/content store provides schema flexibility, schema-independent storage, versioning, and search and query facilities across atomic content objects.
- Combining a model expressed as RDF referencing content objects in a scalable document/content-store provides a persistence layer that uses the best of both technical approaches.
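As a toy illustration of the deductive reasoning mentioned above: given the asserted triple "the England squad is a football team" and an ontology rule that every football team is a sports team, a triple-store can infer "the England squad is a sports team" without it ever being stated. A minimal forward-chaining sketch in Python, with hypothetical concept names (real BBC ontologies are far richer):

```python
# Asserted triples, echoing the "Frank Lampard plays for England" example.
asserted = {
    ("FrankLampard", "memberOf", "EnglandSquad"),
    ("EnglandSquad", "type", "FootballTeam"),
}

# Ontology rule (rdfs:subClassOf): every FootballTeam is a SportsTeam.
subclass_of = {"FootballTeam": "SportsTeam"}

def infer(triples):
    """Forward-chain the subclass rule until no new triples appear."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(triples):
            if p == "type" and o in subclass_of:
                new = (s, "type", subclass_of[o])
                if new not in triples:
                    triples.add(new)
                    changed = True
    return triples
```

A query for all sports teams would now match the England squad, even though that fact was never authored; this is the inference that lets aggregations be "edited by exception" rather than maintained by hand.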
This combination removes the shackles associated with traditional RDBMS approaches.
Using each data store for what it is best at creates a framework that scales and is ultimately flexible.
Replacing a static publishing mechanism with a dynamic request-by-request solution that uses a scalable metadata/data layer will remove the barriers to creativity for BBC journalists, designers and product managers, allowing them to make the very best use of the BBC's content.
Simplifying the authoring approach via metadata annotation opens this content up and increases the reach and value of the BBC's online content.
Finally, combining the triple approach with dynamic atomic documents as an architectural foundation simplifies the publication of pan-BBC content as "open linked data" between BBC systems and across the wider linked open data cloud.
Jem Rayfield is a lead architect in BBC Future Media, specifically focusing on News, Sport & Knowledge products.