
You’re browsing social media and you see something that doesn’t look right. What do you do? You can search for information about the source or the material itself but you may not find the answer. What if there was a technology-based solution that could help – some kind of signal that would reassure you that what you’re seeing hasn’t been tampered with or misdirected?

This question has been posed by many organisations and individuals in the last couple of years as the scourge of disinformation has grown. Now Project Origin, a collaboration involving the BBC, CBC/Radio-Canada, Microsoft and The New York Times, is working on a solution.

Essentially, we are seeking to repair the link in news provenance that has been broken by large-scale third-party content hosting. What do we mean by broken provenance? Most large social media platforms have features such as verified pages or accounts but outside of these, there are countless re-posts of content that was originally published by another person or organisation. In some cases, this content is simply re-uploaded and shared. In others, a re-upload is accompanied by some new context. Users also modify the content - for humour, for brevity, and in some cases, with malicious intent.

Our objective is to attach signals to content coming from publishers or originators so that consumers can be reassured about its source and about the fact that it has not been manipulated. It’s a huge task and we’re very much aware that others are doing excellent work in this space, as well as in the wider disinformation sphere. The Content Authenticity Initiative, for example, has carried out some excellent work, focusing, in the first instance, on securing the provenance of images from the point of capture.

We’ve divided the problem into three main areas: giving the content item an identifier, allowing it to carry that identifier with it on its journey, and safely storing the information that will allow it to be checked.

Firstly, each digital image, video or audio file is represented by a very specific sequence of bits – so specific that we can safely identify even the smallest differences from the content that was originally produced. These sequences of bits are, understandably, enormous, but thankfully we can lean on cryptographic hashing to represent each one as a short, fixed-length string produced by a secure hash algorithm. We can be confident that there is effectively a zero probability that two pieces of content share the same hash.
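As a sketch of this step, here is the hashing idea using Python’s standard `hashlib` (the payloads are illustrative stand-ins for real media files):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Return the SHA-256 digest of a content item as a short hex string."""
    return hashlib.sha256(data).hexdigest()

original = b"original video payload"
tampered = b"original video payload."  # a single extra byte

h1 = content_hash(original)
h2 = content_hash(tampered)

# Even a one-byte change produces a completely different hash value.
print(h1)
print(h1 == h2)  # False
```

The digest is only 32 bytes regardless of how large the media file is, which is what makes it practical to store and compare.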

To know who generated the content hash, we need another tool – a key. Public-private asymmetric keys are in common use on today’s internet, helping us carry out e-commerce amongst other things. They allow a publisher to digitally sign a document linked to a piece of content – containing, for example, data about the content and the hashes that represent it – creating something we call a manifest. Again, maths is our hero here, with some complex cryptography ensuring that only the person with the private key could have signed the manifest, and that this can be verified using the corresponding public key.
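The sign-then-verify round trip can be sketched with textbook RSA. The tiny demo parameters below (p=61, q=53) and the manifest fields are purely illustrative – a real system would use a vetted cryptographic library and keys thousands of bits long:

```python
import hashlib
import json

# Classic textbook RSA demo parameters (p=61, q=53): NOT secure, illustration only.
N, E, D = 3233, 17, 2753   # public modulus, public exponent, private exponent

def sign(manifest: dict) -> int:
    """Hash the manifest's canonical JSON, then apply the private exponent."""
    digest = hashlib.sha256(json.dumps(manifest, sort_keys=True).encode()).digest()
    m = int.from_bytes(digest, "big") % N   # shrink to the toy key's message space
    return pow(m, D, N)

def verify(manifest: dict, signature: int) -> bool:
    """Anyone holding only the public key (N, E) can check the signature."""
    digest = hashlib.sha256(json.dumps(manifest, sort_keys=True).encode()).digest()
    return pow(signature, E, N) == int.from_bytes(digest, "big") % N

# Hypothetical manifest fields for illustration.
manifest = {"publisher": "Example News", "content_hash": "9f86d0..."}
sig = sign(manifest)
print(verify(manifest, sig))  # True: the manifest matches the signature
print(verify({**manifest, "publisher": "Imposter"}, sig))  # almost certainly False
```

Only the holder of the private exponent `D` could have produced `sig`, yet anyone can verify it – which is exactly the property a provenance manifest needs.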

The way a browser on a PC knows that this signature is bona fide is via a piece of standard internet functionality provided by a Certificate Authority – a trusted third party that checks that the public key it is being offered belongs to the right party.

Finally, at the heart of a provenance system we need a way of maintaining a reliable and consistent database of manifests. For Origin we plan to use the Microsoft Confidential Consortium Framework (CCF) as the core of manifest and receipt storage. The provenance system built around this, handling the various media registrations and queries, will be based on Microsoft’s AMP (Aether Media Provenance).

Unlike the permissionless blockchain solutions made famous by cryptocurrency, CCF is a ‘permissioned’ system. These are sometimes called ‘green blockchains’ since they do not need to consume large amounts of energy to determine consensus – there is enough trust between the parties controlling the system to allow the nodes to act on a much simpler basis.
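CCF’s internals are beyond the scope of this post, but the core idea of a tamper-evident manifest ledger can be sketched: each entry commits to the hash of the entry before it, so rewriting any past entry breaks every later link. This is a stand-alone illustration of the principle, not CCF’s actual API:

```python
import hashlib

class ManifestLedger:
    """A minimal tamper-evident log: each entry commits to its predecessor."""

    def __init__(self):
        self.entries = []  # list of (manifest_hash, chain_hash) pairs

    def append(self, manifest_hash: str) -> str:
        prev = self.entries[-1][1] if self.entries else "genesis"
        chain_hash = hashlib.sha256((prev + manifest_hash).encode()).hexdigest()
        self.entries.append((manifest_hash, chain_hash))
        return chain_hash  # serves as a 'receipt' for this registration

    def verify(self) -> bool:
        """Recompute the whole chain; a rewritten entry breaks every later link."""
        prev = "genesis"
        for manifest_hash, chain_hash in self.entries:
            expected = hashlib.sha256((prev + manifest_hash).encode()).hexdigest()
            if expected != chain_hash:
                return False
            prev = chain_hash
        return True

ledger = ManifestLedger()
ledger.append("hash-of-manifest-1")
ledger.append("hash-of-manifest-2")
print(ledger.verify())  # True

ledger.entries[0] = ("forged-manifest", ledger.entries[0][1])
print(ledger.verify())  # False: history cannot be quietly rewritten
```

In a permissioned system like CCF, a small set of known nodes maintains this kind of chained log, so no energy-hungry proof-of-work is needed to agree on its contents.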

To sum up, we are developing a machine-readable way of representing data about a content item that allows a publisher to tie or ‘bind’ the specific content item to that data and have it stored safely for future retrieval by a user.

So what’s next? On the technology front we’re determining how to ensure that the content, its manifest and the cryptographic binding – the signed hashes and certificates that link the content you have to the details – are all conveyed together. We’re also working on what to do when data is not present or has been altered. What happens, for example, if content has been clipped or transcoded in a useful and legitimate way?

We’re also keen to determine how this kind of technology can help in a wider media and technology community where there are many tools operated by a range of different organisations. An important element of our work has been trying to understand the APIs or common interfaces that might be standardised so a single device can discover and query different systems - including those used for content creation. And we’re launching a formal standards effort to define APIs and systems specifications for media provenance across the whole media ecosystem.
