
Recycling clicks to benefit humanity


Luis von Ahn | 14:04 UK time, Wednesday, 2 September 2009

(Luis von Ahn is Professor of Computer Science at Carnegie Mellon University, developing applications and programs which harness the combined computational power of humans and computers to solve large-scale problems. The following post is published with kind permission and represents Luis' views; this does not necessarily reflect the views of the BBC or the Digital Revolution production.)

At the height of its construction, 44,733 people worked on the Panama Canal. The Great Pyramid of Giza required 50,000 workers and the Apollo Project 400,000. No matter what you put on this list, humanity's largest achievements have been accomplished with fewer than a few hundred thousand workers because it has been impossible to assemble more people to work together - until now. With the Internet, we can coordinate the efforts of millions or even billions of humans. If 400,000 people put a man on the moon, what can we do with 400 million? That's the question that motivates my work.
 
An example of this is the reCAPTCHA project, in which hundreds of millions of people have helped digitize books by solving CAPTCHAs on the Internet. CAPTCHAs are widespread security measures that you've all seen: images of squiggly characters on the Web that people must type to obtain free email accounts and access to other sites. By asking humans to do a task that computers cannot, CAPTCHAs prevent automated programs from abusing online services.

For example, CAPTCHAs prevent scalpers from writing programs to buy millions of tickets for concerts or sporting events. It is estimated that over 200 million CAPTCHAs are typed every day, each taking roughly ten seconds of human effort - that's more than 500,000 hours a day. ReCAPTCHA recycles this human mental effort for a second purpose: transcribing books.
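
As a rough check on the daily-effort figure, the arithmetic can be written out. This is only a sketch: the two inputs are the estimates quoted in the post, and the function and variable names are illustrative rather than taken from any reCAPTCHA code.

```python
# Estimates quoted in the post (both are approximate).
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10

SECONDS_PER_HOUR = 3600


def hours_of_effort_per_day(captchas_per_day, seconds_each):
    """Total human time spent typing CAPTCHAs in one day, in hours."""
    return captchas_per_day * seconds_each / SECONDS_PER_HOUR


hours = hours_of_effort_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

The exact product is a little over 555,000 hours, which the post rounds down to 500,000.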

Physical books and other texts written before the computer age are currently being digitized en masse (e.g., by Google Books and the Internet Archive) to preserve human knowledge and make information more accessible. The pages are photographically scanned and then computers must decipher each word in the scanned images in order to index the books and allow people to search through them. Unfortunately, computers are not perfect at deciphering this text. In older prints where the ink has faded, computers cannot recognize about 30 per cent of the words. On the other hand, humans are extremely accurate at doing this.
 
ReCAPTCHA demonstrates that old print material can be transcribed, one word at a time, by people typing CAPTCHAs on the Internet. Whereas the original CAPTCHAs displayed images of random characters rendered by a computer, reCAPTCHA displays words taken from scanned texts that computers could not decipher. The solutions entered by humans are then used to improve the digitization process.

It is important, of course, that the ultimate purpose of clicks online be revealed to the users. Sites using reCAPTCHA display a message that the words entered are being used to digitize books.
 
To date, over 400 million people - 6% of humanity! - have helped transcribe at least one word through reCAPTCHA, making it perhaps the largest example of massive collaboration in the history of humanity.

captcha.jpg

 
Image above: the reCAPTCHA system displays words from scanned texts to humans on the World Wide Web. In this example, the word 'morning' was unrecognizable by the computer. reCAPTCHA isolated the word, distorted it using random transformations including adding a line through it, and then presented it as a challenge to a user. Since the original word ('morning') was not recognized by the computer, another word for which the answer was known ('overlooks') was also presented to determine if the user entered the correct answer.
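
The two-word check described in the caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual reCAPTCHA implementation: the class name, the word identifier and the rule that a transcription is accepted after three agreeing answers are all hypothetical, since the post does not say how answers are combined.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: the post does not say how many agreeing
# answers are needed before a transcription is trusted.
VOTES_NEEDED = 3


class WordPairChallenge:
    """Pairs a known control word with a word the computer could not read."""

    def __init__(self, votes_needed=VOTES_NEEDED):
        self.votes_needed = votes_needed
        # unknown word id -> tally of what users typed for it
        self.votes = defaultdict(Counter)

    def submit(self, unknown_id, control_answer, typed_control, typed_unknown):
        """Return True if the user typed the control word correctly."""
        if typed_control.strip().lower() != control_answer.lower():
            return False  # failed the known word: discard the other answer
        self.votes[unknown_id][typed_unknown.strip().lower()] += 1
        return True

    def transcription(self, unknown_id):
        """Return the agreed reading of the unknown word, or None if unsettled."""
        top = self.votes[unknown_id].most_common(1)
        if top and top[0][1] >= self.votes_needed:
            return top[0][0]
        return None


# Usage: 'overlooks' is the control word, 'scan-001' stands for the unread word.
pool = WordPairChallenge()
for typed in ["morning", "morning", "moming", "morning"]:
    pool.submit("scan-001", "overlooks", "overlooks", typed)
pool.submit("scan-001", "overlooks", "overlocks", "mourning")  # rejected
print(pool.transcription("scan-001"))  # prints "morning"
```

Answers for the unread word only count when the same user gets the control word right, which is what lets the known word vouch for the unknown one.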
 



Comments

  • Comment number 1.

    Interestingly / coincidentally this popped up today - advice on how to game Facebook's Captcha system, replete with mea culpa from the author for teaching people how to undermine an internet good.

    Must be something in the air...

  • Comment number 2.

    1. For whom is all this free work being done? Bill Thompson wrote a fairly compelling argument recently against the whole Google Books project. Why is one company getting the benefit of everyone's work? Presumably this is one way in which we pay for free service, but in doing so we're cementing the monopoly on knowledge that one organisation is close to achieving. The internet as a brutal free market is one issue, the internet as a winner-takes-all market is another.

    2. Whatever happened to Project Gutenberg? I know they still exist, but why does Google get all the press coverage? Total dominance of the market for journalists' attention? http://www.gutenberg.org

  • Comment number 3.

    Also, the moon shots involved a lot of people spread all across the USA. Communications made this possible, but The Great Wall of China was built over similar distances without such technological assistance. While the internet undoubtedly makes things easier, I think it's misleading to ignore the achievements of the past.

    Consider the city of ancient Rome, for instance. Materials and goods were brought thousands of kilometres and assembled to create a city and a culture.

    Is now a good time to look into the issue of self-organising systems? The tendency of apparently chaotic situations, in some circumstances, to result in the creation of a whole that is greater than the sum of the parts? This happens at all levels, from the sub-atomic, through the cellular (resulting in us) to the world around us.

    There is a theory that cities are living entities and we all function as specialised cells within that entity, unaware of its existence. Is the same thing happening online? http://en.wikipedia.org/wiki/Self-organization

  • Comment number 4.

    Here's the link I was looking for:
    http://www.powells.com/biblio/2-9780684868769-2

  • Comment number 5.

    @TaiwanChallenges 'There is a theory that cities are living entities and we all function as specialised cells within that entity, unaware of its existence. Is the same thing happening online?' Hugely interesting idea, one that I think Molly (of programme four) would be curious to look into further. Is your suggestion that the web is in some respects a larger organism which is utilising those that see it as their utility; not in some dark, SKYNET sense, but rather in an organic sense of humanity's interactions, interests and niches online creating 'cells' and 'organs' which perform specialist functions throughout the body of the web and internet, keeping it healthy (or in some cases attacking it from within)?

    We're looking for ways of mapping and visualising the web (particularly the web as opposed to the internet, which the DIMES project has done) - is there scope to visualise it not as a 'map' but as an anatomy?

  • Comment number 6.

    Dan, I have no idea what I'm talking about. But your description makes sense.

    Do you know anything about memetics? The idea of information or ideas having 'lives of their own' is really startling but well worth considering. Try http://www.ted.com/talks/lang/eng/susan_blackmore_on_memes_and_temes.html

    And here's a lady talking about anthills being alive in a sense. http://www.ted.com/talks/lang/eng/deborah_gordon_digs_ants.html

    Is a city any different? How about the internet?

    How many connections, how many exchanges of information on the internet? How does that compare to a brain?

    You're presumably aware of the SETI project? What would happen if all that processor power was doing something else when we weren't looking?

  • Comment number 7.

    Something else prompted by the reference to the Pyramids:

    What percentage of the population of ancient Egypt does 50,000 people represent? How about the 400,000 people working on the moonshots?

    Another ancient mega-project I dug up is the Grand Canal in China.
    "Between 1411 and 1415, a total of 165,000 laborers dredged the canal bed in Shandong"
    "By the year 735 it was recorded that about 149,685,400 kg of grain was shipped annually along the canal."
    "Ming Dynasty had to employ 47,004 full-time laborers recruited by the lijia corvée system in order to maintain the entire canal system. It is known that 121,500 soldiers and officers were needed simply to operate the 11,775 government grain barges in the mid 15th century."
    http://en.wikipedia.org/wiki/Grand_Canal_(China)

    In theory, the internet enables us to undertake much bigger projects. But how many people using the internet are actually producers of anything? What population would be required to provide the equivalent of a workforce able to build a canal 1800km long without heavy machinery?

    I read somewhere recently that 90% of internet users are consumers, 9% are minor producers, and the bulk of everything out there is produced by less than 1% of the population. And I'll guess that most of it is replicated, so you have most of your 1% reinventing the wheel.

    On the other hand, maybe that tiny percentage has a disproportionate effect? As long as someone invents the wheel, or Facebook, the benefit to everyone is greater than the combined efforts of everyone else.

    Still, I can't help wondering what could be achieved if there was a leader with a vision instead of an emphasis on bottom-up. What would happen if someone with the appropriate stature came forward and set a challenge that would keep our surplus energies and unrealised potentials put to good use for years to come?

    I'm talking mega-mega projects that everyone can contribute to, although I don't know what they would be.

  • Comment number 8.

    I'm talking mega-mega projects that everyone can contribute to, although I don't know what they would be.

    D'oh! Brain is fried after class. Of course I know what would be a good project. I already suggested it:
    http://www.mysociety.org/2009/08/16/historical-data-network/

  • Comment number 9.

    The interesting thing for me about this blog post is that I was totally unaware that this is what recaptcha is used for. I've seen it on a fair number of websites, but never took any notice. It leads me to wonder how many other useful and relevant 'things' to my job have passed right under my nose in the sea of internet info? And, alternatively, what useful nuggets have I missed on page 327 of the search engine results list because the other previous 1500 results have pinged the search engines more frequently and shoved themselves up to the top of the lists, even though they aren't relevant to my search queries at all?

  • Comment number 10.

    @GaryGSCC it is true that usefulness tends to stick around and slowly rise up. So if you miss it and it is truly useful, chances are that if you need something similar you may encounter it later.

    A while ago I said that Google should use reCaptcha, as kudos where kudos is due. Actually everyone should use reCaptcha, however that would mean perhaps slightly less innovation would occur in spam prevention. Competition is good as it drives innovation. However, reCaptcha is a great project and tool, possibly one of the best captcha services around today.

    The point here is that usefulness tends to float up :)

  • Comment number 11.

    @earthgecko I don't disagree with your comment about useful things rising to the top, but I suppose what I want is for them to magically make themselves known and appear at the top of the list straight away. Sometimes I search for something specifically and can't find it... then some time later I'll find the info I wanted ages ago, totally by fluke even though I wasn't looking for it. I know we have the wisdom of crowds on various bookmarking & social network sites that help achieve the immediacy, just by saying 'Hey! Look what I found.', but you need to make sure you're following the right crowd.

 

