reCAPTCHA
From Wikipedia, the free encyclopedia
reCAPTCHA is a system developed at Carnegie Mellon University that uses CAPTCHA to help digitize the text of books while protecting websites from bots attempting to access restricted areas.[1][2] reCAPTCHA is currently digitizing text from the Internet Archive and the archives of the New York Times.[3]
reCAPTCHA supplies subscribing websites with images of words that optical character recognition (OCR) software has been unable to read. The subscribing websites (whose purposes are generally unrelated to the book digitization project) present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures. They then return the results to the reCAPTCHA service, which sends the results to the digitization projects. This provides about the equivalent of 160 books per day, or 12,000 manhours per day of free labor (as of September 2008[update]).[1]
The system is reported to deliver 30 million images every day (as of December 2007[update]),[4] and counts such popular sites as Facebook, TicketMaster, Twitter and StumbleUpon amongst subscribers.[5] Craigslist began using reCAPTCHA in June 2008.[6] The U.S. National Telecommunications and Information Administration also uses reCAPTCHA for its digital TV converter box coupon program website as part of the US DTV transition.[7]
Contents |
[edit] History
The reCAPTCHA program originated with Guatemalan computer scientist Luis von Ahn, aided by a MacArthur Fellowship. An early CAPTCHA developer, he realized "he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles."[8]
[edit] Operation
Scanned text is subjected to analysis by two different optical character recognition programs; in cases where the programs disagree, the questionable word is converted into a CAPTCHA. The word is displayed along with a control word already known. The identification performed by each OCR program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification hits 2.5 votes, the word is considered called. Those words that are consistently given a single identity by human judges are recycled as control words.[9]
[edit] Implementation
reCAPTCHA tests are taken from the central site of the reCAPTCHA project[2] as they are supplying the undecipherable words. This is done through a Javascript API with the server making a callback to reCAPTCHA after the request has been submitted. The reCAPTCHA project provides libraries for various programming languages and applications to make this process easier. reCAPTCHA is a free service (that is, the CAPTCHA images are provided to websites free of charge, in return for assistance with the decipherment).[10]
[edit] Mailhide
reCAPTCHA has also created project Mailhide[11] which protects email addresses from being harvested by spambots. The email address is converted into a format that does not allow a crawler to see the full email address. For example, the email "noreply@example.com" would be converted to "nor...@example.com". The visitor would then click on the "..." and solve the CAPTCHA in order to obtain the full email address and others.
[edit] Notes
- ^ a b Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum (2008). "reCAPTCHA: Human-Based Character Recognition via Web Security Measures" (PDF). Science 321 (5895): 1465-1468. doi:. http://www.cs.cmu.edu/~biglou/reCAPTCHA_Science.pdf.
- ^ a b The reCAPTCHA project - part of the Carnegie Mellon School of Computer Science at Carnegie Mellon University.
- ^ "Learn more". reCAPTCHA.net. http://recaptcha.net/learnmore.html. Retrieved on 2008-11-23.
- ^ "reCAPTCHA". reCAPTCHA. 2009-03-18. http://recaptcha.net/.
- ^ Rubens, Paul (2007-10-02). "Spam weapon helps preserve books". BBC. http://news.bbc.co.uk/2/hi/technology/7023627.stm.
- ^ "Fight Spam, Digitize Books". Craigslist Blog. 2008-06. http://blog.craigslist.org/2008/06/fight-spam-digitize-books/.
- ^ TV Converter Box Program
- ^ Hutchinson, Alex (March 2009), "Human Resources: The job you didn't even know you had", The Walrus: 15-16
- ^ Timmer, John (2008-08-14). "CAPTCHAs work? for digitizing old, damaged texts, manuscripts". Ars Technica. http://arstechnica.com/news.ars/post/20080814-captchas-workfor-digitizing-old-damaged-texts-manuscripts.html. Retrieved on 2008-12-09.
- ^ "FAQ". reCAPTCHA.net. http://recaptcha.net/faq.html.
- ^ "Mailhide: Free Spam Protection". reCAPTCHA.net. http://mailhide.recaptcha.net/.