Search and web app giant Google announced yesterday on the Google Blog that it has acquired the human-verification system reCAPTCHA, which, aside from preventing bots from registering for web sites, also helps digitize books by having people confirm words that an optical character recognition (OCR) system was unable to recognize. The approach is being used to digitize books, including works by the likes of William Shakespeare and other classics, for future generations. The service started as an academic project at Carnegie Mellon but eventually became its own company.
What is a CAPTCHA? It stands for Completely Automated Public Turing test to tell Computers and Humans Apart. Basically, bots have a hard time reading these images of words, much like OCR systems do. So the reCAPTCHA project takes a known word and a word that OCR systems are having a hard time with, and puts them side by side for the human to read. Once enough people have identified the unknown book word the same way, it is added to the list of known words. The obvious weakness is that enough people could deliberately misidentify a word, but since users are never told which word is the known one and which is the unknown, that is difficult to pull off.
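The paired-word idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not reCAPTCHA's actual code: the `WordPool` class, the image ids, and the agreement threshold of 3 are all hypothetical names and values chosen for the example.

```python
from collections import Counter, defaultdict

# Hypothetical value: how many matching answers promote an unknown word.
AGREEMENT_THRESHOLD = 3


class WordPool:
    """Toy model of the known-word / unknown-word pairing (illustrative only)."""

    def __init__(self, known, threshold=AGREEMENT_THRESHOLD):
        self.known = dict(known)            # image id -> verified text
        self.votes = defaultdict(Counter)   # image id -> tally of proposed readings
        self.threshold = threshold

    def check(self, known_id, known_guess, unknown_id, unknown_guess):
        """Return True if the user passes; record their reading of the unknown word."""
        # The user is judged only on the known word.
        if self.known[known_id].lower() != known_guess.strip().lower():
            return False  # failed the test, so the other answer is ignored

        if unknown_id in self.known:
            return True   # this word was already promoted earlier

        # A user who got the known word right is trusted for one vote.
        reading = unknown_guess.strip().lower()
        self.votes[unknown_id][reading] += 1

        # Promote the word once the same reading has been given enough times.
        if self.votes[unknown_id][reading] >= self.threshold:
            self.known[unknown_id] = reading
            del self.votes[unknown_id]
        return True
```

For example, with `pool = WordPool({"img1": "morning"}, threshold=2)`, two users who both type "morning" for `img1` and "upon" for `img2` would cause `img2` to be added to `pool.known`. Because a wrong answer on the known word discards the whole submission, a prankster who cannot tell which word is which risks failing the test every time they mistype a word on purpose.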
Google’s interest in the project may stem from its Google Books property and its ever-growing collection of public domain works. The project also seems to fit Google’s general mantra of “Don’t be evil”: reCAPTCHA helps more than 100,000 web sites block spammers, bots, and other fraudsters. Google’s blog post points to other benefits of digitizing books:
“Having the text version of documents is important because plain text can be searched, easily rendered on mobile devices and displayed to visually impaired users. So we’ll be applying the technology within Google not only to increase fraud and spam protection for Google products but also to improve our books and newspaper scanning process.”