Most of us who use the Internet have seen weird, twisted text on web pages in various forms and sizes. Although we have seen it often and grown familiar with it, only a few of us know that it is called a CAPTCHA. In this article, we will study the concept of CAPTCHA, its applications in the cyber world and its imperfections with respect to security.
You’ve probably seen them: colourful images with distorted text in them at the bottom of web registration forms. CAPTCHAs are used by Yahoo, Hotmail, PayPal and many other popular websites to prevent automated registrations, and they work because no computer program can currently read distorted text as well as humans can. What you probably don’t know is that a CAPTCHA is something more than just an image with distorted text. It is a test, any test, that can be automatically generated, that most humans can pass, but that current computer programs cannot pass. Notice the paradox: a CAPTCHA is a program that can generate and grade tests that it itself cannot pass.
CAPTCHAs were created in response to bots that automatically fill in web forms as if they were individual users. Bots are used to overload opinion polls, steal passwords through dictionary attacks and, most commonly, register thousands of free e-mail accounts to be used for sending spam. It is for this very reason that CAPTCHAs were designed: to prevent non-humans from performing such transactions.
What is CAPTCHA?
A CAPTCHA is a challenge-response test most often placed within web forms to determine whether the user is human. The process usually involves one computer (usually a server) asking a user to complete a simple test which the computer is able to generate and grade. Because other computers are assumed to be unable to solve the CAPTCHA, any user entering a correct solution is presumed to be human. Thus, it is sometimes described as a reverse Turing test, because it is administered by a machine and targeted at a human, in contrast to the standard Turing test, which is typically administered by a human and targeted at a machine. A common type of CAPTCHA requires the user to type letters or digits from a distorted image that appears on the screen.
Fig. 1: Implementation of CAPTCHA in a Web page
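The generate-and-grade cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names new_challenge and grade, the in-memory dictionary and the six-character alphabet are all hypothetical, and the step that renders the text as a distorted image is left out.

```python
import hmac
import secrets
import string

# Hypothetical in-memory store mapping challenge IDs to expected answers.
_pending = {}

ALPHABET = string.ascii_uppercase + string.digits


def new_challenge(length=6):
    """Create a challenge and return (challenge_id, text_to_render).

    The caller is assumed to render the text as a distorted image.
    """
    text = "".join(secrets.choice(ALPHABET) for _ in range(length))
    challenge_id = secrets.token_hex(8)
    _pending[challenge_id] = text
    return challenge_id, text


def grade(challenge_id, response):
    """Grade a response; the stored answer is discarded after one attempt."""
    expected = _pending.pop(challenge_id, None)
    if expected is None:
        return False
    answer = response.strip().upper()
    return hmac.compare_digest(expected.encode(), answer.encode())
```

The server keeps the answer and sends only the challenge ID and the image to the browser, so the client never sees the solution in any form.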
The Term
The term "CAPTCHA" was coined in 2000 by Luis von Ahn, Manuel Blum and Nicholas J. Hopper of Carnegie Mellon University, and John Langford, who was then working at IBM. It is a contrived acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart." Carnegie Mellon University attempted to trademark the term, but the trademark application was abandoned on 21 April 2008. Currently, the CAPTCHA creators recommend the use of reCAPTCHA as the official implementation. reCAPTCHA is dealt with later in this article.
The Origination
Moni Naor was the first person to theorize a list of ways to verify that a request comes from a human and not a bot. Primitive CAPTCHAs seem to have been developed in 1997 at AltaVista by Andrei Broder and his colleagues to prevent bots from adding URLs to their search engine. In order to make the images resistant to OCR (Optical Character Recognition), the team simulated the situations that scanner manuals claimed resulted in bad OCR. In 2000, Luis von Ahn, Manuel Blum and their co-authors coined the term 'CAPTCHA', and improved and publicized the notion, which included any program that can distinguish humans from computers. They invented multiple examples of CAPTCHAs, including the first CAPTCHAs to be widely used, which were those adopted by Yahoo!.
CAPTCHA and Security
The purpose of CAPTCHA is to block form submissions from spambots, automated scripts that harvest email addresses from publicly available web forms. CAPTCHAs are used to prevent bots from using various types of computing services. These applications include preventing bots from taking part in online polls and from registering for free email accounts, which may then be used to send spam. More recently, CAPTCHAs have been used to prevent bot-generated spam by requiring that an unrecognized sender pass a CAPTCHA test before the email message is delivered.
Characteristics
A CAPTCHA system is a means of automatically generating new challenges which:
- current software is unable to solve accurately;
- most humans can solve; and
- do not rely on the type of CAPTCHA being new to the attacker.
Although a checkbox "check here if you are not a bot" might serve to distinguish between humans and computers, it is not a CAPTCHA, because it relies on the fact that an attacker has not spent effort to break that specific form.
Withholding the algorithm can increase the integrity of a limited set of systems, as in the practice of security through obscurity. The most important factor in deciding whether an algorithm should be made open or restricted is the size of the system. Although an algorithm which survives scrutiny by security experts may be assumed to be more conceptually secure than an unevaluated algorithm, an unevaluated algorithm specific to a very limited set of systems is usually of less interest to those engaging in automated abuse. Breaking a CAPTCHA generally requires some effort specific to that particular CAPTCHA implementation, and an abuser may decide that the benefit granted by automated bypass is outweighed by the effort required to abuse that system in the first place.
Fig. 2: Early CAPTCHAs such as these, generated by the EZ-Gimpy program, were used on Yahoo!.
Applications
CAPTCHAs are used to prevent automated software from performing actions which degrade the quality of service of a given system, whether due to abuse or resource expenditure. Although CAPTCHAs are most often deployed as a response to encroachment by commercial interests, the notion that they exist to stop only spammers is mistaken. CAPTCHAs can be deployed to protect systems vulnerable to e-mail spam, such as the webmail services of Gmail, Yahoo! Mail and Hotmail. CAPTCHAs have also found active use in stopping automated posting to blogs and forums, whether as a result of commercial promotion, or harassment and vandalism. CAPTCHAs also serve an important function in rate limiting, as automated usage of a service might be desirable until such usage is done in excess, and to the detriment of human users. In such a case, a CAPTCHA can enforce automated usage policies as set by the administrator when certain usage metrics exceed a given threshold. The article rating systems used by many news web sites are another example of an online facility vulnerable to manipulation by automated software.
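The rate-limiting use mentioned above, where a CAPTCHA is shown only after usage exceeds a threshold, might look like the following sketch. The class name CaptchaGate, the sliding-window approach and the default limits are assumptions made for illustration, not a description of any particular site.

```python
from collections import defaultdict, deque


class CaptchaGate:
    """Ask for a CAPTCHA once a client exceeds max_requests per time window."""

    def __init__(self, max_requests=5, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request times

    def needs_captcha(self, client_id, now):
        """Record a request made at time `now` and report whether to challenge."""
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        return len(hits) > self.max_requests
```

A client that stays under the limit never sees a challenge, so light automated use remains possible while heavy use has to prove that a human is present.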
In summary, CAPTCHAs have several applications for practical security, including (but not limited to):
* Preventing Comment Spam in Blogs: Most bloggers are familiar with programs that submit bogus comments, usually for the purpose of raising search engine ranks of some website (e.g., "buy penny stocks here"). This is called comment spam. By using a CAPTCHA, only humans can enter comments on a blog. There is no need to make users sign up before they enter a comment, and no legitimate comments are ever lost!
* Protecting Website Registration: Several companies (Yahoo!, Microsoft, etc.) offer free email services. Up until a few years ago, most of these services suffered from a specific type of attack: "bots" that would sign up for thousands of email accounts every minute. The solution to this problem was to use CAPTCHAs to ensure that only humans obtain free accounts. In general, free services should be protected with a CAPTCHA in order to prevent abuse by automated scripts.
* Protecting Email Addresses From Scrapers: Spammers crawl the Web in search of email addresses posted in clear text. CAPTCHAs provide an effective mechanism to hide your email address from Web scrapers. The idea is to require users to solve a CAPTCHA before showing your email address.
* Online Polls: In November 1999, http://www.slashdot.org released an online poll asking which was the best graduate school in computer science (a dangerous question to ask over the web!). As is the case with most online polls, the IP addresses of voters were recorded in order to prevent single users from voting more than once. However, students at Carnegie Mellon University (CMU) found a way to stuff the ballots using programs that voted for CMU thousands of times. CMU's score started growing rapidly. The next day, students at MIT wrote their own program and the poll became a contest between voting "bots." MIT finished with 21,156 votes, Carnegie Mellon with 21,032 and every other school with fewer than 1,000. Can the result of any online poll be trusted? Not unless the poll ensures that only humans can vote.
* Preventing Dictionary Attacks: CAPTCHAs can also be used to prevent dictionary attacks in password systems. The idea is simple: prevent a computer from being able to iterate through the entire space of passwords by requiring it to solve a CAPTCHA after a certain number of unsuccessful logins. This is better than the classic approach of locking an account after a sequence of unsuccessful logins, since doing so allows an attacker to lock accounts at will.
* Search Engine Bots: It is sometimes desirable to keep webpages unindexed to prevent others from finding them easily. There is an HTML tag to prevent search engine bots from reading web pages. The tag, however, doesn't guarantee that bots won't read a web page; it only serves to say "no bots, please." Search engine bots, since they usually belong to large companies, respect web pages that don't want to allow them in. However, in order to truly guarantee that bots won't enter a web site, CAPTCHAs are needed.
* Worms and Spam: CAPTCHAs also offer a plausible solution against email worms and spam: "I will only accept an email if I know there is a human behind the other computer." A few companies are already marketing this idea.
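The dictionary-attack defence in the list above can be sketched as follows: instead of locking the account, the login form starts demanding a solved CAPTCHA after a few failures. The class LoginGuard, its method names and the threshold of three free attempts are hypothetical choices for this illustration.

```python
class LoginGuard:
    """Require a CAPTCHA after a number of failed logins, without locking the account."""

    def __init__(self, free_attempts=3):
        self.free_attempts = free_attempts
        self._failures = {}  # username -> consecutive failed logins

    def captcha_required(self, username):
        return self._failures.get(username, 0) >= self.free_attempts

    def record_attempt(self, username, password_ok, captcha_ok=False):
        """Return True if the login is allowed."""
        if self.captcha_required(username) and not captcha_ok:
            # Challenge not solved: reject without considering the password.
            return False
        if password_ok:
            self._failures.pop(username, None)
            return True
        self._failures[username] = self._failures.get(username, 0) + 1
        return False
```

The legitimate owner can still log in at any time by solving the challenge, which is the advantage over account lockout that the list item describes.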
User Accessible CAPTCHAs
There have been various attempts at creating CAPTCHAs that are more accessible. Because CAPTCHAs rely on visual perception, users who are unable to view a CAPTCHA (for example, due to a disability or because it is difficult to read) will be unable to perform the task it protects. Even a CAPTCHA offered in both audio and visual form will require manual intervention for some users, such as those who are both visually impaired and deaf.
Therefore, websites implementing CAPTCHAs may provide an audio version of the CAPTCHA in addition to the visual method. The official CAPTCHA website recommends providing an audio CAPTCHA for accessibility reasons.
Other attempts at more accessible CAPTCHAs include the use of JavaScript, mathematical questions ("what is 1+1", or more complex problems such as derivatives or polynomial factorization; this variant is also known as a MAPTCHA, or Mathematical CAPTCHA), and "common sense" questions ("what color is the sky on a clear day").
Fig. 3: Another way to make segmentation difficult is to crowd symbols together.
Limitations
As with any security system, design flaws in an implementation can prevent the theoretical security from being realized. Many CAPTCHA implementations, especially those which have not been designed and reviewed by experts in the field of security, are prone to common attacks.
Some CAPTCHA protection systems can be bypassed without using OCR, simply by re-using the session ID of a known CAPTCHA image. A correctly designed CAPTCHA does not allow multiple solution attempts at one CAPTCHA; this prevents both the reuse of a correct solution and a second guess after an incorrect OCR attempt. Other CAPTCHA implementations pass the client a hash of the solution (such as an MD5 hash) as a key for validating the response. Often the space of possible solutions is small enough that this hash can be cracked by brute force, and the hash can also assist an OCR-based attempt by confirming a guess. A more secure scheme would use an HMAC. Finally, some implementations use only a small fixed pool of CAPTCHA images. Eventually, when an attacker has collected enough CAPTCHA image solutions over a period of time, the CAPTCHA can be broken by simply looking up solutions in a table, based on a hash of the challenge image.
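A minimal sketch of the HMAC-based scheme, combined with the one-attempt rule, is given below. It assumes a secret key known only to the server and a per-challenge random nonce; the names SERVER_KEY, issue_token and verify, the in-memory set of spent nonces and the choice of SHA-256 are illustrative assumptions rather than a reference design.

```python
import hashlib
import hmac
import secrets

SERVER_KEY = secrets.token_bytes(32)  # hypothetical per-deployment secret
_used_nonces = set()                  # hypothetical store of spent challenges


def _tag(nonce, text):
    """Keyed hash binding a solution to one specific challenge."""
    message = f"{nonce}:{text}".encode()
    return hmac.new(SERVER_KEY, message, hashlib.sha256).hexdigest()


def issue_token(solution):
    """Return (nonce, tag) to send to the client alongside the image."""
    nonce = secrets.token_hex(16)
    return nonce, _tag(nonce, solution)


def verify(nonce, tag, response):
    """Accept at most one attempt per nonce, right or wrong."""
    if nonce in _used_nonces:
        return False
    _used_nonces.add(nonce)
    expected = _tag(nonce, response)
    return hmac.compare_digest(expected.encode(), tag.encode())
```

Unlike a bare MD5 of the solution, the tag cannot be recomputed offline without the key, so an attacker can neither brute-force the small solution space nor use the value to check an OCR guess.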
There are a few approaches to defeating CAPTCHAs. They are:
- exploiting bugs in the implementation that allow the attacker to completely bypass the CAPTCHA;
- improving character recognition software; or
- using cheap human labor to process the tests.
Trouncing CAPTCHAs through Puzzles
CAPTCHAs are vulnerable to a relay attack that uses humans to solve the puzzles. One approach involves relaying the puzzles to a group of human operators who can solve CAPTCHAs. In this scheme, a computer fills out a form and, when it reaches a CAPTCHA, hands the CAPTCHA to a human operator to solve.
Another variation of this technique involves copying the CAPTCHA images and using them as CAPTCHAs for a high-traffic site owned by the attacker. With enough traffic, the attacker can get a solution to the CAPTCHA puzzle in time to relay it back to the target site. In October 2007, a piece of malware appeared in the wild which enticed users to solve CAPTCHAs in order to see progressively further into a series of indecent images.
Projects Defeating CAPTCHAs
A number of research projects have attempted (often with success) to beat visual CAPTCHAs by creating programs that contain the following functionality:
1. Pre-processing: Removal of background clutter and noise.
2. Segmentation: Splitting the image into regions which each contain a single character.
3. Classification: Identifying the character in each region.
Steps 1 and 3 are easy tasks for computers. The only step where humans still outperform computers is segmentation. If the background clutter consists of shapes similar to letter shapes, and the letters are connected by this clutter, segmentation becomes nearly impossible with current software. Hence, an effective CAPTCHA should focus on making segmentation difficult.
Several research projects have broken real-world CAPTCHAs, including one of Yahoo's early CAPTCHAs called "EZ-Gimpy" and the CAPTCHAs used by popular sites such as PayPal and LiveJournal, by phpBB, and by other open source solutions. In January 2008 Network Security Research released their program for automated Yahoo! CAPTCHA recognition. Windows Live Hotmail and Gmail, the other two major free email providers, were cracked shortly after.
In February 2008 it was reported that spammers using a bot had achieved a success rate of 30% to 35% in responding to CAPTCHAs for Microsoft's Live Mail service, and a success rate of 20% against Google's Gmail CAPTCHA. A Newcastle University research team has defeated the segmentation part of Microsoft's CAPTCHA with a 90% success rate, and claims that this could lead to a complete crack with a success rate greater than 60%.
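A toy segmenter makes the point about segmentation concrete. The sketch below assumes an already cleaned-up binary image, represented (purely for illustration) as a list of equal-length strings with '#' for ink and '.' for background, and splits it wherever it finds a completely blank column. The function name segment_columns is hypothetical, and real attacks use far more elaborate methods.

```python
def segment_columns(image):
    """Split a binary image into (start, end) column ranges separated by blank columns.

    image: list of equal-length strings, '#' for ink and '.' for background.
    """
    width = len(image[0])
    # True for every column that contains at least one ink pixel.
    ink = [any(row[x] == "#" for row in image) for x in range(width)]
    segments, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                      # a new glyph begins
        elif not has_ink and start is not None:
            segments.append((start, x))    # a blank column ends the glyph
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


spaced = ["##..##",
          "#...#.",
          "##..##"]
crowded = ["####",
           "#.#.",
           "####"]
```

On the spaced image the function finds two regions, one per glyph; on the crowded image, where the glyphs touch, it returns a single region, which is exactly the failure that crowding and connecting clutter are meant to cause.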
AI and CAPTCHA
CAPTCHA tests are based on open problems in Artificial Intelligence (AI). Decoding images of distorted text, for instance, has long been considered beyond the capabilities of computer programs. CAPTCHAs therefore offer well-defined challenges for the AI community, and induce security researchers, as well as otherwise malicious programmers, to work on advancing the field of AI. CAPTCHAs are thus a win-win situation: either a CAPTCHA is not broken and there is a way to differentiate humans from computers, or the CAPTCHA is broken and an AI problem is solved.
reCAPTCHA
Some of the original inventors of the CAPTCHA system have implemented a means by which some of the effort and time spent by people responding to challenges can be harnessed as a distributed work system. This system, called reCAPTCHA, works by including both "solved" and "unrecognized" elements in each challenge; the unrecognized elements are images which were not successfully recognized via Optical Character Recognition (OCR). The respondent answers both elements, so roughly half of his or her effort validates the challenge while the other half is captured as work.
Fig. 4: An example of a reCAPTCHA challenge, containing the words "following finding".
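The two-element idea can be illustrated with a short sketch: the known word decides whether the respondent passes, and the answer for the unrecognized word is recorded as work. The function names, the vote tally and the rule of accepting a transcription after two matching answers are assumptions made for this example; the description above does not say how reCAPTCHA itself combines answers.

```python
from collections import Counter, defaultdict

# Hypothetical tally of human transcriptions per unrecognized word image.
_transcriptions = defaultdict(Counter)


def grade_pair(known_answer, unknown_id, response_known, response_unknown):
    """Pass or fail on the known word; on a pass, record the other word as a vote."""
    passed = response_known.strip().lower() == known_answer.lower()
    if passed:
        _transcriptions[unknown_id][response_unknown.strip().lower()] += 1
    return passed


def best_guess(unknown_id, min_votes=2):
    """Return the most common transcription once it has enough votes, else None."""
    votes = _transcriptions[unknown_id]
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None
```

In this sketch, answers from respondents who fail the known word are discarded, so the recorded transcriptions come only from users who have already shown that they can read distorted text.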
The Battle of Bots and CAPTCHAs
After CAPTCHAs were deployed in 2001, malicious bots were updated to analyze the distorted text and enter the correct answer, rendering many CAPTCHA styles ineffective. In the ongoing battle between bots and CAPTCHAs, CAPTCHA text has become increasingly distorted and camouflaged, often making it difficult even for humans to decode. Non-text approaches have been added; for example, displaying several images and asking what object is common to all of them, such as a tree or a dog.
Thus the battle between bots and CAPTCHAs continues. But the question on every net user's mind is: will the outcome of this battle be for better or for worse?
By: R. Manoj, Assistant Editor, 'InfoSecurity' magazine, Fanatic Media.