Scraping: Data theft is scaling up

Attacks against web applications have expanded in scope—from attempts to extract credit card information from e-commerce sites to scraping entire libraries of valuable information from subscription-based sites.

While such attacks fall into a gray area and are not always illegal, they can cause significant damage to web-based businesses and their customers. Often what's at stake is an organization's intellectual property. Companies that do business on the web need to understand how scraping can undermine the very foundation of their business.

For example, a hacker could log on to a research firm's website and launch an automated tool to extract volumes of information quickly and effortlessly. If the hacker were to make that information available free-of-charge on the web, it would render the firm's research library a valueless commodity overnight and destroy the business.

If launched against a business networking site, such an attack could collect personal information that members intended to share with other subscribers only by permission. By making the information publicly available, the hacker would not only negate the viral marketing model of the site but also expose private contact information and activities, such as a job search.

Many of the early reported scraping attacks were launched to grab email addresses from websites for later use in spam campaigns. Today, scraping attacks are expanding into a form of automated intellectual property theft.

Attackers can scrape websites by first creating a legitimate account on the web application. Once logged in, they launch an automated tool, often called a "robot" or a "bot," to extract in bulk information that was intended to be served one record at a time to a legitimate subscriber. Hackers may also rent a "botnet" (a large herd of hacked computers available for hire to the highest bidder) to accelerate the attack.

Although potentially harmful, the simple act of scraping a website is not always malicious or illegal. Search engines and shopping comparison websites use "good" bots to crawl websites—a welcome form of scraping. Likewise, using software to grab graphics and information from websites for inclusion in a slide presentation is common and usually harmless. Perhaps a bit more insidious is the e-commerce marketer scraping competitors' sites for pricing information.

Scraping becomes problematic when an attacker purloins web-based information under subscription to share free-of-charge. More complex attacks combine scrapes of intellectual property with probes for security holes that have left the company vulnerable to hacking. In addition to stealing intellectual property and uncovering security issues, a large-scale scrape may also pose performance issues similar to a denial-of-service attack.

Because scraping is not always harmful and can have a multitude of purposes and results, there is no off-the-shelf solution to the problem. An organization cannot simply block all automated scraping attempts because doing so would also shut out the "good" bots. As malware researchers and technology vendors identify these new hacker techniques, flexible new solutions are coming to market that address the gray areas.

Products that automatically learn the expected and correct use of a web application can accurately detect anomalous behavior. For instance, the WebDefend web application firewall can detect automated attack tools and prevent them from extracting valuable corporate information, while giving legitimate automated programs, such as search engine crawlers, access to the site. With such a tool, businesses have a simple way to continuously monitor and protect web applications from attack, while also uncovering coding errors that prevent web applications from functioning as designed.
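To make the idea of detecting anomalous behavior concrete, here is a minimal sketch of one simple signal: an account that requests records far faster than a human subscriber plausibly would. It is an illustration only and does not describe how WebDefend or any other product works. The class name, the thresholds, and the allowlist of client identifiers are all hypothetical, and the sketch assumes that the identity of a "good" bot has already been verified by some other means.

```python
from collections import defaultdict, deque


class ScrapeDetector:
    """Hypothetical sliding-window check for bulk record extraction."""

    def __init__(self, max_requests=30, window_seconds=60.0, allowlist=()):
        # Thresholds are illustrative assumptions, not recommended values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Clients assumed to be verified "good" bots, such as search crawlers.
        self.allowlist = set(allowlist)
        self._history = defaultdict(deque)

    def record(self, client_id, timestamp):
        """Log one record request; return True if the client looks automated."""
        if client_id in self.allowlist:
            return False
        hits = self._history[client_id]
        hits.append(timestamp)
        # Drop requests that have aged out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


detector = ScrapeDetector(max_requests=5, window_seconds=10.0,
                          allowlist={"search-crawler"})
# A client fetching a record every half second is flagged from its sixth
# request onward, while the allowlisted crawler is never flagged.
flags = [detector.record("subscriber-42", i * 0.5) for i in range(10)]
```

A fixed rate limit like this is far cruder than a product that learns normal usage. A patient attacker can stay under the threshold, and a botnet can spread requests across many accounts, so in practice a rate check would be only one of several signals.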