In 2011 and 2012, Realtor.com was under the gun to solve its problem with screen scrapers – sites that were “scraping” data off of its site and using it in unauthorized contexts. For those who haven’t been following industry news coverage of the topic, scraping is when someone copies large amounts of data from a website, either manually or with a script or program. There are two kinds of scraping: “legitimate scraping,” such as search engine robots that index the web, and “malicious scraping,” where someone engages in systematic theft of intellectual property in the form of data accessible on a website. Realtor.com spent hundreds of thousands of dollars to thwart malicious scraping, and spoke about the screen-scraping challenge our industry faces at a variety of industry conferences that year, starting with Clareity’s own MLS Executive Workshop. The takeaways from the Realtor.com presentations were as follows:
- The scrapers are moving from Realtor.com toward easier targets … to YOUR markets.
- The basic protections that used to work are no longer sufficient to protect against today’s sophisticated scrapers.
- It’s time to take some preventative steps at the local level – and at the national/regional portal and franchise levels.
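To make concrete just how low the barrier is, here is a minimal sketch of the parsing half of a scraper, using only the Python standard library. The listing markup is hypothetical (invented for illustration, not taken from any real site); a real scraper would first download thousands of such pages with a script or headless browser, then parse them exactly like this:

```python
from html.parser import HTMLParser

# Hypothetical listing markup standing in for a downloaded results page.
PAGE = """
<ul>
  <li class="listing">123 Main St - $350,000</li>
  <li class="listing">456 Oak Ave - $425,000</li>
</ul>
"""

class ListingScraper(HTMLParser):
    """Collects the text of every <li class="listing"> element."""
    def __init__(self):
        super().__init__()
        self.in_listing = False
        self.listings = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "listing") in attrs:
            self.in_listing = True

    def handle_data(self, data):
        if self.in_listing and data.strip():
            self.listings.append(data.strip())
            self.in_listing = False

scraper = ListingScraper()
scraper.feed(PAGE)
print(scraper.listings)
```

A few dozen lines like these, pointed at a site’s search results and run in a loop, is all it takes to siphon an entire listing database – which is why basic defenses alone are no longer enough.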
Clareity Consulting had wanted to solve the scraping problem for a long time, but before Realtor.com brought it up there hadn’t been much evidence that the issue was serious – or any evidence of demand for a solution. Late last year, Clareity Consulting surveyed MLS executives, many of whom had seen the Realtor.com presentation, and 93% expressed interest in a solution. Some industry leaders also stepped up with strong opinions advocating steps to stop content theft:
“It is not so much about protecting the data itself but protecting the copyright to the data. If you don’t enforce it, the copyright does not exist.”
- Russ Bergeron
“I am opposed to anybody taking, just independently, scraping data or removing data without permission… We have spent millions of dollars and an exorbitant amount of effort to get that data on to our sites.”
- Don Lawby, Century 21 Canada CEO
The problem didn’t seem to be stopping – in 2012 (and still, in 2013) people continued to advertise for freelancers to create NEW real estate screen scrapers on sites like elance.com and freelancer.com. And, of course, many scrapers aren’t foolish enough to advertise their illegal activities at all. So, Clareity began working to figure out the answer.
There were six main criteria on which Clareity evaluated the many solutions on the market. We needed to find a solution that:
1. is incredibly sophisticated to stop today’s scrapers,
2. scales both “up” to the biggest sites and “down” to the very smallest sites,
3. is very inexpensive, especially for the smallest sites – if there is any hope of an MLS “mandate”,
4. is easy to implement and provision for all websites,
5. is incredibly reliable and high-performing, and
6. is part of an industry-wide intelligence network.
Most of those criteria, with the exception of the last one, should be self-explanatory. The idea of an “industry-wide intelligence network” is that once a scraper is identified by one website, that information is shared so the scraper can’t simply move on to the next website, forcing each site to spend additional time detecting and blocking it all over again.
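The intelligence-network idea can be sketched in a few lines. This is an illustration of the concept only (the function names, IPs, and in-memory store are all hypothetical – a real network would share signatures far richer than IP addresses, over a secure service):

```python
import time

# Hypothetical shared blocklist: once one site flags a scraper,
# every participating site can block it immediately instead of
# spending days re-detecting the same actor on its own.
shared_blocklist = {}  # ip -> time the ip was reported

def report_scraper(ip):
    """Called by any participating site that detects scraping from `ip`."""
    shared_blocklist[ip] = time.time()

def is_blocked(ip):
    """Checked by every participating site before serving a request."""
    return ip in shared_blocklist

# Site A detects and reports a scraper...
report_scraper("203.0.113.9")
# ...and Site B blocks it on its very first request, with no detection delay.
print(is_blocked("203.0.113.9"))   # True
print(is_blocked("198.51.100.7"))  # False
```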
Clareity evaluated many solutions. We looked at software solutions, but they can’t be integrated the same way into all sites, and the customization cost and effort would make them untenable. We looked at hardware solutions, but those require rack space, installation, and different integration with different firewalls, servers, etc. – so they won’t work either, at least for most website owners and hosting scenarios. We looked at tools some sites already had in place – software that did basic rate limiting and other such detections, as well as some intrusion detection (“IDS”) systems – but none could reliably detect today’s sophisticated scrapers or adapt to their evolution. The biggest problem we found was COST: we knew that for most website owners even TWO figures per month would be untenable, and all the qualified solutions on the market ranged from three to five figures per month.
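To see why basic rate limiting falls short, here is a minimal sliding-window limiter of the kind many sites already had in place (thresholds and IPs are hypothetical). It stops one aggressive IP, but a scraper that rotates through many IPs, or simply slows down, slips under the threshold:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # look-back window
MAX_REQUESTS = 30     # per-IP budget within the window

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip, now):
    """Basic per-IP sliding-window rate limiting."""
    window = hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()          # drop requests outside the window
    if len(window) >= MAX_REQUESTS:
        return False              # over budget: block
    window.append(now)
    return True

# One aggressive IP gets cut off...
print(all(allow_request("203.0.113.9", t) for t in range(30)))    # True
print(allow_request("203.0.113.9", 30))                           # False
# ...but the same volume spread across rotating IPs sails through.
print(all(allow_request(f"10.0.0.{i}", 31) for i in range(100)))  # True
```

The last line is exactly the evasion today’s scrapers use – distributed requests from many addresses – which is why per-IP rate limiting alone no longer suffices.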
Finally, we had a long conversation with Rami Essaid, the CEO of Distil Networks. Distil Networks met many of our criteria. They are a U.S. company with a highly redundant U.S. infrastructure (think 15+ data centers and several different cloud providers), allowing for not only high reliability but an improvement in website speed. What they provide is a “CDN” (Content Delivery Network), just like most large sites on the Internet use to improve performance – but this one also monitors for scraping. We think of it as a “Content Protection Network,” or “CPN.” Implementation is as easy as re-pointing the domain name’s DNS. They also have a “behind the firewall” server solution for the largest sites – more like what Realtor.com uses. Most importantly, once Clareity Consulting described the challenge and opportunity for our industry, they worked to tailor both a unique solution and pricing for our unique industry challenge. If adopted, this custom solution would let Clareity monitor industry trends and help the industry take action against the worst bad actors.
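For readers curious what “re-pointing the DNS” looks like in practice, here is a hypothetical zone-file fragment (the domain and hostnames are invented for illustration – each vendor supplies its own target hostname):

```
; Before: the site's record points directly at the origin server
www.examplemls.com.   3600   IN   A       198.51.100.10

; After: a CNAME routes all traffic through the protection network's edge,
; which filters bot traffic before proxying legitimate requests to the origin
www.examplemls.com.   3600   IN   CNAME   edge.protection-network.example.
```

Because the cut-over is a single DNS change, no code on the website itself has to be modified.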
Some MLSs have already successfully completed a “beta” and seen the benefits – both blocking scraper “bots” from their websites and gaining performance. More than a dozen other MLSs have now started free trials and will be considering the best way to have all subscribers enroll their websites as a reasonable step toward protecting the content.
If organized real estate actually organizes around this solution, allowing us to collect the data to stop the scrapers and go after the worst offenders, we will be able to get our arms around this problem once and for all.
For more information: