Friday, September 22, 2006

Spidering With Ruby

If your boss suddenly asks you to convert an HTML-only real estate listings site, with more than 10 years of accumulated information, into a database-powered site, what do you do? This means that more than 150,000 HTML pages, each describing a house or apartment and its sale/rent/purchase information, must be inserted into a database.

What did I do? I wrote a Ruby spider to scrape the pages and insert their contents into a database. What did I learn? Ruby is an amazing language, with the flexibility to handle bad HTML (i.e. pages edited by hand, with lots of missing closing tags, etc.) and the power to process more than 150,000 of them in a reasonable amount of time.

The first page to read to see how this is done is the Cafe-Fetcher.

Next, some more advanced explanations of the different approaches we can take with Ruby:

After reading these two excellent examples I chose WWW::Mechanize because it has everything that is needed. It used to rely on REXML for HTML parsing and XPath manipulation, but as of version 0.6.0 it uses the faster Hpricot library.

So how does this work, in short? Mechanize is like a programmable browser, with support for cookies, history, etc. You use Mechanize to navigate the site until you reach the particular page you are interested in (i.e. by submitting forms, clicking links, etc.). The page that Mechanize returns is in fact an Hpricot object that holds the parsed HTML and gives you access to the whole DOM structure. You can also use XPath to reach specific parts of the HTML, for example the first column of the third row in the second table of the page.
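To make the XPath example concrete, here is a minimal sketch of that "second table, third row, first column" query. It uses REXML from the Ruby standard library instead of Hpricot so that it runs without extra gems, and it assumes well-formed markup, which real hand-edited pages often are not. The sample HTML and the variable names are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical, well-formed sample markup; a real page would come from Mechanize.
html = <<~HTML
  <html><body>
    <table>
      <tr><td>ignored</td></tr>
    </table>
    <table>
      <tr><td>House A</td><td>For sale</td></tr>
      <tr><td>House B</td><td>For rent</td></tr>
      <tr><td>Apartment C</td><td>For sale</td></tr>
    </table>
  </body></html>
HTML

doc = REXML::Document.new(html)

# XPath positions are 1-based: second table, third row, first cell.
cell = REXML::XPath.first(doc, "/html/body/table[2]/tr[3]/td[1]")
first_cell_text = cell.text

puts first_cell_text
```

The same path expression should carry over to any XPath-capable parser; only the call that evaluates it changes.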

With these tools, some patience, and some time, you should be able to automatically scrape the information you need from almost any page on the Internet. Make sure you learn XPath, as it is a powerful query language for getting at the data in HTML pages. It is something like SQL for XML.

Warning: There is a small problem with the Mechanize history feature. If you use a single Mechanize object to scrape lots of pages, you will see your process's RAM usage grow non-stop. This is because Mechanize stores each page it downloads in a history array, so these pages never get garbage collected. To avoid this, simply set the history_max value of the Mechanize object like this:

agent = WWW::Mechanize.new
agent.history_max = 10  # keep only the last 10 pages in the history

Never set it to zero, as that will cause problems when submitting forms.
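To see why a capped history keeps memory flat, here is a small plain-Ruby sketch of the idea. BoundedHistory is a hypothetical stand-in written for this post, not Mechanize's actual class, and the strings stand in for downloaded pages.

```ruby
# Hypothetical stand-in for a page history; not Mechanize's real implementation.
class BoundedHistory
  attr_reader :pages

  def initialize(max_size)
    @max_size = max_size
    @pages = []
  end

  # Store a page, then drop the oldest entries beyond max_size
  # so they become eligible for garbage collection.
  def push(page)
    @pages << page
    @pages.shift while @pages.size > @max_size
    self
  end
end

history = BoundedHistory.new(10)
1000.times { |i| history.push("page #{i}") }

puts history.pages.size   # 10
puts history.pages.first  # page 990
```

Without the cap, all 1,000 entries would stay referenced by the array for as long as the object lives.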


XPath Tutorial
