Create a quick and dirty web crawler with ruby

A couple of days ago, I started a new security assessment on a website I had never seen before. If you remember a post from last year, the first task is to crawl the website looking for interesting pages.

The crawler in that post was very basic. I needed something more sophisticated that will eventually become part of the codesake project.

## Tell me more about the target

The first version of the crawler just used the anemone rubygem to print page URLs.

``` ruby the first version of my custom made crawler
require 'anemone'

Anemone.crawl(target) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end
```

Of course this works, but it gives me no information about the page content,
or about whether the website answers my requests with a particular error code.

I needed to upgrade my code with three basic functionalities:

* basic HTML parsing to find forms in the retrieved web page
* HTTP return code handling
* persistence: I need to save my data in a usable format

I also told anemone to stop at site depth equal to 2. Of course this can be
easily turned into a command line parameter.
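As a minimal sketch of that idea, the depth could be read with Ruby's standard `optparse` library. The `parse_depth` helper, the `--depth` flag name and the `crawler.rb` script name are assumptions made for illustration, not part of the crawler itself.

```ruby
require 'optparse'

# Hypothetical helper: reads an optional --depth N flag from argv and
# falls back to 2, the value hard-coded in the crawler.
def parse_depth(argv, default = 2)
  depth = default
  OptionParser.new do |opts|
    opts.banner = "Usage: crawler.rb [options] site"
    opts.on("-d", "--depth N", Integer, "Maximum crawl depth") do |n|
      depth = n
    end
  end.parse(argv) # non-destructive: argv itself is left untouched
  depth
end
```

The result of `parse_depth(ARGV)` could then be passed to anemone as the `:depth_limit` value.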

It all starts with:

``` ruby the new crawler anemone loop start
Anemone.crawl("#{site}", :discard_page_bodies => true, :depth_limit=>2) do |anemone|
  anemone.on_every_page do |page|
    # per-page handling goes here, see the following sections
  end
end
```

## Basic HTML parsing

After retrieving the page, I read the HTML body with one of two small routines, depending on whether the URL uses HTTP or HTTPS. This is quick and dirty code, so please forgive me for not spending too much time on refactoring.

``` ruby basic html parsing for my crawler
res = read_http(page.url) if page.url.instance_of?(URI::HTTP)
res = read_https(page.url) if page.url.instance_of?(URI::HTTPS)

if res.code.to_i == 200
  doc = Nokogiri::HTML(res.body)
  puts "#{page.url} (depth: #{page.depth}, forms:#{doc.search("//form").count}) "
end
```

Of course I used the nokogiri gem to parse
the HTML code. There is nothing magic in the code that reads the page.

I split it into two routines mostly to avoid dealing with too much complexity
in a single piece of code. You can improve this code in tons of ways.

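``` ruby the two routines reading the page
require 'net/http'

# plain HTTP: a single GET request
def read_http(url)
  uri = URI(url)
  Net::HTTP.get_response(uri)
end

# HTTPS: same request over a connection with SSL enabled
def read_https(url)
  response = nil
  uri = URI(url)

  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.start do |conn|
    # reuse the SSL connection opened by start
    response = conn.get(uri.request_uri)
  end
  response
end
```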
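
One possible improvement is to merge the two routines into one that picks SSL from the URL scheme. This is only a sketch: `build_client` and `read_page` are hypothetical names, and the crawler described here does not use them.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a Net::HTTP client for the given URL and
# enables SSL only when the scheme is https. No connection is opened here.
def build_client(url)
  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http
end

# Hypothetical replacement for read_http/read_https: one GET request,
# whatever the scheme is. Returns the Net::HTTPResponse.
def read_page(url)
  uri = URI(url)
  build_client(uri).start do |conn|
    conn.get(uri.request_uri)
  end
end
```

With something like this, the crawler loop could call `res = read_page(page.url)` without checking the class of the URL first.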

## HTTP response code handling

I need to know if a referenced page is a broken link or if it raises an application error. It may be old code that is no longer referenced, buggy code, or whatever.

Even more interesting is when the page redirects me to another location. That way I may eventually find another URL.

I also warn the user if the website asks for credentials with basic authentication. That is an interesting place to try brute forcing the password of some very well known user.

``` ruby http response code handling in my crawler loop
puts "#{page.url} is a redirect to #{res['location']}" if res.code.to_i == 301

if res.code.to_i == 200
  doc = Nokogiri::HTML(res.body)
  puts "#{page.url} (depth: #{page.depth}, forms:#{doc.search("//form").count}) "
end

puts "#{page.url} was not found" if res.code.to_i == 404
puts "#{page.url} requires authorization" if res.code.to_i == 401
puts "#{page.url} returns an application error" if res.code.to_i == 500
```
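
The same checks could also be grouped in a single `case` statement. The sketch below assumes a hypothetical `describe_response` helper. Treating 302 like 301 and printing a generic message for any other code are assumptions of this sketch, not something the crawler does.

```ruby
# Hypothetical helper: maps an HTTP status code (and the Location header,
# when there is one) to a message like the ones printed by the crawler.
def describe_response(url, code, location = nil)
  case code.to_i
  when 200      then "#{url} is reachable"
  when 301, 302 then "#{url} is a redirect to #{location}"
  when 401      then "#{url} requires authorization"
  when 404      then "#{url} was not found"
  when 500      then "#{url} returns an application error"
  else               "#{url} answered with code #{code}"
  end
end
```

Inside the loop it might be called as `puts describe_response(page.url, res.code, res['location'])`.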

## Persistence

I do love the datamapper ORM, so I used it to store
the retrieved URLs in a SQLite database.

``` ruby my datamapper model for urls
require 'dm-sqlite-adapter'

class Url
  include DataMapper::Resource

  property  :id,          Serial
  property  :url,         Text,       :required=>true
  property  :code,        Integer
  property  :redirect,    Text
  property  :depth,       Integer
  property  :forms,       String,     :length => 256
  # defaults are lambdas so the timestamp is computed for each new record
  property  :created_at,  DateTime,   :default=>lambda { |resource, property| DateTime.now }
  property  :updated_at,  DateTime,   :default=>lambda { |resource, property| DateTime.now }
end
```

``` ruby saving my urls if not already there
if ! Url.first(:url=>page.url.to_s)
  u = Url.new
  u.url = page.url.to_s
  u.depth = page.depth
  # comma separated list of form names, "nonamed" for forms without a name
  u.forms = doc.css("form").map{ |a| (a['name'].nil?)? "nonamed":a['name'] }.join(",") unless doc.nil?
  u.code = res.code.to_i
  u.redirect = res['location'] if res.code.to_i == 301

  ret = u.save
  saved += 1 if ret
  if ! ret
    puts "#{page.url} not saved"
    u.errors.each do |e|
      puts " * #{e}"
    end
  end
end
```

## Off by one

Of course you can improve the crawler in millions of ways. I will add some information about the detected forms, and I will raise an alert when a URL also accepts parameters in the query string.
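
The query string check could be sketched with Ruby's standard `uri` library. The `query_parameters` helper is a hypothetical name used for illustration, and the real crawler may end up doing this differently.

```ruby
require 'uri'

# Hypothetical helper: returns the parameter names found in the query
# string of a URL, or an empty array when there is no query string.
def query_parameters(url)
  query = URI(url).query
  return [] if query.nil? || query.empty?
  URI.decode_www_form(query).map { |name, _value| name }.uniq
end
```

In the crawler loop, a non-empty result for `query_parameters(page.url)` would be the signal to print the alert.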

It could also be interesting to detect comments in the HTML code and to grab all the javascript files referenced by the project.

The new crawler I’m hacking on is hosted on as usual.

Image courtesy by Martin LaBar
