Security and software crafting for hacking minds.
 

H4F - Use robots.txt as a Weapon With the Links Rubygem

Have you ever thought about how much information you disclose when you publish a website? To control how the site will appear in search results, webmasters create a robots.txt file telling crawlers what to consider in their indexing quest and which URLs to ignore, so they won't show up in search results.

But robots.txt is also accessible to humans, who may decide to override your Disallow clauses.

Information gathering is fun again

This is the robots.txt for the armoredcode website. It's easy: there are no secrets here, so I tell any crawler that comes along to please visit and index all the links.

armoredcode.com’s robots.txt
User-agent: *
Allow: /

For bigger sites with multiple sections, things get more interesting.

Luckily we don't want to cause trouble, but we are curious, and there is nothing wrong with looking into a publicly available text file. The important thing is that you (the webmaster) don't forget to protect those directories from unauthorized access.

I know it seems weird that people would list a directory in robots.txt while leaving that same directory browsable by anyone, but you know… the Net is a strange place and mistakes happen easily, so you may find a wordpress include directory you can navigate.

Of course, in this case the web server will fail while parsing the PHP code before disclosing source code or connection parameters, but in other situations people are not so lucky.

You may find cache files full of interesting information, or even… databases.

Using robots.txt to fingerprint your backend

It's easy to see that a large number of websites don't delete installation files or common directories. They simply put everything in the directory served by the web server and start publishing content, relying only on robots.txt to hide some of it.

This behaviour instead helps an attacker fingerprint your backend technology exactly, without even a portscan, just by looking at the robots.txt.

Looking at the /_vti_ directories, it's easy to spot a Microsoft .NET powered website, and therefore one served by a Microsoft Windows operating system. If we (as attackers) are lucky enough, those directories change over time, so it may even be possible to fingerprint the exact .NET framework version by looking at which directories are in place.

Drupal and WordPress are both easy to fingerprint by looking at the robots.txt file content.

The former can be detected by a number of text files that are sometimes still available, disclosing the exact version number. The latter is easily detected by the wp-something directories that need to be disallowed from indexing.
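The checks above can be sketched as a small Ruby helper. This is illustrative code of my own, not part of the links gem; the path patterns are just the telltale signs mentioned above.

```ruby
# Illustrative sketch (not from the links gem): guess the backend
# technology from a robots.txt body using the telltale paths above.
def guess_backend(robots_body)
  return :wordpress if robots_body =~ /^Disallow:\s*\/wp-/i
  return :drupal    if robots_body =~ /^Disallow:\s*\/CHANGELOG\.txt/i
  return :dotnet    if robots_body.include?('_vti_')
  :unknown
end

puts guess_backend("User-agent: *\nDisallow: /wp-admin/")  # prints "wordpress"
```

A real fingerprinter would of course check many more markers, but even these three cover a surprising share of the web.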

Fingerprinting the exact technology you're using gives an attacker further information to fine-tune subsequent attacks.

Please note that I'm not saying security through obscurity is good. I'm just saying that you have to remove all the files and directories you don't need, instead of thinking that making them invisible to spiders will be enough.

Let the code talk: the links rubygem

As a penetration tester, my concern is to gather as much information as possible, ideally without making too much noise.

That's why I wrote a script to automate robots.txt scanning. The script eventually grew into a full featured project and ended up as a published rubygem.

The idea is very simple: take a URL, ask for its robots.txt, parse it for Disallow entries, make an HTTP GET request for each supposedly disallowed URL, and then look at the response code.
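A minimal hand-rolled version of that flow might look like this (the method names here are my own, not the gem's API):

```ruby
require 'net/http'
require 'uri'

# Pull the Disallow paths out of a robots.txt body.
def disallowed_paths(robots_body)
  robots_body.each_line
             .grep(/\Adisallow:/i)
             .map { |line| line.split(':', 2)[1].to_s.strip }
             .reject(&:empty?)
end

# Probe each path with a GET and keep the ones the server still serves.
def probe(site, paths)
  paths.select do |path|
    res = Net::HTTP.get_response(URI.join(site, path))
    res.code == '200'
  end
end

body = "User-agent: *\nDisallow: /admin/\nDisallow: /backup/\n"
p disallowed_paths(body)  # prints ["/admin/", "/backup/"]
# probe('http://example.com/', disallowed_paths(body)) would then
# report which "hidden" sections are actually reachable.
```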

All the links code is under the Links::Api namespace.

The first relevant method is the one that fetches robots.txt. I discovered a bug in it while writing this article; I disclose it right after the listing.

Links::Api.robots - get the robots.txt file
def self.robots(site, only_disallow=true)

  if (! site.start_with? 'http://') and (! site.start_with? 'https://')
    site = 'http://'+site
  end

  list = []
  begin
    res=Net::HTTP.get_response(URI(site+'/robots.txt'))
    if (res.code != "200")
      return []
    end

    res.body.split("\n").each do |line|
      if only_disallow
        if (line.start_with?('Disallow'))
          list << line.split(":")[1].strip.chomp
        end
      else
        if (line.start_with?('Allow') or line.start_with?('Disallow'))
          list << line.split(":")[1].strip.chomp
        end
      end
    end
  rescue
    return []
  end

  list
end

The bug is that I check for the Allow and Disallow strings without considering case. I found some robots.txt files in the wild with the Disallow directive written in lowercase, which links therefore skipped.
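A possible fix is to match the directives case-insensitively. This is my own sketch of such a check, not necessarily how the gem was eventually patched:

```ruby
# Sketch only: a case-insensitive directive check, so "disallow:"
# and "Disallow:" are both accepted. Not the gem's released code.
def directive_line?(line, only_disallow = true)
  if only_disallow
    !(line =~ /\Adisallow\s*:/i).nil?
  else
    !(line =~ /\A(allow|disallow)\s*:/i).nil?
  end
end

puts directive_line?("disallow: /secret/")  # prints "true"
puts directive_line?("Allow: /", false)     # prints "true"
```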

The other public APIs are just wrappers around this method, which does the dirty work.

Links::Api.get - get the page
def self.get(url)
  begin
    uri = URI(url)
    if uri.scheme == 'http'
      res = Net::HTTP.get_response(URI(url))
    else
      request=Net::HTTP.new(uri.host, uri.port)
      request.use_ssl=true
      request.verify_mode = OpenSSL::SSL::VERIFY_NONE
      res = request.get(uri.request_uri)
    end
    return res
  rescue
    return nil
  end
end

The links rubygem, however, has another secret goal I'm working on: becoming a full featured site crawler. The API that reveals it is Links::Api.links.

Links::Api.links - get all links in a webpage
def self.links(url)
  res = Links::Api.get(url)
  if res.nil?
    return []
  end
  doc = Nokogiri::HTML.parse(res.body)
  l = doc.css('a').map { |link| link['href'] }
  l
end

What have we learnt?

This hack-for-fun post was about gathering information using the robots.txt file. As a developer you have little power here, but you can reinforce server hardening by removing all unnecessary files and directories.

  1. attackers can use the robots.txt file to discover website sections that are supposed to stay private
  2. we must carefully check whether a private webpage has been requested by a non authenticated session
  3. you can write a piece of ruby code just for fun :-)

Let's talk about this

I'm an application security specialist and this is my blog about software development, testing and security stuff. Feel free to leave a comment telling me if you liked this post or not. You can even follow @thesp0nge and @armoredcode on twitter.

If you liked this post, don't miss any armoredcode.com update. Subscribe to the rss feed or receive new posts directly in your mailbox. The service is courtesy of Google; I won't store your email address in any case.

You can discuss, upvote, downvote, and poke fun at this post over at Hacker News.

This story is also on reddit. You can comment and rate this post here.

There is an "ask me anything" area hosted on GitHub. Feel free to ask me anything.
