Friday, January 12, 2007

What is robots.txt?

Before I explain the what and why of the robots.txt file, let me describe an incident that caught us off guard some time back. (For security reasons, I have left out the names.)

We used to serve real-time and delayed quotes to our customers. One of the customers wanted to provide search functionality on their web site, so they bought an indexing/searching application that had a crawler at its core. The customer had a list of the Top 10 Active stocks and their respective delayed quotes on their landing page. Some of those stocks had rapidly fluctuating prices. The crawler (which at the time we called a "stupid crawler," since it knew nothing about the robots.txt file) started re-indexing the landing page as rapidly as the values changed. For each request, the customer's application server sent ten quote requests to us. Luckily, our infrastructure was robust enough that our servers didn't go down. But at the end of two days we had thousands and thousands of quote requests, which pushed our service graph to an unprecedented level!

Well, that was the story. We figured out the problem by looking at the logs and informed the customer. Last I heard, they had disabled the search facility.

That's where robots.txt comes into the picture to save us.
robots.txt is a rules file that a well-behaved crawler reads before it crawls a particular site.
It lists the directories that crawlers should stay out of, for example those whose contents change dynamically and hence should not be indexed. It also supports crawler-specific rules. For example, if the following is present, then Googlebot wouldn't crawl any part of the site:
User-agent: googlebot
Disallow: /
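
To see how a crawler might interpret the rule above, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed, the host example.com, and the agent name "SomeOtherBot" are made up for illustration and are not part of the incident described earlier.

```python
from urllib.robotparser import RobotFileParser

# The rule from the post: keep Googlebot out of the whole site.
RULES = """\
User-agent: googlebot
Disallow: /
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules.

    Hypothetical helper for illustration; it parses the rules from a string
    instead of downloading a live robots.txt file.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com and "SomeOtherBot" are placeholders, not real endpoints.
    for agent in ("googlebot", "SomeOtherBot"):
        print(agent, is_allowed(RULES, agent, "http://example.com/quotes"))
```

With these rules the check refuses googlebot and allows the other agent, since no rule group applies to it. A real crawler would more likely point the parser at the live file with set_url() and read() before fetching any page.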
You can read much more about robots.txt here.

