Every day you have entries in your server logs, telling you about someone requesting a file called
This week, we'll talk about spiders and robots: what they do, why they are a good thing, why they can be a bad thing, and how you can prepare for them.
What are Robots and Spiders?
A robot, also called a "bot," a spider, a web crawler, and a variety of other names, is a program that automatically downloads web pages for a variety of purposes. Because it downloads one page, and then recursively downloads every page that is linked to from that page, one can imagine it crawling around the web, harvesting content. From these images come some of the names that these programs are called, as well as some of the names of particular spiders, like WebCrawler, Harvester, MomSpider, and so on.
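The "download a page, then follow every link on it" behaviour can be sketched in a few lines. This is only an illustration in Python (my choice of language, not anything a particular spider uses), and it walks a made-up in-memory table of links rather than the real web. The names LINKS, crawl, and max_pages are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the pages it links to.
LINKS = {
    "/index.html": ["/about.html", "/news.html"],
    "/about.html": ["/index.html"],
    "/news.html": ["/about.html", "/archive.html"],
    "/archive.html": [],
}

def crawl(start, links, max_pages=100):
    """Visit start and every page reachable from it, each page at most once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)           # a real robot would download the page here
        for target in links.get(page, []):
            if target not in seen:   # skip pages already queued or visited
                seen.add(target)
                queue.append(target)
    return order

print(crawl("/index.html", LINKS))
```

The set of pages already seen keeps the robot from requesting the same page twice, and the max_pages cap is one simple way to make sure the crawl always ends.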
What they do with these web pages once they have downloaded them varies from application to application. Most of these spiders are doing some variety of search, while some exist so that people can read online content when they are not actually connected to the internet. Others pre-cache content, so that end-users can more rapidly access that content. Still others are just trying to generate statistical information about the web.
Spiders are Good
Spiders are a good thing. Without them, things like Yahoo, Google, Altavista, and so on, would not be possible. By indexing the web with a spider, these sites permit us to search a collection of documents that it is not feasible to search by ourselves. I remember a few years ago when Yahoo claimed to index "more than 1 million" web pages. That number is now probably in the billions.
Spiders can get us personalized information. There are services that will deliver to your mailbox every morning a personalized newspaper, containing only news items that you have expressed interest in, from the online newspapers that you have specified. This is done with robots that download those web sites and comb through them for the information that you have requested. This would be extremely difficult and expensive if it were done by actual people, or with actual paper newspapers.
Of course, occasionally, spiders can cause real problems with your web site. Poorly written or poorly managed spiders can download pages from your site far faster than anyone could click on links, and can bring your web server to a grinding halt as it tries to keep up with a robot requesting thousands of pages per second.
Usually, this will only happen if someone has been careless in configuring a robot, but occasionally it's just an incompetent programmer who has tried to be clever and write their own robot, but doesn't know what they are doing.
Most of the time, however, spiders are great, and you want them crawling around your site, so that your site ends up on the search engines.
But there are parts of your web site where you don't want them going. There might be content that you don't really want to be in search engines. Or, perhaps there is a part of your web site that is dynamically generated, and so it would be a little silly to index it, because it will be different the next time.
Even worse, since some dynamically generated pages have links to other dynamically generated pages, the robot could become stuck in a never-ending maze of new pages, and request documents from your server forever.
In order to prevent these things from happening, the standard for robot exclusion was developed. This consists of a document called
The file looks like this:
    User-agent: SpiderName
    Disallow: /cgi-bin/dynamicstuff/
The named spider is not permitted access to the specified resources.
You can also specify an asterisk (*) instead of the spider name, to indicate that the rule applies to all spiders:
    User-agent: *
    Disallow: /cgi-bin/
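To see how a well-behaved robot might read these two files, here is a small sketch using the urllib.robotparser module from the Python standard library. Python is used here only for illustration. The agent name OtherBot and the request paths are made up, and the helper name allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The exclusion standard spells the field "User-agent".
RULES_NAMED = """\
User-agent: SpiderName
Disallow: /cgi-bin/dynamicstuff/
"""

RULES_ALL = """\
User-agent: *
Disallow: /cgi-bin/
"""

def allowed(rules, agent, path):
    """Return True if the given rules let this agent request this path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())   # parse text directly, no network access
    return parser.can_fetch(agent, path)

print(allowed(RULES_NAMED, "SpiderName", "/cgi-bin/dynamicstuff/report"))
print(allowed(RULES_NAMED, "OtherBot", "/cgi-bin/dynamicstuff/report"))
print(allowed(RULES_ALL, "OtherBot", "/cgi-bin/search"))
print(allowed(RULES_ALL, "OtherBot", "/index.html"))
```

With the first file, only the named spider is kept out of the directory. With the second, every spider is kept out of /cgi-bin/ but may still fetch the rest of the site.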
What if that does not work? (Malicious or stupid spiders)
The problem with
If you discover that a robot is going into parts of your web site which you have disallowed in
First, you should attempt to contact the person who is operating the robot. Look in the access log, and find what address the robot is coming from. Find out who is responsible for that machine. Send them a polite note requesting that they leave your site alone, and encourage them to adhere to the standard for robot exclusion.
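If the access log is large, a short script can help with the "find what address the robot is coming from" step. The sketch below assumes your log uses the Common Log Format, where the client address is the first field and the request line is in double quotes. The sample lines, the addresses in them, and the function name hits_by_host are invented for illustration.

```python
from collections import Counter

# Invented sample lines in Common Log Format.
SAMPLE = [
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /cgi-bin/a HTTP/1.0" 200 2326',
    '192.0.2.10 - - [10/Oct/2000:13:55:37 -0700] "GET /cgi-bin/b HTTP/1.0" 200 512',
    '198.51.100.7 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 1024',
]

def hits_by_host(log_lines, prefix):
    """Count requests per client address for paths starting with prefix."""
    counts = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue                    # no quoted request line, skip it
        host = line.split(" ", 1)[0]    # first field is the client address
        request = parts[1].split()      # e.g. ["GET", "/path", "HTTP/1.0"]
        if len(request) >= 2 and request[1].startswith(prefix):
            counts[host] += 1
    return counts

# Busiest clients in the disallowed area first.
for host, count in hits_by_host(SAMPLE, "/cgi-bin/").most_common():
    print(host, count)
```

To run it against a real log, pass an open file object instead of SAMPLE. An address with many hits under a disallowed prefix is a good candidate for that polite note.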
If they do not pay any attention to your request, use some of the techniques covered in last week's article (
Writing a Robot
If you want to write your own robot, you should be aware that there are already hundreds of robots available to do everything you might want to do. Consider using one of those. You can find a list, and other information about robots, at http://info.webcrawler.com/mak/projects/robots/robots.html
If you absolutely have to write your own, there are various Perl modules in the LWP package which implement much of the base functionality that you need, and will greatly reduce the effort required compared with writing one from scratch.
And please, please, implement the standard for robot exclusion. It's very very easy to write a robot, but it's a little more involved to write a good one that does not have any ill effects on the sites it visits.
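As a rough idea of what "good" means in practice, here is a sketch of the two habits that matter most: check the exclusion rules before every request, and pause between requests. It is written in Python rather than with the Perl LWP modules mentioned above, purely as an illustration. The function polite_fetch, the agent name ExampleBot, and the one-second delay are all assumptions, and the caller supplies the function that actually downloads a page.

```python
import time
from urllib.robotparser import RobotFileParser

def polite_fetch(paths, robots_lines, fetch,
                 agent="ExampleBot", delay=1.0, sleep=time.sleep):
    """Fetch each permitted path, pausing between consecutive requests.

    robots_lines: the lines of the site's exclusion file, already downloaded.
    fetch: a caller-supplied function that downloads one path.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    fetched = []
    for path in paths:
        if not parser.can_fetch(agent, path):
            continue                # the site asked robots to stay out
        if fetched:
            sleep(delay)            # give the server room between requests
        fetch(path)
        fetched.append(path)
    return fetched

# Usage with a stand-in fetch function that only prints the path.
rules = ["User-agent: *", "Disallow: /cgi-bin/"]
polite_fetch(["/index.html", "/cgi-bin/x", "/about.html"], rules, print)
```

Passing the download and sleep functions in as arguments keeps the sketch easy to try without touching a real server.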