I'm looking for more optimizations, but it seems the bots got tired for today. Should we expect another visit tomorrow?
-----
If it's war they want, it's what they'll get..
Banning bad robots from the site
This is done with a few lines in the .htaccess file. This file contains directives for the web server, which are used in this case to redirect all accesses from bad robots to one page that contains a short explanation of why the robot has been banned from the site.
There are two ways to ban a robot: either by banning all accesses from a particular site, or by banning all accesses that use a specific ID to access the server. Most browsers and robots identify themselves whenever they request a page. Internet Explorer, for example, uses "Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)", which must be interpreted as "I'm a Netscape browser - well, actually I'm just a compatible browser named MSIE 4.01, running on Windows 98" (a Netscape browser identifies itself with "Mozilla"). In both cases the following lines are used at the beginning of the .htaccess file (note: this works with recent Apache web servers; other servers may need other commands):
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
To ban all access from IP numbers 209.133.111.* (this is the imagelock company) use
RewriteCond %{REMOTE_HOST} ^209\.133\.111\.
RewriteRule ^.*$ X.html [L]
which means: if the remote host has an IP number that starts with 209.133.111, rewrite the file name to X.html and stop rewriting.
If you want to ban a particular robot or spider, you need its name (check your access log). To ban the Inktomi spider (called Slurp), you can use
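One detail worth checking in a host rule like this is the dots: in a regular expression an unescaped dot matches any character, so a pattern written as ^209.133.111..* is a little looser than it looks. The Python sketch below is only an illustration (Apache uses its own regex engine, and the helper name host_matches is made up here). It compares the unescaped pattern with an escaped one.

```python
import re

# Pattern with unescaped dots: every "." matches any single character.
LOOSE = re.compile(r"^209.133.111..*")
# Escaped variant: "\." matches only a literal dot.
STRICT = re.compile(r"^209\.133\.111\.")


def host_matches(pattern: re.Pattern, remote_host: str) -> bool:
    """Return True if the pattern matches remote_host (unanchored search, as in RewriteCond)."""
    return pattern.search(remote_host) is not None


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to show where the two patterns differ.
    for host in ("209.133.111.7", "209.133.1117.example", "209x133y111z", "10.0.0.1"):
        print(host, host_matches(LOOSE, host), host_matches(STRICT, host))
```

Both patterns accept addresses in 209.133.111.*, but only the loose one also accepts strings where something other than a dot separates the numbers.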
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ X.html [L]
In order to ban several hosts and/or spiders, use
RewriteCond %{REMOTE_HOST} ^209\.133\.111\. [OR]
RewriteCond %{HTTP_USER_AGENT} Spider [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ X.html [L]
Note the "[OR]" after each but the last RewriteCond.
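Without [OR], consecutive RewriteCond lines must all match before the rule fires; with [OR], one match is enough. To make that logic concrete, here is a rough Python model of the combined rule. It is a sketch, not how Apache works internally: the function name is_banned and the table layout are invented for illustration, and matching is case-sensitive, as it is for a RewriteCond without the [NC] flag.

```python
import re

# Each entry mirrors one RewriteCond: (request value to test, pattern).
# Patterns are taken from the rules above; the dots in the host pattern are escaped.
BAN_CONDITIONS = [
    ("remote_host", re.compile(r"^209\.133\.111\.")),
    ("user_agent", re.compile(r"Spider")),
    ("user_agent", re.compile(r"Slurp")),
]


def is_banned(remote_host: str, user_agent: str) -> bool:
    """True if any single condition matches, which is what the [OR] chain expresses."""
    request = {"remote_host": remote_host, "user_agent": user_agent}
    return any(pattern.search(request[field]) for field, pattern in BAN_CONDITIONS)


if __name__ == "__main__":
    # Hypothetical requests for illustration.
    print(is_banned("209.133.111.9", "Mozilla/4.0"))   # banned by host
    print(is_banned("10.1.2.3", "Slurp/2.0"))          # banned by user agent
    print(is_banned("10.1.2.3", "Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)"))
```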
The Robot Trap
Three traps are set on this web site:
* Trap to catch robots that ignore the robots.txt file
This site has a special directory that contains only one file. This directory is mentioned in the robots.txt file, and therefore no robot should ever access that specific file.
In order to annoy robots that read that file anyway, it contains special links and commands that make a robot think there are other important files in that directory. Thanks to a special .htaccess file, all those other files actually point to the same file. Besides, the file always takes at least 20 seconds to load, without using resources on the server.
The .htaccess file looks as follows:
RewriteEngine on
Options +FollowSymlinks
RewriteBase /
RewriteRule ^.*\.html /botsv/index.shtml
ErrorDocument 400 /botsv/index.shtml
ErrorDocument 402 /botsv/index.shtml
ErrorDocument 403 /botsv/index.shtml
ErrorDocument 404 /botsv/index.shtml
ErrorDocument 500 /botsv/index.shtml
The special file, named index.shtml, uses server-side includes. Its main parts are:
<html><head><title>You are a bad netizen if you are a web bot!</title></head>
<body><h1><b>You are a bad netizen if you are a web bot!</b></h1>
<!--#config timefmt="%y%j%H%M%S" --> <!-- format of the date string -->
<!--#exec cmd="sleep 20" --> <!-- make this page sloooow to load -->
To give robots some work, here are some special links:
these are <a href=a<!--#echo var="DATE_GMT" -->.html>some links</a>
to this <a href=b<!--#echo var="DATE_GMT" -->.html>very page</a>
but with <a href=c<!--#echo var="DATE_GMT" -->.html>different names</a>
The effect is that each robot that hits this page will see new links and request the same page over and over again. Thanks to the 20-second delay the server should not get too busy (unless the robot makes many requests at the same time, but that would be a very bad robot indeed).
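The loop can be modelled in a few lines of Python. This is only a sketch of the idea, not the site's code: trap_links and rewrite are made-up names, the timestamp format is the one from the #config timefmt line, and rewrite imitates the RewriteRule from the trap directory's .htaccess file.

```python
import re
from datetime import datetime, timezone

TIMEFMT = "%y%j%H%M%S"  # year, day of year, hour, minute, second


def trap_links(now: datetime) -> list[str]:
    """Build the three link names the page would show at the given time."""
    stamp = now.strftime(TIMEFMT)
    return [f"{prefix}{stamp}.html" for prefix in ("a", "b", "c")]


def rewrite(path: str) -> str:
    """Send every name ending in .html (or containing it) to the one real trap page."""
    return "/botsv/index.shtml" if re.search(r"^.*\.html", path) else path


if __name__ == "__main__":
    first = trap_links(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
    later = trap_links(datetime(2024, 1, 2, 3, 4, 25, tzinfo=timezone.utc))
    print(first)   # link names seen on the first visit
    print(later)   # different names 20 seconds later
    print({rewrite(name) for name in first + later})  # all resolve to one file
```

Because the names change every second, a robot that keeps a list of visited URLs never recognizes that it is fetching the same page again.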
* Trap to catch robots that misuse the robots.txt file
This site has a special directory with the same properties and files as the one above, except that there is no link to it at all. This directory is only mentioned in the robots.txt file, and therefore no robot should ever access that specific file unless it reads the robots.txt file.
Marc has written a program that will automatically ban access to sensitive directories for all clients that access the robots.txt file. I have not tested it though.
* Traps to catch robots that slurp up email addresses
Each of the two files above, plus an additional one that is plainly visible, contains an email address that is generated anew for each robot. If that address is ever used, it is trivial to find out who slurped the email address and then block it. To generate email addresses I use:
here is an email address you had better not use:
<a href=mailto:bot.<!--#echo var="DATE_GMT" -->@ars.net>bot.<!--#echo var="DATE_GMT" -->@ars.net</a>. To make other robots happy as well,
This assumes that the file contains a #config timefmt line like the one shown in index.shtml above.
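Finding out who slurped an address comes down to reading the timestamp back out of it and looking up that moment in the access log. A minimal Python sketch of the decoding step follows. It assumes the same time format as in index.shtml; the function names are hypothetical, and the two-digit year is interpreted by Python's own rules.

```python
from datetime import datetime

TIMEFMT = "%y%j%H%M%S"  # year, day of year, hour, minute, second


def trap_address(now: datetime) -> str:
    """Build an address of the form bot.<timestamp>@ars.net for the given time."""
    return f"bot.{now.strftime(TIMEFMT)}@ars.net"


def harvest_time(address: str) -> datetime:
    """Recover the time at which the page containing this address was served."""
    local_part = address.split("@", 1)[0]   # e.g. "bot.24002030405"
    stamp = local_part.split(".", 1)[1]     # drop the "bot." prefix
    return datetime.strptime(stamp, TIMEFMT)


if __name__ == "__main__":
    served = datetime(2024, 1, 2, 3, 4, 5)
    address = trap_address(served)
    print(address)
    print(harvest_time(address))  # look for this second (GMT) in the access log
```

Since the page sleeps for 20 seconds before it is sent, the matching log entry may carry a slightly different time than the one encoded in the address.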
Download the traps
If you want to install the traps you can download them here:
http://www.fleiner.c...ts/robotrap.zip
from: fleiner.com/bots/