MSNbot still overspidering

Posted 19:58, 17/3/2007, in Web

I've been monitoring the traffic to Archivist quite a bit recently. Archivist is a publicly searchable mailing list archive: you subscribe the system's email address to your mailing list and all posts automagically appear on the site (threaded and searchable).

Because Archivist is basically a text-only site, the search engine robots love it, and the majority of the site's traffic comes from search engine referrals. And because of the archival nature of the site, most of the pages never change, so we send appropriate Last-Modified HTTP headers to aid caching and help keep bandwidth usage down.
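
For anyone unfamiliar with how this saves bandwidth, here is a minimal sketch of the server side of a conditional GET, written in Python purely for illustration (it is not Archivist's actual code, and the function name `conditional_response` is made up). A well-behaved robot sends back the Last-Modified date it saw last time in an If-Modified-Since header, and the server can answer with a body-less 304 instead of the full page.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at `last_modified`.

    `last_modified` must be a timezone-aware datetime. `if_modified_since`
    is the raw request header value, or None if the client sent none.
    """
    # HTTP dates are in GMT with one-second resolution.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: just send the full page
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                return 304, headers  # Not Modified: no body is sent

    return 200, headers  # full response, body included
```

A robot that never sends If-Modified-Since always takes the last branch and downloads the whole page again, however old it is.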

Unfortunately, unlike all the other major robots, MSNBot completely ignores these and is constantly indexing the same content over and over again. It doesn't take long to find proof of this; here's the robot traffic from April '07:

Screenshot of Archivist’s robot activity

So, over this time period MSN has made only about 50% more requests than Googlebot, but has used more than six times the bandwidth. (The number after the + is the number of hits to the robots.txt file, for those who aren't familiar with AWStats.)
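
To put those two figures together, here is a quick back-of-the-envelope calculation. It uses only the rounded ratios quoted above (1.5 times the requests, 6 times the bandwidth), not the exact AWStats numbers, so treat the result as a rough lower bound.

```python
def bytes_per_request_ratio(request_ratio, bandwidth_ratio):
    """How much more data one robot pulls per request than another.

    Both arguments are ratios relative to the same baseline robot
    (Googlebot here).
    """
    return bandwidth_ratio / request_ratio


# MSN vs. Googlebot: 1.5x the requests, (at least) 6x the bandwidth.
ratio = bytes_per_request_ratio(request_ratio=1.5, bandwidth_ratio=6.0)
print(f"MSN transfers about {ratio:.0f}x as much per request")
```

Roughly four times the data per request is what you would expect if one robot is mostly getting small 304 responses while the other is fetching full pages every time.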

At the same time MSN provides just 0.4% of the site's search engine referrals (Google is 97.6%). With numbers like this, it's hard to justify not blocking MSN completely.
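
If it does come to that, the block would be a couple of lines in robots.txt. The sketch below checks such a file with Python's standard `urllib.robotparser`. The `msnbot` user-agent token and the example.org URL are assumptions for illustration, so check the robot's own documentation and your logs before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out MSNBot, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """True if `agent` may fetch `url` under the rules above."""
    return parser.can_fetch(agent, url)


page = "http://example.org/list/some-thread"  # placeholder URL
print("msnbot allowed:", allowed("msnbot", page))
print("Googlebot allowed:", allowed("Googlebot", page))
```

This only works for robots that honour robots.txt in the first place. The hits after the + in the AWStats figures suggest MSNBot does at least fetch the file.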
