July 10th, 2008

searchtools.com

x-robots-tag


In the Robots Exclusion Protocol agreement of June 2008, the leading web-wide search engines announced that they would recognize a new element in the HTTP header, the X-Robots-Tag. Google was the first to use it, followed by Yahoo, and Microsoft Live Search now supports it as well.

When a browser or robot sends a request to the web server for a URL, part of the response is the invisible HTTP header, including information about the file type, encoding, and date modified. This information is generated by the web server.

The new X-Robots-Tag, within the HTTP response header, can contain the same values as the Robots META tags: NOINDEX, NOFOLLOW, NOARCHIVE, NOODP, NOSNIPPET.
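As a rough illustration of how a robot might read these values, the sketch below splits a header value into a set of directives. The function names `parse_x_robots_tag` and `may_index` are hypothetical, and the parsing is deliberately simplified: it treats the value as a comma-separated, case-insensitive list and nothing more.

```python
def parse_x_robots_tag(value):
    """Split an X-Robots-Tag header value into a set of lowercase directives."""
    # Directives are comma-separated; case and surrounding spaces are ignored here.
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def may_index(value):
    """Return False when the header value asks robots not to index the document."""
    return "noindex" not in parse_x_robots_tag(value)
```

For example, `may_index("NOINDEX, FOLLOW")` is false, while `may_index("NOFOLLOW")` is true because that value only restricts link following.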

There are several cases where the X-Robots-Tag values will be very valuable:

  • For non-HTML documents, including plain text, XML, PDF, office documents, audio, and video files. While many of these documents are able to carry Properties information or metadata such as XMP, they rarely do, and even then, it's often incorrect or duplicated.
  • For situations where the web site publisher cannot change the content of the HTML files, but wishes to control some of the site interaction with search indexing robots.
  • Sites with large amounts of changing content, where updating individual files is too hard or expensive.

This is not something anyone can type in by hand, but it's easily added programmatically by server-side tools such as Perl, Ruby, or PHP. For simple cases, the Apache .htaccess file is easy enough to configure, as in this example, where the crawler is told not to index the robots.txt file itself:

<FilesMatch "robots\.txt">
Header set X-Robots-Tag "NOINDEX, FOLLOW"
</FilesMatch>

or to avoid following links in ".doc" files:

<FilesMatch "\.doc$">
Header set X-Robots-Tag "NOFOLLOW"
</FilesMatch>
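For the programmatic route, here is a minimal sketch of the same two rules as Python WSGI middleware rather than Apache configuration. Everything named here (`RULES`, `robots_value_for`, `XRobotsTagMiddleware`) is an assumption made for illustration, not part of any server or library; the patterns simply mirror the .htaccess examples above.

```python
import re

# Hypothetical rules: (pattern matched against the request path, header value).
RULES = [
    (re.compile(r"robots\.txt$"), "NOINDEX, FOLLOW"),
    (re.compile(r"\.doc$"), "NOFOLLOW"),
]


def robots_value_for(path):
    """Return the X-Robots-Tag value for a path, or None if no rule matches."""
    for pattern, value in RULES:
        if pattern.search(path):
            return value
    return None


class XRobotsTagMiddleware:
    """WSGI middleware that appends an X-Robots-Tag header when a rule matches."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        value = robots_value_for(environ.get("PATH_INFO", ""))

        def patched_start_response(status, headers, exc_info=None):
            # Add the header only for matching paths; leave other responses alone.
            if value is not None:
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return self.app(environ, patched_start_response)
```

Wrapping an existing WSGI application as `XRobotsTagMiddleware(app)` leaves the documents themselves untouched, which is the point of the header: the rules live outside the files they describe.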

I think this is a very clever way to add the known functionality of Robots META tags to non-HTML file formats, collated from an external metadata repository. It's likely to be particularly useful to intranet search engines, and portals which may not have access to the documents themselves.


I have added an X-Robots-Tag test suite to the SearchTools testing section and will report if I find anything interesting.

H/T to: Playing with the X-Robots-Tag; Controlling Your Robots; Handling Google's neat X-Robots-Tag

  • Current Mood
    tangential