One of the tools WiseGuys uses to gather web data
is a web agent, also known as a spider or (web) crawler.
This is a program that automatically visits websites and gathers their
information for further use. This data can then be used, for instance, to
build one or more search engines.
If you do not want certain pages to be retrieved by such a spider
or crawler, you can express this in one of two ways.
The first method, robots.txt, only works if you can place a
file in the document root of your website. The second, meta tags, works on a
per-page basis and needs no access to the document root.
robots.txt
Most decent web agents request the file /robots.txt before visiting a
website. If this file exists, it is parsed by the web agent. WiseGuys complies
with this standard, but only if the file is present in the document root of
the website: http://www.wise-guys.nl/robots.txt will be used, but
http://www.wise-guys.nl/gfx/robots.txt will not.
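The document-root rule can be illustrated with a short Python sketch. The helper name robots_url is made up for this example, and the page URL passed to it is hypothetical; only the host comes from the text above.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the one robots.txt location that applies to page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, force the path to /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Whatever directory the page lives in, the file is looked up in the root.
print(robots_url("http://www.wise-guys.nl/gfx/logo.png"))
# prints http://www.wise-guys.nl/robots.txt
```

A file at /gfx/robots.txt is never derived from any page URL this way, which is why it has no effect.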
The complete specification of the standard, written by Martijn Koster, was
originally published on the WebCrawler website. It seems to have moved
elsewhere since, but a cached version is available. The basic idea is simple:
in a plain text file you tell the web agent which parts of your web server are
off-limits to it. These restrictions can apply to all web agents, or to
particular ones. This is useful if your pages turn up in search engines and you
do not want them there.
For example:
# /robots.txt
User-agent: webcrawler
Disallow:
User-agent: ilse
Disallow: /
User-agent: *
Disallow: /stayout
Disallow: /tmp
The first line, beginning with '#', is a comment and is ignored by web agents.
The second and third lines grant the robot called webcrawler access to the
whole site, because nothing is disallowed (hence the empty directive). On its
own, such a record could safely be omitted, as web agents by definition regard
a website as fully accessible if they find no robots.txt file or no rule that
applies to them. Here, however, it serves a purpose: without it, webcrawler
would fall under the rules for all other robots at the end of the file. The
fourth and fifth lines state that the robot called ilse may not retrieve any
files beginning with /. Since every path starts with a /, this robot cannot
visit the site at all.
The last three lines state that all other robots should not visit pages whose
paths begin with /stayout or /tmp.
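To check how a parser reads the example, the file can be fed to Python's standard urllib.robotparser module. This is only a sketch of one parser's behaviour, not of how the WiseGuys agent is implemented; the helper allowed and the page paths are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The example file from above; records are separated by blank lines here.
ROBOTS_TXT = """\
# /robots.txt
User-agent: webcrawler
Disallow:

User-agent: ilse
Disallow: /

User-agent: *
Disallow: /stayout
Disallow: /tmp
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

SITE = "http://www.wise-guys.nl"  # host from the text; the paths below are hypothetical

def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether `agent` may fetch `path` on SITE."""
    return bool(parser.can_fetch(agent, SITE + path))

for agent, path in [
    ("webcrawler", "/tmp/report.html"),    # own record, nothing disallowed
    ("ilse", "/index.html"),               # own record, everything disallowed
    ("Vagabondo", "/stayout/page.html"),   # falls under the * record
    ("Vagabondo", "/index.html"),          # * record, path not listed
]:
    print(agent, path, "allowed" if allowed(agent, path) else "blocked")
```

Note that webcrawler may fetch /tmp/report.html only because it has its own record; any agent without one is handled by the final record.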
Note some frequently made errors:
- Regular expressions and wildcards are not supported in robots.txt: you must
use Disallow: /tmp/ instead of Disallow: /tmp/* to block that directory.
- State only one path per Disallow: statement.
- Paths in a robots.txt file are like paths in a URL; in particular, they are
case-sensitive, so they distinguish between capital and small letters.
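The three errors above all follow from the way a strictly standard-following agent matches paths: as a plain, case-sensitive prefix comparison. The function is_disallowed below is a hypothetical helper written for this illustration; it is an assumption about such an agent, not the code of any particular crawler.

```python
from typing import Iterable

def is_disallowed(path: str, rules: Iterable[str]) -> bool:
    """Plain prefix match: case-sensitive, no wildcards, one path per rule."""
    # An empty rule disallows nothing, so it is skipped.
    return any(rule != "" and path.startswith(rule) for rule in rules)

# Correct rule: everything under /tmp/ is blocked.
print(is_disallowed("/tmp/file.html", ["/tmp/"]))          # True

# The '*' is taken literally, so the rule matches no real page.
print(is_disallowed("/tmp/file.html", ["/tmp/*"]))         # False

# Two paths on one line form a single rule that matches neither path.
print(is_disallowed("/stayout/a.html", ["/stayout /tmp"])) # False

# Paths are case-sensitive: /TMP/ is not /tmp/.
print(is_disallowed("/TMP/file.html", ["/tmp/"]))          # False
```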
Meta tags
Instead of a global robots.txt file for an entire website, you can also
tell web agents, page by page, whether that page may be included in a search
engine. You can also state whether you want the spider to follow the links
on the page. This is particularly useful if your web page queries a database
that you do not want a spider to visit over and over again. The use of meta
tags is part of the HTML 4 standard, and information about this standard can be
read at the WWW Consortium website. A somewhat less technical explanation can
be found here. Using these meta tags is extremely simple. In the
HEAD section of a page, add an extra directive:
<meta name="robots" content="noindex">
With this line, you tell the robot not to index the page (not to include it
in its search engine). The links on the page, however, will still be followed.
Add nofollow to prevent this as well:
<meta name="robots" content="noindex, nofollow">
Using nofollow without noindex means that the page itself will be indexed,
but the pages it links to will not be visited.
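How a robot might read these directives can be sketched with Python's standard html.parser module. The class RobotsMetaParser and the function robots_policy are names invented for this example, and the defaults (index and follow when no tag is present) are an assumption in line with the behaviour described above.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the noindex/nofollow directives from <meta name="robots">."""

    def __init__(self):
        super().__init__()
        self.index = True    # assumed default: the page may be indexed
        self.follow = True   # assumed default: links may be followed

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        directives = {d.strip().lower() for d in content.split(",")}
        if "noindex" in directives:
            self.index = False
        if "nofollow" in directives:
            self.follow = False

def robots_policy(html: str):
    """Return (may_index, may_follow) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(robots_policy(page))  # (False, False)
```

With content="noindex" the result is (False, True), and with content="nofollow" it is (True, False), matching the three cases in the text.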
WiseGuys webagent
The WiseGuys web agent primarily visits Dutch (.nl) websites; WAP sites,
however, are visited worldwide. The robot identifies itself with the following
text string: Vagabondo. An alternate identification is: Bilbo.
This is an older web agent that is still in use mainly for WAP. These robots
mainly gather data for parties including Track.nl and Kobala.nl.
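The two identification strings can be used in a robots.txt record to address only the WiseGuys robots. The sketch below tests such a record with Python's standard urllib.robotparser module; the host www.example.nl, the /private/ path and the name SomeOtherBot are hypothetical, and the result shows how this particular parser matches agent names, which may differ in detail from the robots themselves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical record that applies to both WiseGuys agents and nobody else.
RULES = """\
User-agent: Vagabondo
User-agent: Bilbo
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

base = "http://www.example.nl"  # made-up host for the illustration

print(rp.can_fetch("Vagabondo", base + "/private/page.html"))     # False
print(rp.can_fetch("Bilbo", base + "/private/page.html"))         # False
print(rp.can_fetch("Vagabondo", base + "/index.html"))            # True
print(rp.can_fetch("SomeOtherBot", base + "/private/page.html"))  # True
```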