About webcrawlers
 

One of the tools WiseGuys uses to gather web data is a web agent, also known as a spider or (web) crawler. This is a program that automatically visits websites and gathers their information for further use. The data can then be used, for instance, to build one or more search engines.
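To give an idea of what such a program does, here is a minimal sketch in Python, using only the standard library. The URL is a placeholder, and a real agent would of course add politeness rules, a visit queue, and error handling:

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    # Collect the href attribute of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = urlopen("http://example.com/").read().decode("utf-8", errors="replace")
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # the links a crawler would queue for a later visit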


If you do not wish certain pages to be retrieved by such a spider or crawler, you can express this in one of two ways. The first method, robots.txt, only works if you place a file in the document root of your website. The second, meta tags, always works.


robots.txt

Most well-behaved web agents request the file /robots.txt before visiting a website. If this file exists, the web agent parses it. WiseGuys complies with this standard, but only if the file is present in the document root of the website: http://www.wise-guys.nl/robots.txt will be used, but http://www.wise-guys.nl/gfx/robots.txt will not.
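For illustration, this is how a compliant client could consult that file in Python: urllib.robotparser in the standard library implements the standard described below. The page path here is hypothetical, and rp.read() only succeeds if the site is reachable:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.wise-guys.nl/robots.txt")  # document root only
rp.read()  # fetch and parse the file, if it exists

# True or False, depending on the rules the site publishes:
print(rp.can_fetch("Vagabondo", "http://www.wise-guys.nl/some/page.html"))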

The complete specification, written by Martijn Koster, could originally be read at the WebCrawler website. It seems to have moved elsewhere since, but a cached version is available. The basic idea is simple: via a plain text file you tell the web agent that some parts of your web server are off-limits to it. These restrictions can apply to all web agents or to specific ones. This is useful if your pages turn up in search engines and you do not want them there.

For example:

# /robots.txt
User-agent: webcrawler
Disallow:

User-agent: ilse
Disallow: /

User-agent: *
Disallow: /stayout
Disallow: /tmp

The first line, beginning with '#', is a comment for human readers. The first record grants the robot called webcrawler access to the whole site: an empty Disallow directive means nothing is off-limits. You could in fact leave this record out, since web agents by definition treat a site as fully accessible when no rule applies to them. The second record states that the robot called ilse may not retrieve any path beginning with / . Because every page path starts with / , this in practice means that ilse may not visit the site at all. The last record states that all other robots must stay away from paths beginning with /stayout or /tmp .
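You can verify this reading of the example with Python's standard robotparser, which implements the same rules. This is just a check, not part of the standard itself, and the agent name anybot is made up:

from urllib import robotparser

rules = """\
# /robots.txt
User-agent: webcrawler
Disallow:

User-agent: ilse
Disallow: /

User-agent: *
Disallow: /stayout
Disallow: /tmp
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("webcrawler", "/stayout/page.html"))  # True: everything is allowed
print(rp.can_fetch("ilse", "/index.html"))               # False: the whole site is off-limits
print(rp.can_fetch("anybot", "/tmp/file.html"))          # False: matches the * record
print(rp.can_fetch("anybot", "/index.html"))             # True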


Note some frequently made errors:

  • robots.txt does not support regular expressions: you must use Disallow: /tmp/ instead of Disallow: /tmp/* (see the sketch after this list).
  • State only one path per Disallow: line.
  • Paths in a robots.txt file are like paths in a URL; in particular, they are case sensitive, so /Tmp and /tmp are different paths.
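The first point is easy to get wrong, because a Disallow value is a plain path prefix rather than a pattern. Here is a simplified sketch of the matching a robot performs (the function name is ours; real parsers also handle URL-encoding):

def disallowed(path, rule):
    # True if 'path' is blocked by a 'Disallow: <rule>' line.
    # An empty rule blocks nothing; otherwise plain prefix matching.
    return rule != "" and path.startswith(rule)

print(disallowed("/tmp/file.html", "/tmp"))   # True
print(disallowed("/tmp.html", "/tmp"))        # True: the prefix matches here too
print(disallowed("/tmp.html", "/tmp/"))       # False: only the directory is blocked
print(disallowed("/tmp/file.html", "/tmp/*")) # False: '*' is a literal character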

Meta tags

Instead of a global robots.txt file for the entire website, you can also tell web agents per page whether that page may be included in a search engine. It is also possible to state whether you want the spider to follow the links on the page. This is particularly useful if your web page queries a database that you do not want a spider to hit over and over again. Meta tags are part of the HTML 4 standard, and information about this standard can be read at the WWW Consortium website. A somewhat less technical explanation can be found here. Using these meta tags is very simple. In the HEAD section of a page, add an extra directive: <meta name="robots" content="noindex">


With this line you tell the robot not to index the page (not to include it in its search engine). The links on the page, however, will still be followed. Add nofollow to prevent that: <meta name="robots" content="noindex, nofollow">

Using only nofollow, without noindex, means that the page itself will be indexed, but the pages it links to will not be visited.
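As a sketch of how an agent might act on these directives, the tag can be read with Python's standard HTML parser. The class name is ours, not part of any standard:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Record the content of <meta name="robots" content="..."> if present.
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            content = a.get("content") or ""
            self.directives = {d.strip().lower() for d in content.split(",")}

parser = RobotsMetaParser()
parser.feed('<html><head><meta name="robots" content="noindex, nofollow"></head></html>')
print("noindex" not in parser.directives)   # may the page be indexed? False
print("nofollow" not in parser.directives)  # may its links be followed? False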


WiseGuys web agent

The WiseGuys web agent primarily visits Dutch (.nl) websites; WAP sites, however, are visited worldwide. The robot identifies itself with the string Vagabondo. An alternate identification is Bilbo, an older web agent that is still mainly used for WAP. These robots gather data mainly for parties such as Track.nl and Kobala.nl.
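If you prefer that these agents stay away from your site entirely, the robots.txt mechanism described above applies. A sketch, assuming the agents match the identification strings above as their robots.txt names:

User-agent: Vagabondo
Disallow: /

User-agent: Bilbo
Disallow: /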


If you have questions about our crawler, please contact us at +31 (0)40 293 80 17.

 