Robot Exclusion

SiteSurfer's indexer contains a robot--also known as a web crawler or web spider--that can traverse an entire web site or HTML tree by parsing pages to extract links to other pages.

Some web site authors may wish to restrict which pages robots chase for inclusion in an index. For instance, it would make little sense to index a page that will be deleted tomorrow.

There are two voluntary methods of excluding such robots from a web site. First, you may place a file named robots.txt in the root directory of the site that directs robots not to access certain areas of the site, using Disallow: and Allow: lines. The name must be all lowercase, not UPPERCASE or MixedCase. For instance, the DevTech web site might have:

http://www.devtech.com/robots.txt

However, a robots.txt file has no effect if it is placed in any directory other than the top level, such as:

http://www.geocities.com/myhomepage/robots.txt

Here the robots.txt file is not in the root of the web server, so robots will not look for it there and will not adhere to its content.
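
As a rough illustration of both points, the sketch below uses Python's standard urllib.robotparser module. The sample rules, the ExampleBot agent name, and the helper names robots_url and allowed are made up for this example; they are not taken from SiteSurfer or from any real site. This particular parser applies the first matching line, so the Allow: line is listed before the broader Disallow: line. Other robots may resolve overlapping rules differently.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Return the server-root robots.txt URL for any page on a site."""
    parts = urlsplit(page_url)
    # Only the scheme and host are kept; the page's own path is discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical robots.txt content, for illustration only.
SAMPLE_RULES = """\
User-agent: *
Allow: /private/summary.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_RULES.splitlines())


def allowed(path, agent="ExampleBot"):
    """Report whether the sample rules let the given agent fetch a path."""
    return parser.can_fetch(agent, path)


# A page in a subdirectory still maps to the robots.txt at the server root.
print(robots_url("http://www.geocities.com/myhomepage/index.html"))
print(allowed("/index.html"))
print(allowed("/private/other.html"))
print(allowed("/private/summary.html"))
```

Under these assumed rules, everything below /private/ is off-limits except the one page that is explicitly allowed, and pages outside that area are unaffected.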

Second, a page may contain a special META tag that tells a robot not to index the page, not to chase links from the page, or both. This method is more flexible for web site owners who do not have permission to modify the web server's robots.txt file. A typical use of this META tag looks like this:

<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</HEAD>
<BODY>...

The above example directs robots to neither index this page, nor follow links from this page. There are four possible combinations for the CONTENT parameter:

"NOINDEX, NOFOLLOW"
"NOINDEX, FOLLOW"
"INDEX, NOFOLLOW"
"INDEX, FOLLOW"

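To show how a robot might read these values, here is a minimal sketch using Python's standard html.parser module. The class name RobotsMetaParser and the function robots_directives are invented for this example and do not describe SiteSurfer's own implementation. The sketch assumes that a page with no ROBOTS meta tag is treated as "INDEX, FOLLOW".

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the index/follow flags from a page's ROBOTS meta tag."""

    def __init__(self):
        super().__init__()
        # Assumed default when no tag is present: index and follow.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token == "noindex":
                self.index = False
            elif token == "index":
                self.index = True
            elif token == "nofollow":
                self.follow = False
            elif token == "follow":
                self.follow = True


def robots_directives(html):
    """Return (may_index, may_follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<HTML>\n<HEAD>\n<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">\n</HEAD>\n<BODY>...'
print(robots_directives(page))
```

The name and the content values are compared case-insensitively, so "noindex, nofollow" is handled the same way as the uppercase form shown above.
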
More information on using robots.txt and the ROBOTS meta tag can be found at:

http://info.webcrawler.com/mak/projects/robots/exclusion.html

By default, SiteSurfer will respect these conventions for indexing web sites and other HTML trees. However, you may use the Robots notebook page to turn off recognition of robots directives. This could be useful, for instance, when indexing your own site for use with SiteSurfer, and you want to index some pages that are off-limits to other robots. Remember, under most circumstances adhering to these protocols is the courteous thing to do.

Additionally, SiteSurfer will respect the robots meta tag in all HTML content, but will not process robots.txt in the Local Mirror mode.
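
The behavior described in the last two paragraphs can be summarized in a small sketch. This is one reading of the text above, not SiteSurfer's actual code: the names IndexerSettings, respect_robots, local_mirror, and checks_to_apply are hypothetical, and the sketch assumes that turning off recognition on the Robots notebook page disables both robots.txt and the meta tag.

```python
from dataclasses import dataclass


@dataclass
class IndexerSettings:
    # Hypothetical setting names; they do not come from SiteSurfer itself.
    respect_robots: bool = True   # default: honor the robots conventions
    local_mirror: bool = False    # True when indexing in Local Mirror mode


def checks_to_apply(settings):
    """Return (use_robots_txt, use_meta_tag) for one indexing run."""
    if not settings.respect_robots:
        # Assumption: switching recognition off disables both mechanisms.
        return (False, False)
    # robots.txt is skipped in Local Mirror mode; the meta tag still applies.
    return (not settings.local_mirror, True)


print(checks_to_apply(IndexerSettings()))
print(checks_to_apply(IndexerSettings(local_mirror=True)))
print(checks_to_apply(IndexerSettings(respect_robots=False)))
```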