A crawler should always follow the Robots Exclusion Protocol, so whenever it arrives at a web site to crawl it, it first checks the robots.txt file at the root of the site:
www.yourdomain.com/robots.txt
Once it has processed the robots.txt file, it proceeds to the rest of your site, usually starting at the index file and following links from there. Most web sites have areas that do not need to be crawled, such as the images directory or data directories, and these are what you list in your robots.txt file.
The "/robots.txt" file is simply a text file that contains one or more records. A single record looks like this:
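Because the file always lives at the root of the host, a crawler can derive its location from any page URL. The following is a minimal Python sketch of that step; the function name robots_url and the example page path are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page on the example domain used in this article.
location = robots_url("http://www.yourdomain.com/images/logo.png")
print(location)
```

A real crawler would download this URL once per host and consult the parsed rules before requesting any other page.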
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the web site.
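To see how a crawler might interpret this record, here is a small sketch using urllib.robotparser from the Python standard library. The robot name "AnyBot" and the page URL are assumptions made up for the example; the record itself is the one shown above.

```python
from urllib.robotparser import RobotFileParser

# The record from above, supplied as a list of lines instead of a download.
record = [
    "User-agent: *",
    "Disallow: /",
]

rules = RobotFileParser()
rules.parse(record)

# "AnyBot" is a hypothetical robot name; the "*" record applies to it.
allowed = rules.can_fetch("AnyBot", "http://www.yourdomain.com/index.html")
print(allowed)  # False: every path starts with "/", so nothing may be visited
```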
A basic robots.txt example
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
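The basic example can be checked the same way. This sketch, again using Python's standard urllib.robotparser, assumes a hypothetical robot called "ExampleBot" and made-up page paths; it shows that only URLs beginning with a listed prefix are excluded.

```python
from urllib.robotparser import RobotFileParser

basic = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /junk/",
]

rules = RobotFileParser()
rules.parse(basic)

site = "http://www.yourdomain.com"
# Hypothetical paths: three under disallowed prefixes, one ordinary page.
for path in ["/cgi-bin/search.pl", "/tmp/cache.html", "/junk/old.html", "/index.html"]:
    verdict = "allowed" if rules.can_fetch("ExampleBot", site + path) else "blocked"
    print(path, verdict)
```

Matching here is by path prefix, so "/index.html" stays crawlable while everything below the three listed directories is skipped.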
Allowing a single crawler
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
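A two-record file like this relies on the more specific record taking priority over the "*" record. The sketch below illustrates that with Python's urllib.robotparser. It assumes the favoured crawler identifies itself as "Googlebot" (the token Google's main crawler uses), and "SomeOtherBot" is a hypothetical name for everyone else.

```python
from urllib.robotparser import RobotFileParser

single_crawler = [
    "User-agent: Googlebot",   # assumed token for the one permitted crawler
    "Disallow:",               # an empty Disallow excludes nothing
    "",
    "User-agent: *",
    "Disallow: /",
]

rules = RobotFileParser()
rules.parse(single_crawler)

page = "http://www.yourdomain.com/index.html"
google_ok = rules.can_fetch("Googlebot", page)
other_ok = rules.can_fetch("SomeOtherBot", page)  # hypothetical robot name
print(google_ok, other_ok)
```

The named record is consulted first, and the "*" record is used only for robots that no other record matches.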
To exclude a single robot
User-agent: BadBot
Disallow: /
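The reverse case can be sketched too. With only a BadBot record in the file, robots that match no record are not restricted at all. This example uses Python's urllib.robotparser, and "FriendlyBot" is a hypothetical name standing in for any other robot.

```python
from urllib.robotparser import RobotFileParser

exclude_one = [
    "User-agent: BadBot",
    "Disallow: /",
]

rules = RobotFileParser()
rules.parse(exclude_one)

page = "http://www.yourdomain.com/index.html"
bad_ok = rules.can_fetch("BadBot", page)
friendly_ok = rules.can_fetch("FriendlyBot", page)  # hypothetical robot name
print(bad_ok, friendly_ok)
```

Keep in mind that the protocol is advisory: it only keeps out robots that choose to read and obey the file.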
