Web robots (also called crawlers or spiders) are programs that traverse the Web automatically. Search engines use them to index web content.

Whenever a search engine robot wants to crawl a website, it first looks for robots.txt, the file where we write instructions telling robots/crawlers what they should and shouldn't crawl. Some folders may be private, and we may not want them crawled or indexed by these robots. Keep in mind that well-behaved robots follow these rules voluntarily, so robots.txt is not a substitute for access control. We can also specify the sitemap path in robots.txt.

robots.txt must always be placed in the top-level directory of your web server. If http://www.example.com/ is the domain, then robots.txt should be reachable at http://www.example.com/robots.txt
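Since the location is fixed, we can work out where robots.txt lives for any page on a site. Here is a small Python sketch of that idea; the function name robots_url is just something I made up for illustration, and it assumes the page URL includes the scheme (http:// or https://).

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper: keeps the scheme and host, and replaces the
    path, query string and fragment with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://www.example.com/account/index.html"))
```

Whatever page we start from, the result always points at the top level of the same host.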

Let's start writing robots.txt:

Exclude all robots from crawling my entire website
User-agent: *
Disallow: /

Exclude Google from crawling my entire website
User-agent: Googlebot
Disallow: /

Exclude specific folders/files from crawling
User-agent: *
Disallow: /admin
Disallow: /account/index.html
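To check how a crawler might read the folder/file rules above, we can feed them to Python's standard urllib.robotparser module. This is only a sketch: the helper is_allowed and the robot name "MyBot" are made up for illustration, and real search engines may interpret some rules differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The "exclude specific folders/files" rules from above.
RULES = """\
User-agent: *
Disallow: /admin
Disallow: /account/index.html
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: may user_agent fetch url under these rules?"""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse text directly, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/admin/login", "/account/index.html"):
        url = "http://www.example.com" + path
        print(path, is_allowed(RULES, "MyBot", url))
```

Python's parser matches Disallow values as path prefixes, so Disallow: /admin also covers /admin/login, while /account/settings.html stays crawlable because only /account/index.html is listed.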


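robots.txt can also be used to specify the sitemap path. The Sitemap line takes a full URL and is independent of the User-agent lines, so it can appear anywhere in the file.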

Specify sitemap path
User-agent: *
Sitemap: http://ganeshhs.com/sitemap.xml
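We can also read the sitemap path back out of a robots.txt file with the same standard module. A minimal sketch, assuming Python 3.8 or newer (where RobotFileParser.site_maps() is available); the helper name list_sitemaps is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The sitemap example from above.
RULES = """\
User-agent: *
Sitemap: http://ganeshhs.com/sitemap.xml
"""


def list_sitemaps(rules: str) -> list:
    """Hypothetical helper: return the Sitemap URLs declared in rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(list_sitemaps(RULES))
```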
