What is a robots.txt file?

Search Engine Robots

Google Bots

Google sends out the following bots:

  • APIs-Google (For API developers.)
  • AdsBot-Google
    • Checks desktop web page ad quality.
  • AdsBot-Google-Mobile
    • Checks Android web page ad quality.
    • Checks iPhone web page ad quality.
  • AdsBot-Google-Mobile-Apps
    • Checks Android app page ad quality.
    • Obeys AdsBot-Google robots rules.
  • DuplexWeb-Google
    • Duplex on the web may ignore the * wildcard.
  • FeedFetcher-Google
    • Feedfetcher doesn’t respect robots.txt rules.
  • Google-Read-Aloud
    • Google Read Aloud doesn’t respect robots.txt rules.
  • Googlebot
    • Also used by Google Favicon, which ignores robots.txt rules for user-initiated requests.
  • Googlebot-Image
    • Also used by Google Favicon, which ignores robots.txt rules for user-initiated requests.
  • Googlebot-News
  • Googlebot-Video
  • googleweblight
    • Web Light doesn’t respect robots.txt rules.
  • Mediapartners-Google
  • Storebot-Google

You can follow this link for more information on the full list of Google bots.

The bots follow links from other websites and website sitemap files that are submitted through Google Search Console.

Follow this link for more information on crawling and Google bots.
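
The names in the list above are the user-agent tokens you can use to address a particular Google bot in a robots.txt file. As a minimal sketch, assuming you wanted to stop Googlebot-Image from crawling a hypothetical /private-images/ folder, the rule could look like this:

# /private-images/ is a hypothetical folder, used for illustration only
User-agent: Googlebot-Image
Disallow: /private-images/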

Bing Bots

Bing sends out bots called Bingbot, AdIdxBot, and BingPreview, which follow links from other websites and website sitemap files that are submitted through Bing Webmaster Tools.

Follow this link for more information on how Bing crawls and the bots that they use.

Robots.txt File

If you decide that you do not want search engines to crawl your website, or parts of it, then you will have to add a robots.txt file to your website.

User-agent: Googlebot
Disallow: /example-subfolder/

User-agent: Bingbot
Disallow: /example-subfolder/

Sitemap: https://yourwebsite.com/sitemap.xml

The above example is a robots.txt file that tells Googlebot and Bingbot not to crawl anything in the /example-subfolder/ directory of the website.

The robots.txt file should be placed in the top-level (root) directory of your website, like so, and the filename is case sensitive, so it has to be robots.txt, not Robots.txt.

https://yourwebsite.com/robots.txt

If you have a subdomain like blog.yourwebsite.com, then another robots.txt file specifically for that subdomain has to be added there:

https://blog.yourwebsite.com/robots.txt
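
The rules in the subdomain's file are independent of the main site's file. As a sketch, a hypothetical robots.txt for blog.yourwebsite.com could block a /drafts/ folder without affecting yourwebsite.com at all:

# Hypothetical robots.txt served at https://blog.yourwebsite.com/robots.txt
User-agent: *
Disallow: /drafts/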

Since the robots.txt file is accessible for everyone to read and is easy to find, you will not want to put sensitive information in it.
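
For example, a rule like the one below (the path is hypothetical) would point anyone who reads the file straight at the area you are trying to hide, so protect private pages by other means, such as authentication:

# Do not do this: the path is hypothetical and would be visible to anyone
User-agent: *
Disallow: /secret-admin-area/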

Here are some more examples of what you can add to your robots.txt file.

Block all crawlers from all content.

User-agent: *
Disallow: /

Allowing all crawlers access to all content.

User-agent: *
Disallow:

Blocking Googlebot from accessing a specific subfolder on your website.

User-agent: Googlebot
Disallow: /subfolder/

Blocking Googlebot from a specific page on your website.

User-agent: Googlebot
Disallow: /subfolder/block-page.html

Blocking specific file types, such as PDF files, from being accessed.

User-agent: *
Disallow: /*.pdf$

Linking your sitemap in your robots.txt file, usually at the bottom.

User-agent: *
Sitemap: https://pagedart.com/sitemap.xml

Delaying the crawling of your website. This example sets a delay of 10 seconds, and you can enter a value from 1 to 30 seconds. Google does not follow this rule.

User-agent: *
Crawl-delay: 10


Related Content:

How do search engines crawl your website?
How do search engines index your website?
What is a sitemap?
How do search engines rank your website?
How to get Indexed and Ranked Faster?