How to Control Search Engine Robots

Posted by Michael Rock | SEO Consultation | Wednesday 13 September 2006 3:35 pm

Wouldn’t it be nice to be able to leave some code in your web site to tell the search engine spider crawlers to make your site number one? Unfortunately a robots.txt file or robots meta tag won’t do that, but they can help the crawlers to index your site better and block out the unwanted ones.

First, a few definitions:

Search Engine Spiders or Crawlers - A web crawler (also known as web spider) is a program which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, recursively browsing the Web according to a set of policies.
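The visit-and-collect loop described above can be sketched in a few lines of Python. This is only a rough illustration, not how any particular search engine works: the names LinkExtractor, crawl, and FAKE_WEB are made up for this example, the only "policy" is a page limit, and a small in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first, starting from seed_urls.

    fetch is any callable that takes a URL and returns its HTML as a
    string, or None if the page is unavailable (hypothetical interface).
    """
    to_visit = deque(seed_urls)   # the list of URLs still to visit
    seen = set(seed_urls)         # every URL ever added to that list
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                to_visit.append(absolute)
    return visited


# A tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">Home</a>',
}

print(crawl(["http://example.com/"], FAKE_WEB.get))
```

A real crawler would also check robots.txt before fetching each page, which is what the rest of this article is about.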

Robots.txt - The robots exclusion standard or robots.txt protocol is a convention to prevent well-behaved web spiders and other web robots from accessing all or part of a web site. The parts that should not be accessed are specified in a file called robots.txt in the top-level directory of the web site.

The robots.txt protocol is purely advisory, and relies on the cooperation of the web robot, so that marking an area of your site out of bounds with robots.txt does not guarantee privacy. Many web site administrators have been caught out trying to use the robots file to make private parts of a web site invisible to the rest of the world. However the file is necessarily publicly available and is easily checked by anyone with a web browser.

The robots.txt patterns are matched by simple substring comparison against the start of the path, so take care to append the final ‘/’ character to patterns meant to match directories; otherwise every file whose name starts with that substring will match, not just the files in the intended directory.
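To see why the trailing slash matters, here is a small check using the robots.txt parser from Python's standard library (urllib.robotparser). The /private directory, the file names, and the ExampleBot crawler name are made up for illustration, and other crawlers may interpret rules slightly differently than this parser does.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, path, user_agent="ExampleBot"):
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# The same rule, without and with the final slash.
WITHOUT_SLASH = "User-agent: *\nDisallow: /private"
WITH_SLASH = "User-agent: *\nDisallow: /private/"

for path in ["/private/report.html", "/private-offers.html"]:
    print(path)
    print("  allowed without slash:", is_allowed(WITHOUT_SLASH, path))
    print("  allowed with slash:   ", is_allowed(WITH_SLASH, path))
```

Without the slash, the hypothetical page /private-offers.html is blocked along with the directory, which is probably not what the site owner intended.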

Meta Tag - Meta tags are HTML elements that provide structured data about a page, in other words data about data.

In the early 2000s, search engines veered away from reliance on Meta tags, as many web sites used inappropriate keywords or stuffed their tags with keywords to obtain any and all traffic possible.

Some search engines, however, still take Meta tags into some consideration when delivering results. In recent years, search engines have become smarter, penalizing web sites that are cheating (by repeating the same keyword several times to get a boost in the search ranking). Instead of going up rankings, these web sites will go down in rankings or, on some search engines, will be kicked off of the search engine completely.

Index a site - The act of ‘crawling’ your site and gathering information.

How can the robots.txt file and Meta tag help you?

In the robots.txt file you can tell the harmful ‘web crawlers’ to leave your web site alone, and give helpful hints to the ones you want to crawl your site. Here is an example of how to stop a web crawler from crawling your site:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

ia_archiver is the crawler name for the Wayback Machine, which you may have heard of, and the / after Disallow tells ia_archiver not to index any of your site. A line starting with # (#<message here>) is a comment, so you can leave notes to yourself and keep track of what you typed.

Type the above three lines into Notepad on your computer and save the file to the root directory of your web site as robots.txt. Web crawlers look for this file first at a web site before doing anything else. This helps the crawler do its job, and lets the web site owner tell the spider what to do.

Or, if you would like to have the robots.txt file created for you, visit www.rietta.com/robogen. To validate your robots.txt file and make sure it works properly, you can visit www.searchengineworld.com/cgi-bin/robotcheck.cgi.

Say, for instance, you have some content that you don’t want the crawlers to see, such as duplicate content for other browser referrer pages. You can deter crawlers from indexing the ‘duplicate’ directory by typing this into your robots.txt file:

User-agent: *
Disallow: /duplicate/

The * after User-agent says that this rule applies to all crawlers, and /duplicate/ after Disallow tells all crawlers to ignore this directory and not crawl it. Each group of User-agent and Disallow lines must be separated from the next group by a blank line in order to function correctly. Here is how you would combine the two rules above in one robots.txt file:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
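If you want to check a file like this on your own computer before uploading it, Python's standard library includes a robots.txt parser (urllib.robotparser). The sketch below feeds it the combined file from above. ExampleBot and the page paths are made-up names, and a real crawler may read the rules slightly differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# (crawler name, path) pairs to test; ExampleBot is a hypothetical crawler.
checks = [
    ("ia_archiver", "/index.html"),
    ("ia_archiver", "/duplicate/page.html"),
    ("ExampleBot", "/index.html"),
    ("ExampleBot", "/duplicate/page.html"),
]
for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(f"{agent:12} {path:22} {verdict}")
```

With this file, ia_archiver is blocked everywhere, while every other crawler is blocked only inside /duplicate/.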

One thing to note that is very important:

Anyone can access the robots.txt file of a site. So if you have information that you don’t want anyone to see, don’t list it in the robots.txt file. If the directory that you don’t want anyone to see is not linked to from your web site, the crawlers won’t index it anyway.

An alternative way to block indexing is to put a meta tag into the page itself. It looks like this: <meta name="robots" content="noindex,nofollow">

You put this into the <head> section of your web page. This line tells the robot crawlers not to index the page and not to follow any of the hyperlinks on the page. As another example, <meta name="robots" content="noindex,follow"> tells the robot crawlers not to index the page, but to follow the hyperlinks on it.
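To show how a crawler might read this tag, here is a minimal sketch using Python's built-in html.parser module. The class name RobotsMetaParser, the function robots_directives, and the sample page are invented for this example; real crawlers have their own, more elaborate parsing.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list.
            for item in attributes.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# A made-up page carrying the second example tag from above.
PAGE = """<html><head>
<title>Example page</title>
<meta name="robots" content="noindex,follow">
</head><body><a href="/next.html">Next</a></body></html>"""

directives = robots_directives(PAGE)
print("index this page:", "noindex" not in directives)
print("follow its links:", "nofollow" not in directives)
```

Note that the tag only works if the attribute values use plain straight quotes; curly quotes pasted from a word processor will not be read as quotes.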

Did you know that Google has its own <meta> tag?

It looks like this: <meta name="googlebot" content="noindex,nofollow,noarchive"> This tells the Google robot crawler not to index the page, not to follow any of the links, and not to store cached versions of your page. You will want noarchive if you update the content on your site frequently. It prevents web users from seeing outdated content served from the cache.
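If you generate your pages from a script, a small helper can build these tags and catch typos in the directive names. This is only a sketch: the function robots_meta_tag is made up for this example, and its list of directives covers just the ones mentioned in this article, not everything a search engine may support.

```python
from html import escape

# Only the directives used in this article; not an exhaustive list.
KNOWN_DIRECTIVES = {"noindex", "follow", "nofollow", "noarchive"}


def robots_meta_tag(crawler, directives):
    """Build a robots <meta> tag for the given crawler name."""
    unknown = [d for d in directives if d not in KNOWN_DIRECTIVES]
    if unknown:
        raise ValueError("unknown directive(s): " + ", ".join(unknown))
    content = ",".join(directives)
    return f'<meta name="{escape(crawler)}" content="{escape(content)}">'


print(robots_meta_tag("googlebot", ["noindex", "nofollow", "noarchive"]))
print(robots_meta_tag("robots", ["noindex", "follow"]))
```

A misspelled directive such as "noarchve" raises an error here instead of silently ending up in your page.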

You can use the <meta> tag to specifically talk to Google’s robots to avoid complications or if you are optimizing your site for Google’s search engine. This concludes this month’s article.

Until the next article, have a great day!

Michael Rock
Web Development Contractor

The Internet Presence
(Web Site Design Service)

Web Ranking Consultants
(Search Engine Optimization Firm)


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.

You may reproduce this document on your web site if you include the author’s bylines and links.
You may link to this page from your web site.

Yahoo! launches ‘social search’ in Britain

Posted by Michael Rock | SEO Consultation | Tuesday 5 September 2006 3:38 pm

“Yahoo! will launch a service today that allows users to ask other people’s advice, when looking for anything from a good hotel or bar to an apple pie recipe, rather than rely solely upon electronically generated search results.”

With Yahoo now joining the social search network, it won’t be long before SEO as we know it dies and has to merge with SEM. In my research on basic web site structure for SEO, I have found that comparing the top ten results for a search term in Google, Yahoo, and MSN gives surprising results. In Yahoo and MSN you can see that keyword density and web site structure produce easily measurable results that you can simply outdo to get to the top. MSN is the easiest, Yahoo is next because backlinks come into effect more strongly, and Google is last.

Comparing the same basic web site structure in Google tells a totally different story. You will find that you cannot rely on this data at all! Keyword density appears to have no effect on Google results, and neither do the structure, proximity, and placement of the keywords on the page. All of these factors are still important in Google, but they play a smaller part than they used to. Sometimes you will find top positions in Google held by pages that barely even mention the keyword phrase in the copy. This is because Google grabs information from different sources to help determine high rankings.

Some of the things that Google considers are:

  1. Age of the Domain Name
  2. Age of the Web Site
  3. Age of the Backlinks Pointing to the Site
  4. Popularity of the Site through use of Social Bookmarking
  5. Popularity of the Site through ISP Information
  6. Popularity of the Site through Google’s Browser Toolbar

Did you know that your ISP (Internet Service Provider) sells the browsing habits of its customers to the search engines?

This information is applied to Google’s algorithm as well.

‘Social Bookmarking’ or ‘Tagging’ is a method of bookmarking a site online with services such as Digg, Furl, and Del.icio.us. Google takes information from these sites to determine the popularity of a site and applies it to its algorithm. This is why I see SEO moving away from strict keyword density results and merging with SEM, which applies ways to gain web site popularity. You can read more about this in this search engine placement article that I wrote.

With Yahoo launching a ‘social search’ in Britain, that tells me that Yahoo won’t be too far down the road in following Google’s footsteps in bringing more accurate and relevant searches.

The next thing in SEO news is Google’s CEO joining Apple’s Board of Directors.

After reading this article, I wonder whether Google will become more integrated with Apple. It is true that Eric Schmidt would bring great experience to the board, but it makes me wonder what will come of this. Did Google’s idea of creating ‘Google Computers’ phase out? Will Google become the default search engine on Apple computers?

We will find out more in the future.

This is Michael Rock signing off. Have a great day!
