Robots.txt Guide to Help You Get Started

Robots.txt is one of the simplest files to create, yet one of the easiest to get wrong.

Using the robots.txt file, you can exclude a page from getting crawled, and hence stop it from building SEO value.

Having a proper robots file is an essential element of on-page SEO. In this guide, we will uncover the A to Z of the robots.txt file and how you should approach it for your website.

A. What is a robots.txt file?

Robots.txt is a text file that resides on your web server and is used by web crawlers (primarily but not limited to search engines) to understand how to crawl your website.

Image: the robots.txt file tells search engines which pages to crawl and which to avoid

Essentially, it consists of rules that are part of the REP (Robots Exclusion Protocol), a standard protocol to communicate with web crawlers and web bots.

REP enables webmasters to define which parts of their website are available for crawling and which are not, so that crawlers can act accordingly.

Please note that not all web robots and crawlers cooperate with the instructions mentioned in this file.

Spambots, malware, email harvesters, security crawlers, spying bots, etc. may ignore the instructions in the robots.txt file.

However, all major search engines (including Google) respect the instructions in the robots.txt file and request your web pages accordingly.

B. How does the robots.txt file work?

Once the search engine crawler lands on a website, it looks for the robots.txt file in the root folder.

If it finds one, it will act as per the instructions mentioned in the robots file and crawl only the allowed parts of the website.

However, if there are no specific instructions or the robots file is missing, the search crawler may crawl all the content available on the website.

C. What is the need for a robots.txt file?

Irrespective of the size of your website, having a robots.txt file is good for your SEO.

This file becomes crucial as your website grows in size and you need more control over the ways search engines are crawling your pages.

Some of the top use cases of the robots.txt file (illustrated in the sample file after this list) include:-

  • Making parts of your website hidden from search engines (e.g. your test environment)
  • Blocking the duplicate content from getting crawled (though I would suggest you use canonical tags for this)
  • Blocking search bots from crawling specific content assets available on your website (e.g. PDFs, images, videos, etc.)
  • Mentioning the link to your sitemap and pointing the search crawlers to it
  • Saving your crawl budget by hiding the less important pages (e.g. policy page, terms and conditions page, etc.)
  • Ensuring your web server doesn’t get overloaded by web crawlers
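
To make this concrete, here is a minimal sketch of a robots.txt file that combines a few of these use cases. The /test-environment/ and /pdfs/ folders and the sitemap URL are hypothetical placeholders; substitute your own paths.

User-agent: *
# Hypothetical test environment kept away from crawlers
Disallow: /test-environment/
# Hypothetical folder holding PDF assets
Disallow: /pdfs/

# Point crawlers to the sitemap (placeholder URL)
Sitemap: https://www.example.com/sitemap.xml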

D. Robots.txt syntax

In its simplest form, a robots.txt file looks as follows:-

User-agent: [User Agent Name1]
[Directive #1]
[Directive #2]
[More Directives]

User-agent: [User Agent Name2]
[Directive #1]
[Directive #2]
[More Directives]

Sitemap: [Complete URL of your sitemap]

If you observe carefully, there are two major components in each block of the robots file: the user-agent and its directives. Let's discuss each of them one by one.

#1. User-agent

User-agent refers to the web crawler for which the rule has been written.

Here are some of the well-known user agents:-

  • Google search: Googlebot
  • Google News: Googlebot-News
  • Google Adsense: Mediapartners-Google
  • Bing: Bingbot
  • Yahoo: Slurp
  • DuckDuckGo: DuckDuckBot
  • Yandex: YandexBot
  • Alexa: ia_archiver

Do note that directive values (the paths you allow or disallow) are case-sensitive. User-agent names, on the other hand, are matched case-insensitively by major crawlers like Google, though it is safest to copy them exactly as documented.

Also, each user agent should have its own set of directives that define the way that a particular crawler should act.

If you want common rules for all agents, use an asterisk (*) as the user-agent.
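
For instance, here is a small sketch with one group for Googlebot and one catch-all group for every other crawler; the /downloads/ and /private/ paths are hypothetical placeholders.

User-agent: Googlebot
# Hypothetical path blocked only for Googlebot
Disallow: /downloads/

User-agent: *
# Hypothetical path blocked for all other crawlers
Disallow: /private/

A crawler follows the most specific group that matches its name, so in this sketch Googlebot would obey only its own group and ignore the catch-all rules.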

#2. Directives

There are three major directives that are in use today:-

a. The ‘Disallow’ directive 

This directive defines which pages should be skipped by that particular user agent. It works both at a specific URL level and at the directory level.

For example, the below code is a directive to all bots not to crawl the policies folder of the website.

User-agent: *
Disallow: /policies

b. The ‘Allow’ directive

If you want just one URL within a disallowed subfolder to be crawled, you can use the Allow directive.

For example, the below code is a directive to all bots not to crawl the policies folder of the website, except the refund policy page.

User-agent: *
Disallow: /policies
Allow: /policies/refund

c. Sitemap

Mentioning the link to your sitemap within the robots file is optional but a good practice to follow.

Robots.txt is the first file that the search crawler looks for after landing on your website.

The availability of the sitemap URL makes the crawler’s job easy as it can use the sitemap to build an understanding of your website content.

Just FYI, Google can already pick up your sitemap if you submit it via Search Console, but listing it in robots.txt is still good practice for other user agents.
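
For reference, the sitemap reference is a single line with the absolute URL of your sitemap (the domain below is a placeholder):

# Placeholder URL; use your own sitemap location
Sitemap: https://www.example.com/sitemap.xml

You can repeat the Sitemap line if your site has more than one sitemap.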

E. Robots.txt file examples

Here are a few examples of robots.txt files:-

Example 1: Allowing all web crawlers to access all parts of the website

User-agent: *
Disallow:

Example 2: Blocking all web crawlers from the website

User-agent: *
Disallow: /

Example 3: Blocking all web crawlers from accessing the staging server, and Pingdom bot from accessing the entire website.

User-agent: *
Disallow: /staging1

# Block specific robot completely
User-agent: Pingdom
Disallow: /

F. Where does robots.txt file reside?

Having a robots.txt file somewhere on your server is not enough. It needs to reside in the root folder of your website; placing it in a subfolder renders it useless.

That means you should be able to see the file's content when you enter your root domain followed by /robots.txt (case sensitive).

If you encounter a 404 error, or if the URL to access your robots file differs from the pattern above, consider that your robots file is not set up properly.
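
As a quick sketch (with example.com standing in for your own domain), the first URL below is where crawlers look for the file, while a copy sitting in a subfolder is simply ignored:

https://www.example.com/robots.txt (correct: root of the host)
https://www.example.com/blog/robots.txt (ignored by crawlers)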

G. Does robots.txt keep your pages out of Google's search results?

Please note that the robots.txt file is not a tool to hide your website from Google search results.

Google can still index such a page (without crawling it) if it is linked to from other web pages. This is clearly mentioned in Google's documentation, as shown below:-

Screenshot: Google Search Console Help documentation on robots.txt

If you want a page kept out of Google's index, it is better to use a noindex meta directive (and make sure the page is not blocked in robots.txt, so that Google can actually see the tag).
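
For reference, the noindex directive goes in the page's <head> section; this is the standard form of the tag that Google recognizes:

<meta name="robots" content="noindex">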

Here's how your web page may look on SERPs if Google indexes it despite it being disallowed via the robots file.

Screenshot: SERP appearance of a page indexed despite being disallowed via the robots file

If you want more information on how Google interprets the robots.txt file, check here.

H. How to create a robots.txt file for WordPress?

  • Using plugins

You can use WordPress SEO plugins (like Yoast) to create and edit the robots.txt file. Check the detailed tutorial here.

  • Manual process

Simply open a text editor, write the instructions in it, and save the file as robots.txt.

Next, upload it to the root directory of your server.
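
As a starting point, here is a commonly used minimal robots.txt for a WordPress site; treat the sitemap URL as a placeholder and adjust the rules to your own setup:

User-agent: *
Disallow: /wp-admin/
# Keep the AJAX endpoint crawlable so front-end features render correctly
Allow: /wp-admin/admin-ajax.php

# Placeholder URL; Yoast, for example, generates a sitemap_index.xml
Sitemap: https://www.example.com/sitemap_index.xml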

I. Best practices for robots.txt file

  • There should be only one robots.txt file per website, and it must reside in the root folder.
  • The file name is case-sensitive, i.e. /robots.txt and /Robots.txt are two distinct files, and only the former (all lowercase) is valid.
  • The maximum file size allowed by Google for robots.txt is 500 KB. Everything beyond this limit gets ignored.
  • You can’t compress a robots.txt file. You have to optimize the directives to reduce its size.
  • There should be a separate robots.txt file for every subdomain.
  • It’s a best practice to write each instruction on a separate line.
  • Make the job of search bots easy by including your sitemap(s) path within the robots file.
  • Any text mentioned after a hash (#) represents a comment in that line.

J. Frequently Asked Questions (FAQs)

#1. Is it necessary to have a robots.txt file?

It's not necessary to have a robots.txt file, but it is a recommended practice. Search engine bots look for this file before they crawl a website, and its presence helps ensure a proper crawl.

 

#2. What is the crawl delay directive in robots.txt file?

It is an unofficial directive that slows down the crawling rate of the search crawlers so that the web server is not overloaded.

Google does not support the crawl-delay directive, so for Googlebot you have to manage the crawl rate through other means (such as the crawl rate settings in Search Console).
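
For crawlers that do honor it (Bing, for instance, documents support), the directive is written as a number of seconds to wait between requests; the value below is just an illustrative choice:

User-agent: Bingbot
# Ask Bingbot to wait 10 seconds between requests (illustrative value)
Crawl-delay: 10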

 

#3. What would happen if I don't have a robots.txt file?

If search crawlers cannot find a robots.txt file in the root directory, they will assume the entire website is open for crawling and will crawl everything they can discover.

 

#4. How to check whether a robots.txt file is working or not?

You can check with this free robots.txt Tester tool provided by Google.

 

#5. Can I prevent other people from reading my robots.txt file?

No, the robots.txt file is accessible to everyone. If you do not want to disclose some files or folders to the public, then do not include them in the robots.txt file. 

K. Final words

Robots.txt file is simple yet extremely powerful.

One mistake can hide your entire website from search engine results.

Therefore, tread cautiously when making changes to this file.
