Robots.txt is one of the simplest and easiest files to create, yet one of the most commonly messed up.
Using a robots.txt file, you can block a page from being crawled, which stops it from building its SEO value.
In this guide, we will uncover the A to Z of the robots.txt file and how you should approach it for your website.
A. What is a robots.txt file?
Robots.txt is a text file that resides on your web server and is used by web crawlers (primarily but not limited to search engines) to understand how to crawl your website.
Essentially, it consists of rules that are part of the REP (Robots Exclusion Protocol), a standard protocol to communicate with web crawlers and web bots.
REP enables webmasters to define which parts of their website are available for crawling and which are not, so that crawlers can act accordingly.
Please note that not all web robots and crawlers cooperate with the instructions mentioned in this file.
Spambots, malware bots, email harvesters, security scanners, spying bots, etc. may ignore the instructions within the robots.txt file.
However, all major search engines (including Google) respect the instructions in the robots.txt file and request your web pages accordingly.
B. How does robots.txt file work?
Once the search engine crawler lands on a website, it looks for the robots.txt file in the root folder.
If it finds one, it will act as per the instructions mentioned in the robots file and crawl only the allowed parts of the website.
However, if there are no specific instructions or the robots file is missing, the search crawler may crawl all the content available on the website.
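This decision process can be sketched with Python's standard-library robots.txt parser (the rules and URLs below are purely illustrative):

```python
from urllib import robotparser

# Parse a robots.txt body the way a cooperative crawler would.
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /staging/
""".splitlines())

# A page outside the disallowed folder may be fetched...
print(rp.can_fetch("*", "https://example.com/blog/post"))     # True
# ...while anything under /staging/ is off limits.
print(rp.can_fetch("*", "https://example.com/staging/page"))  # False
```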
C. What is the need for a robots.txt file?
Irrespective of the size of your website, having a robots.txt file is a good idea.
This file becomes crucial as your website grows in size and you need more control over the ways search engines are crawling your pages.
Some of the top use cases of robots file include:-
- Making parts of your website hidden from search engines (e.g. your test environment)
- Blocking duplicate content from getting crawled (though I would suggest you use canonical tags for this)
- Blocking search bots from crawling specific content assets available on your website (e.g. PDFs, images, videos, etc.)
- Mentioning the sitemap and pointing the search crawlers to it
- Saving your crawl budget by hiding the less important pages (e.g. policy page, terms and conditions page, etc.)
- Ensuring your web server doesn’t get overloaded by web crawlers
D. Understanding the basic format of robots.txt file
In its simplest form, the robots.txt file looks as follows:-
```
User-agent: [User Agent Name]
[Directive #1]
[Directive #2]
[More Directives]

User-agent: [User Agent Name2]
[Directive #1]
[Directive #2]
[More Directives]

Sitemap: [Complete URL of your sitemap]
```
If you observed carefully, there are 2 major blocks in the robots file. Let's discuss each of them one by one.
User-agent refers to the web crawler for which the rule has been written.
Here are some of the famous user agents:-
- Google search: Googlebot
- Google News: Googlebot-News
- Google Adsense: Mediapartners-Google
- Bing: Bingbot
- Yahoo: Slurp
- DuckDuckGo: DuckDuckBot
- Yandex: YandexBot
- Alexa: ia_archiver
Do note that the paths you write in directives are case sensitive, though major crawlers match user-agent names case-insensitively.
Also, each user agent should have its own set of directives that define the way that a particular crawler should act.
If you want to have common rules for all the agents, it’s best to use an asterisk (*) under user-agent.
There are 3 major directives in use today: Disallow, Allow, and Sitemap.
The Disallow directive defines which pages should be left out by that particular user agent. It works both at a specific URL level and at the directory level.
For example, the below code is a directive to all bots not to crawl the policies folder of the website.
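A minimal sketch (the folder name /policies/ is assumed for illustration):

```
User-agent: *
Disallow: /policies/
```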
If you want just one URL in a subfolder to be crawled, you can use the Allow directive.
For example, the below code is a directive to all bots not to crawl the policies folder of the website, except refund policy.
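A sketch of the same rule with an exception (the folder and page names are assumed for illustration):

```
User-agent: *
Disallow: /policies/
Allow: /policies/refund-policy/
```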
Mentioning the sitemap in the robots file is optional but a good practice to follow.
Robots txt is the first file that the search crawler looks for after landing on your website.
The availability of the sitemap URL makes the crawler’s job easy as it can use the sitemap to build an understanding of your website content.
Just FYI, Google can discover your sitemap via Search Console, but listing it here is good practice for other user agents.
E. Robots.txt file examples
Here are a few examples of robots.txt files:-
Example 1: Allowing all web crawlers to access all parts of the website
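One common form: an empty Disallow value matches nothing, so everything may be crawled.

```
User-agent: *
Disallow:
```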
Example 2: Blocking all web crawlers from the website
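A single slash matches every path on the site, blocking all of it.

```
User-agent: *
Disallow: /
```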
Example 3: Blocking all web crawlers from accessing the staging server, and Pingdom bot from accessing the entire website.
```
# Block all crawlers from the staging folder (path assumed for illustration)
User-agent: *
Disallow: /staging/

# Block specific robot completely
User-agent: Pingdom
Disallow: /
```
F. Where does robots.txt file reside?
Having a robots.txt file on your server is not enough by itself. It needs to reside in the root folder of your website. Placing it in a subfolder renders it useless.
That means you should be in a position to see your file content when you input your root domain followed by /robots.txt (case sensitive).
If you encounter a 404 error, or the URL to access your robots file differs from the one above, your robots file is not set up properly.
G. Does robots.txt ensure no crawl by Google?
Please note that robots.txt file is not a tool to hide your website from Google search results.
Google can still index that page if it is linked to from other web pages. This has been clearly mentioned in Google's documentation.
Source: Google Search Console
If you want a page to stay out of Google's index, use a noindex meta directive instead, and make sure that page is not blocked in robots.txt, or Google will never see the directive.
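For reference, a noindex directive placed inside a page's <head> looks like this:

```html
<meta name="robots" content="noindex">
```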
Here's how your web page can look on SERPs in case Google indexes it despite being disallowed via the robots file.
If you want more information on how Google interprets the robots.txt file, check Google's official documentation.
H. How to create a robots.txt file for WordPress?
- Using plugins
- Manual process
Simply open a text editor such as Notepad, write the instructions in it, and save the file as robots.txt.
Next, upload it on your server in the root directory.
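The manual step can also be scripted. A minimal sketch (the allow-all rules are just an example):

```python
# Write a minimal allow-all robots.txt locally;
# the result still needs to be uploaded to the web root.
rules = "User-agent: *\nDisallow:\n"

with open("robots.txt", "w") as f:
    f.write(rules)
```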
I. Best practices for robots.txt file
- There should be only one robots file on your website, and it must be within the root folder.
- This is a case sensitive file i.e. /robots.txt and /Robots.txt are 2 distinct files, with the former being the right one.
- The maximum file size Google allows for robots.txt is 500 KiB. Everything beyond this size gets ignored.
- You can’t compress a robots.txt file. You have to optimize the directives to reduce its size.
- There should be a separate robots.txt file for every subdomain.
- It’s a best practice to write each instruction on a separate line.
- Make the job of search bots easy by including your sitemap(s) path within the robots file.
- Any text mentioned after a hash (#) represents a comment in that line.
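Several of these practices combined in one short file (the domain is assumed for illustration):

```
# Allow all crawlers
User-agent: *
Disallow:

# Point crawlers to the sitemap
Sitemap: https://www.example.com/sitemap.xml
```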
J. Final words
Robots.txt file is simple yet extremely powerful.
One mistake can hide your entire website from search engine results.
Therefore, tread cautiously when making changes to this file.