How to Optimize Robots.txt: A Detailed Guide

Robots.txt is a file located in the root directory of a domain that contains directives for search engine bots.

When optimized properly, this file prevents bots from crawling directories that shouldn’t appear in search results and, at the same time, saves crawl budget.

Now, when we mention search engine bots, most people think of Googlebot.

However, the robots.txt file gives instructions to all bots on the web, including those from Bing, Yandex, and others.

In this article, we will explain some facts about the robots.txt file and then show you how to optimize it.

Please note that optimizing the robots.txt file requires some technical knowledge, so if you are a beginner, avoid changing the instructions inside this file (at least on a live website), as mistakes may seriously hurt your rankings.

Why is Robots.txt important?

To get your website indexed, search engines first must crawl your website and discover its pages. Or in short, they must know that your website exists.

To perform the process of crawling your website, search engines use bots. 

The robots.txt file comes in handy here to prevent them from crawling unnecessary directories or URLs.

Crawling a website, however, requires resources, and those resources are limited.

So each search engine assigns a dedicated portion of its resources to each website to prevent server overload and avoid wasting the so-called crawl budget.

Is it possible to know what crawl budget is assigned to your website? Not exactly; no search engine publishes a number saying “our bot can crawl an xxx number of pages per day on your website”.

For example, if you have a blog where you publish or edit your content once per week, Google will assign you a small portion of its crawl resources. If you, on the other hand, have a News website where you update your content on a daily basis, the crawl budget in this case will be larger as your website has to be crawled daily for any content updates.

You get the point: the more frequently you update your website, the more often search engine bots will visit it to discover new or updated content.

What about wasting the crawl budget? We will use an eCommerce store as an example.

Why? Because eCommerce website owners often forget to properly optimize their robots.txt file. We will show you where they go wrong.

Let’s say you have an eCommerce website with 10 categories and dynamic filters, and 100 different products in various colors, sizes, etc.

In that case, when you start exploring the shop, looking at products, and using the dynamic filters, everything looks fine and user-friendly. However, with each new filter you apply, a new URL is generated in the browser.

Something like this:

https://mycoolstore.com/shop/shoes?color=red&size=41

Now, the main URL of that category looks like this:

https://mycoolstore.com/shop/shoes

Both of these URLs lead to the same page on your website, and you may not even be redirected to a new page after applying a filter. To search engine bots, however, these URLs are two separate pages.

With such an approach, your website may generate an unlimited number of different URLs with various combinations of filters.

This is the major problem most website owners miss when building an eCommerce site: it wastes crawl budget and gets pages into Google’s index that shouldn’t be there.

The main point is to focus on quality pages and give them priority, so Google spends its crawl budget and time discovering your important pages rather than endless filter URLs.
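
As a preview of the fix (the wildcard syntax is explained in detail in the URL paths section later in this article), here is a minimal robots.txt sketch for the hypothetical store above. It assumes the filtered URLs always live under /shop/ and always carry a query string, while the clean category and product URLs do not:

User-agent: *

Disallow: /shop/*?

With this rule, https://mycoolstore.com/shop/shoes stays crawlable, while filter combinations like https://mycoolstore.com/shop/shoes?color=red&size=41 are blocked from crawling.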

How robots.txt works

It’s simple: the robots.txt file tells search engine bots where to go and where not to go.

This way, you protect your website from wasting crawl budget and, of course, you avoid getting the wrong pages into Google’s index.

So, what does the robots.txt look like?

Here’s an example of a default robots.txt file:

User-agent: *

Disallow: /wp-admin/

Allow: /

This example comes from a WordPress website; depending on the platform, the robots.txt file will have a different set of rules. We will talk more about them a bit later.

Where to find the robots.txt file

This file is located in the root directory of the domain. If you put it in a subdirectory, web crawlers won’t look for it there, nor will they follow its instructions.

We will use a few examples to show you where to find it. Let’s say you have a website (the topic doesn’t matter) with the address https://mycoolstore.com.

In that case, you can find the robots.txt file at the following URL: https://mycoolstore.com/robots.txt.

However, there are a few tricky things you should keep in mind.

https://mycoolstore.com/robots.txt is valid only for the https://mycoolstore.com/ domain. If you have subdomains like https://blog.mycoolstore.com or https://news.mycoolstore.com, you will need a separate robots.txt file for each subdomain.

The same thing goes for www and non-www versions of the website. The URLs

https://mycoolstore.com/

and https://www.mycoolstore.com/ 

are considered two different domains, so be careful when adding robots.txt. Make sure you first choose the primary version of your domain, then configure the robots.txt file.

The HTTP and HTTPS versions of a website behave the same way. Let’s see how.

https://mycoolstore.com/

https://www.mycoolstore.com/

http://mycoolstore.com/ 

http://www.mycoolstore.com/ 

The above examples show different combinations of www, non-www, HTTP, and HTTPS versions of the website.

So, if you put robots.txt at https://mycoolstore.com/robots.txt, it is valid only for that version of the domain.

For the other three versions of the domain,

https://www.mycoolstore.com/

http://mycoolstore.com/ 

http://www.mycoolstore.com/ 

the robots.txt file is not valid.

How many robots.txt files are on the website?

One domain supports only one robots.txt file. You can’t have multiple robots.txt files on the website.

However, if you have multiple subdomains (each served from its own root), you can have one robots.txt file per subdomain.

For example, you can have different robots.txt files for these URLs

  • https://blog.example.com/robots.txt
  • https://news.example.com/robots.txt
  • https://example.com/robots.txt

But for a URL like https://example.com/blog, you can’t have a separate robots.txt file because this is just a subdirectory of the domain.

Robots.txt instructions

Let’s break down how search engine bots interpret certain instructions and how you can read the robots.txt file.

The instructions are:

  • User-agent – specifies which search engine bot the rules apply to. If you use the * symbol, the instructions apply to all bots.
  • Allow – allows search engine bots to crawl this directory of the website.
  • Disallow – prevents search engine bots from crawling this directory.
  • Sitemap – a link to the sitemap.xml file.
  • Comments – not a directive; a comment starts with a # symbol, so search crawlers will ignore it.
  • Crawl-delay – not supported by all search engines. Used to tell a crawler how long to wait before requesting a new URL.

Sounds simple, however, let’s explore how each of these directives works in practice.

User-agent

This instruction specifies which web crawler should follow the rules that come after it.

Some of the most popular web crawlers are:

  • Google: Googlebot
  • Bing: Bingbot
  • Yahoo: Slurp
  • DuckDuckGo: DuckDuckBot
  • Baidu: Baiduspider
  • Yandex: YandexBot

Some of these web crawlers have their own, let’s say, sub-crawlers. We’ll name a few of them: Googlebot-News, AdsBot-Google, Baiduspider-image, etc.

How to use User-agent instructions inside the robots.txt file?

You can use User-agent multiple times in one robots.txt file and give different instructions for each web crawler. By default, most website platforms create this file with directions regarding all web crawlers.

When and why should you give instructions for different web crawlers?

Let’s say you want your website indexed by Google, Bing, etc. but you don’t want to see it crawled and indexed by Yandex and Baidu. In that case, you will need to stop those two from crawling your website.

Your robots.txt would then look like this:

User-agent: *

Allow: /

User-agent: Baiduspider

Disallow: /

User-agent: YandexBot

Disallow: /

As you can see, Baidu and Yandex are blocked from crawling your website, while other web crawlers online are free to proceed.

You can also go into more detail when giving instructions to web crawlers. As we mentioned, some search engines have multiple crawlers.

For example, Google has several additional crawlers:

  • AdsBot-Google-Mobile
  • Mediapartners-Google
  • Googlebot-Image/1.0
  • Googlebot-News
  • Googlebot-Video/1.0
  • etc.

So, besides giving different instructions across web crawlers from various search engines, you can give specific instructions for various Google web crawlers too.

Search engine crawlers follow the group of instructions listed below the User-agent line that matches them. So if you are creating different sets of instructions for different web crawlers, make sure each set starts with its own User-agent line.
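
For example, here is a minimal sketch that lets the regular Googlebot crawl everything while keeping Googlebot-Image out of a hypothetical /private-photos/ directory (the directory name is just a placeholder):

User-agent: Googlebot

Allow: /

User-agent: Googlebot-Image

Disallow: /private-photos/

Crawlers that don’t match either group simply fall back to their default behavior and crawl everything, since no rules are defined for them.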

Allow

The instruction Allow permits search engine bots to crawl a certain directory.

For example, to allow bots to crawl your website, just use the simple directive below:

Allow: /

The single slash tells web crawlers they can access the whole website. You can also forbid crawlers from accessing specific directories; we will explore how to do that next.

Disallow

When you want to prevent search engine bots from crawling specific paths of your website, use the Disallow instruction.

We will use WordPress as an example. The default WordPress robots.txt file looks something like this:

User-agent: *

Allow: /

Disallow: /wp-admin/

In this case, the crawlers are instructed to avoid crawling the /wp-admin/ path of your website.

Sitemap

Despite being optional, it is recommended to have a link to the sitemap.xml inside your robots.txt file. This way you make it easier for search engines to discover content on your site.

If you already submitted your sitemap through the Google Search Console, then having a URL inside the robots.txt file may be unnecessary.

However, the advice is to keep the URL inside the robots.txt file so the other search engine bots can find and index the important pages on your website.

The Sitemap directive is usually placed at the bottom of the robots.txt file, separated by an empty line.

The structure of that line is simple and looks like this

Sitemap: https://mycoolstore.com/sitemap.xml

However, this is just an example; depending on the platform, sitemap.xml doesn’t have to sit in the root folder the way robots.txt does.

A robots.txt file can also contain links to multiple sitemaps; it is not limited to one. So if you have different sitemaps for categories, products, CMS pages, etc., include them all.
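
For example, a robots.txt file that references several sitemaps (using the hypothetical store from earlier; the sitemap file names are placeholders, so use whatever your platform actually generates) could end like this:

Sitemap: https://mycoolstore.com/sitemap-categories.xml

Sitemap: https://mycoolstore.com/sitemap-products.xml

Sitemap: https://mycoolstore.com/sitemap-pages.xml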

Keep in mind that other robots.txt directives only show the paths that should or should not be accessed. When it comes to the sitemap directive, enter the full URL, not just a path.

For example

Sitemap: https://mycoolstore.com/sitemap.xml is correct.

Sitemap: sitemap.xml is incorrect.

Comments

You can also add comments to the robots.txt file. A comment starts with a hash (#); make sure to include it at the beginning of the line so search engine crawlers know to ignore it.

Comments are mostly used to leave notes about the instructions below them, in case somebody else needs to change the file.

Or you can use it just for fun. It’s up to you!
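
For example, a commented version of the WordPress file from earlier might look like this:

# Keep all crawlers out of the admin area

User-agent: *

Disallow: /wp-admin/

# This line is just a note, and crawlers will skip it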

Crawl-delay

This directive is a bit specific: Googlebot doesn’t recognize it, while other crawlers such as Bingbot and Slurp do.

What is crawl-delay for? This instruction tells crawlers to slow down to avoid overloading your server, which is especially useful on large websites.

However, if you add this instruction when optimizing robots.txt for Googlebot, it will simply be ignored. Google does this because you can adjust crawl intensity through Google Search Console, so adding it to the robots.txt file is unnecessary.

The same principle applies to the Baidu search engine.

Search engines that do support this directive include Bing, Yahoo, and Yandex.

The best way to explain this is through an example.

If you set the directive like this

Crawl-delay: 5

it means that search engine crawlers will wait 5 seconds before they request another URL. Keep in mind that this is an advanced technique and doesn’t mean you necessarily need to add it to the robots.txt.

How to create a robots.txt file

Creating the robots.txt file is simple; you just need an ordinary text editor on your PC. Also, keep in mind that some platforms already come with a default robots.txt file, in which case you won’t have to create a new file or modify the current one.

Let’s see how you can create it manually.

Depending on your operating system, open a plain text editor (Notepad, for example).

Now, in your newly opened window enter the following lines:

User-agent: *

Allow: /

Sitemap: (enter URL of your sitemap here).

If you don’t have a sitemap.xml file created for your website, don’t worry. You can easily create and upload it later, then edit the robots.txt.

Once you enter those lines, save the file as robots.txt (make sure the extension is .txt, as web crawlers won’t recognize other formats).

This is the simplest form, allowing all online crawlers to go through your website.

Robots.txt file examples

As we said before, some platforms already come with a default robots.txt file. You can optimize it further according to your needs, or, if you can’t find robots.txt on your website, copy and paste one of the examples below.

Robots.txt file in WordPress

WordPress comes with a very simple default robots.txt file.

The default file looks something like this

User-agent: *

Disallow: /wp-admin/

Allow: /wp-admin/admin-ajax.php

You can see there is no link to the sitemap file. In this case, you have to edit robots.txt manually and add the sitemap URL.

Besides that, you can slightly modify the default file to make it look like this:

User-agent: *

Disallow: /wp-admin/

Allow: /

Sitemap: https://mycoolstore.com/sitemap.xml

If your WordPress website doesn’t have this file for some reason, you can copy the example above (with the sitemap line adjusted to your own URL).

Of course, you can also use plugins such as Yoast SEO or Rank Math to create the robots.txt file on your WordPress website. They let you create and customize it through the WP dashboard without having to go into your hosting account.

Robots.txt file in Adobe Commerce (formerly Magento)

Adobe Commerce (formerly Magento) is a great platform for large and complex eCommerce websites, especially multilingual ones.

It already comes with default robots.txt instructions, so at first you just have to configure whether you want your website to be indexed or not (depending on the development phase).

To configure robots.txt in Magento, go to Content > Design > Configuration.

Now find the Store you want to configure, then go to the Search Engine Robots section.

There you will find the Default Robots dropdown with the robots.txt instructions field below it. The Default Robots setting has four options:

  1. Index, Follow
  2. Index, Nofollow
  3. Noindex, Follow
  4. Noindex, Nofollow

After you select the one you need, click Save and flush the cache.

As for the robots.txt file instructions, the default settings are:

Disallow: /lib/

Disallow: /*.php$

Disallow: /pkginfo/

Disallow: /report/

Disallow: /var/

Disallow: /catalog/

Disallow: /customer/

Disallow: /sendfriend/

Disallow: /review/

Disallow: /*SID=

For catalog search pages

Disallow: /catalogsearch/

Disallow: /catalog/product_compare/

Disallow: /catalog/category/view/

Disallow: /catalog/product/view/

For URL filter searches

Disallow: /*?dir*

Disallow: /*?dir=desc

Disallow: /*?dir=asc

Disallow: /*?limit=all

Disallow: /?mode

For common folders

Disallow: /app/

Disallow: /bin/

Disallow: /dev/

Disallow: /lib/

Disallow: /phpserver/

Disallow: /pub/

Checkout

Disallow: /checkout/

Disallow: /onestepcheckout/

Disallow: /customer/

Disallow: /customer/account/

Disallow: /customer/account/login

Common files

Disallow: /composer.json

Disallow: /composer.lock

Disallow: /CONTRIBUTING.md

Disallow: /CONTRIBUTOR_LICENSE_AGREEMENT.html

Disallow: /COPYING.txt

Disallow: /Gruntfile.js

Disallow: /LICENSE.txt

Disallow: /LICENSE_AFL.txt

Disallow: /nginx.conf.sample

Disallow: /package.json

Disallow: /php.ini.sample

Disallow: /RELEASE_NOTES.txt

Technical Magento files

Disallow: /api.php

Disallow: /cron.php

Disallow: /cron.sh

Disallow: /error_log

Disallow: /get.php

Disallow: /install.php

Disallow: /LICENSE.html

Disallow: /LICENSE.txt

Disallow: /LICENSE_AFL.txt

Disallow: /README.txt

Disallow: /RELEASE_NOTES.txt

Of course, if you have other custom directories you want to keep out of crawling, or you want to prevent crawling of URLs generated by dynamic filters, you’ll have to edit the file manually. You can read more about this in the URL paths section below.
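
As a quick illustration (the wildcard syntax is covered in the URL paths section below), hypothetical extra rules for a custom /private-downloads/ directory and a color filter from layered navigation, added to the existing group, might look like this:

Disallow: /private-downloads/

Disallow: /*?color=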

Restricting specific crawl bots from accessing your website

As we mentioned above, you can allow Googlebot and Bingbot to crawl your website while restricting other bots from doing so.

We’ll go through a few examples showing how you can do this.

Here’s an example of a robots.txt file that blocks the Yandex and Baidu web crawlers:

User-agent: *

Allow: /

User-agent: Baiduspider

Disallow: /

User-agent: YandexBot

Disallow: /

Sitemap: https://mycoolstore.com/sitemap.xml

Let’s explain it. The first group says that the instructions apply to all search engines; however, the second and third User-agent groups block Baidu and Yandex from crawling. With this approach, you allow all web crawlers except those two, which is easier than writing instructions for each web crawler separately.

Here’s an example of a robots.txt file with instructions for multiple crawlers, including a Crawl-delay directive:

User-agent: *

Allow: /

User-agent: Bingbot

Allow: /

Crawl-delay: 5

Explanation: in this case, all web crawlers are allowed to go through the website, but Bingbot is instructed to wait 5 seconds before requesting the next URL.

If you want to add a delay for Yahoo to the example above, insert the following lines:

User-agent: Slurp

Allow: /

Crawl-delay: 10

The crawl delay in this case tells the Yahoo bot to wait 10 seconds before requesting another URL.

Of course, in each of these examples, don’t forget to include sitemap(s) at the end of the document.

How to structure your robots.txt file

When creating your first robots.txt file, you should pay attention to how you structure it.

For example, you may do everything right regarding the instructions and sitemap. However, if you put everything in one line like this

User-agent: * Allow: / Disallow: /wp-admin/ Sitemap: https://mycoolstore.com/sitemap.xml

or

User-agent: * Allow: / Disallow: /wp-admin/ 

Sitemap: https://mycoolstore.com/sitemap.xml

the web crawlers won’t understand this and will ignore your robots.txt file. It will look like it doesn’t exist.

The proper structure requires one directive per line, like this:

User-agent: * 

Allow: / 

Disallow: /wp-admin/ 

Sitemap: https://mycoolstore.com/sitemap.xml

Small mistakes like this can have a huge impact on website crawlability. Always double-check your robots.txt file before uploading or saving it.

URL paths and values

If you have built a complex website and want certain paths and URLs kept out of search results, or you simply want to avoid wasting crawl budget, you will need some more advanced techniques in your robots.txt file.

But did you know that one missing symbol can give web crawlers completely different directions?

We will look at what you need to pay attention to and show you how the asterisk symbol (*) can act as a wildcard.

Disallow: /shoes

URLs blocked with this instruction could be

  • /shoes
  • /shoes/red
  • /shoes?=id_123
  • /shoes.php
  • /shoesman

Now, let’s make a small change to this rule:

Disallow: /shoes/

As you can see, we added a slash at the end. It looks unimportant, but the blocked URLs are now different. For example:

  • /shoes/red
  • /shoes/?=id_123
  • /shoes/red/41

You see, the whole directory is restricted.

However, the URLs like this

  • /shoes
  • /shoes?=id_123
  • /shoes.php
  • /shoesman

won’t be blocked.

In short, /shoes blocks any URL path that starts with that string, while /shoes/ blocks only what’s inside that folder.

What about wildcards?

You can use the asterisk (*) symbol as a wildcard. This is useful for blocking URLs with certain parameters (search results, dynamic filtering, etc.).

Disallow: /*?text-direction=rtl

This example comes from a website that lets visitors change the text direction.

With this directive, the URLs blocked from crawling are:

  • /red-men-shoes?text-direction=rtl
  • /black-shoes?text-direction=rtl
  • /womens-white-shoes?text-direction=rtl

However, the URLs

  • /red-men-shoes
  • /black-shoes
  • /women-white-shoes

are still crawlable. 

With this approach, you avoid duplicate content and duplicate pages in search results while saving crawl budget at the same time.

The same logic applies to blocking internal search results from crawling. The directive used on the same website is:

Disallow: */?s=

With this directive, any page whose URL contains the /?s= parameter won’t be crawled.

And that’s it for wildcards and parameters.

Here’s an example from a WordPress and WooCommerce website:

User-Agent: *

Allow: /wp-content/uploads/

Disallow: /wp-content/plugins/

Disallow: /wp-admin/

Disallow: /*?text-direction=rtl

Disallow: /*?text-direction=ltr

Disallow: */?s=

Disallow: */?

Here you can see directives that apply to all web crawlers. Pages with left-to-right and right-to-left parameters are blocked, as well as any search result URLs or other URLs with query strings.

How to check robots.txt

At any moment, you can check your robots.txt file. This is done through the official Google tester. However, you must have at least one property in Google Search Console, or you won’t be able to perform any test.

To test it, head to the robots.txt Testing Tool and select the desired property. Google will then show you the robots.txt file for that website (domain or subdomain).

When the editor loads, you can see what your current robots.txt looks like. Below the editor is a field for a URL; you don’t have to enter the full URL, just a path.

When you click on the red Test button, you will see a notification telling you whether that URL is allowed or restricted.

Besides that, you can also edit your robots.txt file inside the Google Robots testing tool and check how it will behave with a new instruction. But keep in mind that the editor won’t save those changes directly to your robots.txt file.

If you want to use the instructions from the Google robots tool, you have to copy them and then apply them to the file through your hosting account, plugin, or extension (depending on your platform).

Robots.txt recommendations

We covered the most important parts of the robots.txt configuration. However, there are still some recommendations and unanswered questions.

Is robots.txt case-sensitive?

Yes! Definitely.

If you take a look at Google’s documentation for the robots.txt file, you will see this mentioned.

For example, if you want to block your images directory from crawling, you will put something like this

Disallow: /img/

However, this won’t block the URLs 

  • /IMG/
  • /Img/
  • etc.

For this, you would need to add additional directives to the robots.txt file. But that won’t be necessary unless you actually have multiple directories like the ones above.
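
If you really did serve the same images under differently cased paths, the extra directives would simply list each variant:

Disallow: /img/

Disallow: /IMG/

Disallow: /Img/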

Robots.txt or noindex

This question bothers lots of people in SEO.

Many think that adding noindex to a URL combined with a Disallow directive in robots.txt will prevent the URL from appearing in search results.

But…

Combining the two doesn’t add any weight and won’t speed up deindexing of a webpage; quite the contrary.

These two shouldn’t be mixed: if robots.txt blocks a URL from being crawled, Google never sees the noindex tag on that page, so the page can remain indexed.

Also, imagine you have 100 URLs you don’t want to see in search results. Handling them through robots.txt would require constant updates and a new line for each of them.

Well, you get it.

The best practice is to use the noindex directive on the pages you want to keep out of search results, and keep the robots.txt directives for entire directories or parameters.

Robots.txt vs sitemap.xml difference

Compared to the robots.txt file, the sitemap.xml has a different purpose.

While robots.txt gives search engine crawlers instructions on which paths to crawl or skip, sitemap.xml contains the most important URLs on your website.

Usually, the sitemap.xml file contains links to website blog posts, categories, CMS pages, and products.

Of course, if you find it important to index author pages, you can include them too. If necessary, you can add tag pages, images, custom taxonomies, etc.

It is crucial to include all URLs you find important on your website.

Also, the main purpose of the sitemap.xml file is to make it easier for web crawlers to discover important pages of your website and index them.

Final words

In this article, you had the opportunity to learn more about the robots.txt file, how it works with different web crawlers, and which instructions you can use.

Use the examples above as a guideline when optimizing your robots.txt file, but keep in mind that this is a more advanced part of SEO.

You can reuse parts of the robots.txt examples above if they suit your needs; however, if you are not sure which instructions to use, start by testing them with the Google tool mentioned above or in a non-production environment of your website.

Avoid changing anything on the live website unless you’re 100% sure about what you are doing.
