Robots.txt is a critical tool in an SEO’s arsenal, used to establish rules that instruct crawlers and robots about which sections of a site should and shouldn’t be crawled. However, when it comes to editing a robots.txt file, we need to remember that with great power comes great responsibility, because even a small mistake could potentially deindex an entire site from search engines.
Given how important it is that a site’s robots.txt file is set up correctly, I quizzed our Professional Services team to uncover some common mistakes you’ll want to avoid so you can ensure search engines and other bots can crawl the pages that you want them to.
1. Not repeating general user-agent directives in specific user-agent blocks
Search engine bots will only adhere to the closest matching user-agent block in a robots.txt file; all other user-agent blocks will be ignored.
In this example, Googlebot would only follow the single rule specifically stated for Googlebot, and ignore the rest.
User-agent: *
Disallow: /something1
Disallow: /something2
Disallow: /something3

User-agent: Googlebot
Disallow: /something-else
Given this, when you add rules for a specific bot, it is important to repeat any general user-agent directives that should also apply to it within that bot’s own block.
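For example, to keep the general disallow rules in force for Googlebot while adding its extra rule, the Googlebot block needs to repeat them (the placeholder paths are the same as above):

User-agent: *
Disallow: /something1
Disallow: /something2
Disallow: /something3

User-agent: Googlebot
Disallow: /something1
Disallow: /something2
Disallow: /something3
Disallow: /something-else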
2. Forgetting that the longest matching rule wins
When an allow rule and a disallow rule both match a URL, the rule whose matching path is longer (i.e. contains more characters) wins, so an allow rule will only take effect if it is longer than the conflicting disallow rule.
For example:
Disallow: /somewords
Allow: /someword
In the above example, example.com/somewords will be disallowed, as there are more matching characters in the disallow rule.
However, you can trick this specification by adding extra wildcard (*) characters to make the allow rule longer, as in this example:
Disallow: /somewords
Allow: /someword*
3. Adding wildcards to the end of rules
The * wildcard character doesn’t need to be added to the end of a rule in robots.txt (unless you’re using it to make that rule the longest matching one, as above), because rules are broad matching at the end by default. While this doesn’t usually cause any problems, it may cause you to lose the respect of colleagues and family members.
Disallow: /somewords*
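The trailing wildcard is simply redundant; because rules are broad matching by default, this shorter rule matches exactly the same URLs:

Disallow: /somewords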
4. Not using separate rules for each subdomain and protocol
Robots.txt files should avoid including rules spanning different subdomains and protocols. Each subdomain and protocol on a domain requires its own separate robots.txt file. For example, separate robots.txt files should exist for https://www.example.com and http://www.example.com, as well as for subdomain.example.com.
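In practice, each of those hosts serves its own file at its own URL, and each file only governs the URLs of that host:

https://www.example.com/robots.txt
http://www.example.com/robots.txt
https://subdomain.example.com/robots.txt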
5. Including relative sitemap directive URLs
In a robots.txt file, a sitemap cannot be referenced using a relative path; the URL must be absolute. For example, /sitemap.xml would not be respected, but https://www.example.com/sitemap.xml would.
Sitemap: /sitemap.xml
Sitemap: https://www.example.com/sitemap.xml
6. Ignoring case sensitivity
Matching rules in robots.txt are case-sensitive, which means you will need to implement multiple rules in order to match different cases.
Disallow: /something
Disallow: /Something
7. Adding a non-existent trailing slash
Make sure not to add a trailing slash to a rule in robots.txt when the URL doesn’t have one, as the rule won’t be matched. For example, disallowing /path/ when the actual URL is /path means that www.example.com/path will not be matched, and therefore not disallowed.
Disallow: /path/ # will not match /path
Disallow: /path # matches /path (and anything beginning with /path)
8. Not starting a disallow rule with a slash
If you’re specifying a path from the root in the robots.txt file, you should start the rule with a slash, not a wildcard, to avoid accidentally disallowing deeper paths that also contain the string.
This rule would only disallow URLs whose path starts with /something, e.g. www.example.com/something.
Disallow: /something
This rule would disallow every URL which contains ‘something’, e.g. www.example.com/stuff/something-else.
Disallow: *something
9. Forgetting that Googlebot user agents can fall back to more generic user agent tokens
Googlebot user agents will fall back to the more generic user agent token if there are no specific blocks included for that particular one. For example, googlebot-news will fall back to googlebot if there are no specific blocks for googlebot-news.
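For example, with only the block below in place (placeholder path again), Googlebot-News would fall back to the Googlebot block and follow its rules:

User-agent: Googlebot
Disallow: /something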
Google has published a full list of which user agent tokens apply to which crawlers.
10. Matching encoded URLs to unencoded rules
Encoded URLs will match unencoded rules; however, unencoded URLs will not match encoded rules. Make sure to keep your rules unencoded, at least according to the robots.txt testing tool.
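For illustration, following the behaviour described above (the accented path is just a made-up example):

Disallow: /münchen # unencoded rule: would also match the encoded URL /m%C3%BCnchen
Disallow: /m%C3%BCnchen # encoded rule: would not match the unencoded URL /münchen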
Even if you’re an experienced SEO, we hope that the above has unearthed some points about the Robots Exclusion Standard that you didn’t know previously. If you’re interested in learning more about this topic, you might like to read our introductory guide to robots.txt; we’ve also written about how robots.txt noindex used to work and alternative options for controlling indexing.