Disallow Directives in Robots.txt
The disallow directive (added within a website’s robots.txt file) instructs search engines not to crawl a page on a site. This will normally also keep the page out of search results, although a disallowed URL can still be indexed if other pages link to it.
Within the SEO Office Hours recaps below, we share insights from Google Search Central on how they handle disallow directives, along with SEO best practice advice and examples.
For more on disallow directives, check out our article, Noindex, Nofollow & Disallow.
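For reference, here is a minimal robots.txt sketch using a hypothetical /private/ directory; it blocks all crawlers from that path while leaving the rest of the site crawlable:

```
# Hypothetical example: block all crawlers from /private/, leave everything else crawlable
User-agent: *
Disallow: /private/
```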
Site Removal Request is Fastest Way to Remove Site From Search
Disallowing a whole site won’t necessarily remove it from search. If the site has links pointing to it, Google may still index pages based on information from the anchor text. The fastest way to remove a site from search is to use the site removal request in Search Console.
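For context, disallowing a whole site looks like the sketch below. This only blocks crawling; as noted above, URLs can still end up indexed based on external links and anchor text.

```
# Blocks crawling of every URL on the site, but does not guarantee removal from search
User-agent: *
Disallow: /
```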
Disallowed Pages May Take Time to be Dropped From Index
Disallowed pages may take a while to be dropped from the index if they aren’t crawled very frequently. For critical issues, you can temporarily remove URLs from search results using Search Console.
Redirecting Robots.txt Is OK
Google will follow redirects for robots.txt files.
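As a hedged illustration (assuming an nginx server and example.com as a placeholder domain), a robots.txt redirect of the kind Google can follow might be configured like this:

```
# Hypothetical nginx sketch: 301 the HTTP robots.txt to its HTTPS equivalent
server {
    listen 80;
    server_name example.com;

    location = /robots.txt {
        return 301 https://example.com/robots.txt;
    }
}
```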
Disallowing Internal Search Pages Won’t Impact the Sitelinks Search Box Markup
Internal search pages on a site do not need to be crawlable for the Sitelinks Search Box markup to work. Google doesn’t differentiate between desktop and mobile URLs in the markup, so you might want to set up a redirect to the mobile search pages for mobile devices.
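For reference, one common form of the Sitelinks Search Box markup is the JSON-LD sketch below (example.com and the /search?q= pattern are placeholder values); the /search URL it references can be disallowed in robots.txt without breaking the feature.

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "url": "https://www.example.com/",
  "potentialAction": {
    "@type": "SearchAction",
    "target": "https://www.example.com/search?q={search_term_string}",
    "query-input": "required name=search_term_string"
  }
}
</script>
```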
Interstitials Blocked with Robots.txt Might be Seen as Cloaking
You can prevent Google from seeing an interstitial that is run by JavaScript by blocking the JavaScript file with robots.txt, but Google doesn’t recommend this as it might be seen as cloaking.
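For illustration, blocking the interstitial’s script would look something like the sketch below (the file path and user agent targeting are hypothetical); again, this is the approach Google advises against, as it can look like cloaking.

```
# Hypothetical: stop Googlebot from fetching the script that injects the interstitial
User-agent: Googlebot
Disallow: /assets/js/interstitial.js
```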
Server Performance and Robots.txt Can Impact HTTPS Migrations
An HTTPS migration might have problems if Google is unable to crawl the site due to server performance issues or files blocked in robots.txt.
Don’t Disallow a Migrated Domain
If you disallow a migrated domain, Google can’t follow any of its redirects, so backlink authority cannot be passed on.
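In other words, the old domain’s robots.txt should leave the site crawlable so Google can reach and follow the redirects. An allow-all sketch for the migrated (old) domain might look like this:

```
# Keep the old domain crawlable so Google can follow its redirects to the new domain
User-agent: *
Disallow:
```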
Use Noindex or Canonical on Faceted URLs Instead of Disallow
John recommends against using a robots.txt disallow to stop faceted URLs from being crawled, as they may still be indexed. Instead, allow them to be crawled and use a noindex or canonical tag, unless the faceted URLs are causing server performance issues.
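As a sketch of the alternatives mentioned above (using a hypothetical faceted category on example.com), a crawlable facet page could carry either a noindex robots meta tag or a canonical tag pointing at the unfaceted page:

```
<!-- Option 1: allow crawling, but keep the faceted page out of the index -->
<meta name="robots" content="noindex">

<!-- Option 2: allow crawling, and consolidate signals to the unfaceted category page -->
<link rel="canonical" href="https://www.example.com/shoes/">
```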
Robots.txt Overrides Parameter Settings
URL Parameter settings in Search Console are a hint for Google, which will validate them periodically. A robots.txt disallow overrides the parameter settings, so it’s better to use the parameter tool to consolidate duplicate pages instead of disallowing them.
Only Disallowed Scripts Which Affect Content Are an Issue
Disallowed scripts that are flagged as errors are only an issue if they affect the display of content you want indexed; otherwise, it’s OK to leave them disallowed.