Crawling
Before a page can be indexed (and therefore appear in search results), it must first be crawled by search engine crawlers like Googlebot. There are many things to consider to get pages crawled and to ensure they adhere to search engine guidelines. These are covered in our SEO Office Hours notes, along with further research and recommendations.
For more SEO knowledge on crawling and to optimize your site’s crawlability, check out Lumar’s additional resources.
Use rel="canonical" or robots.txt instead of nofollow tags for internal linking
A question was asked about whether it was appropriate to use the nofollow attribute on internal links to avoid unnecessary crawl requests for URLs that you don’t wish to be crawled or indexed.
John replied that it’s an option, but it doesn’t make much sense to do this for internal links. In most cases, it’s recommended to use the rel=canonical tag to point at the URLs you want to be indexed instead, or use the disallow directive in robots.txt for URLs you really don’t want to be crawled.
He suggested figuring out whether there is a page you would prefer to have indexed and, if so, pointing the canonical at that page; if crawling itself is causing problems, you could consider robots.txt instead. He clarified that with the canonical, Google would first need to crawl the page, but over time it would focus on the canonical URL and begin to use that primarily for crawling and indexing.
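As a rough illustration of the difference between the two options John mentions, here is a minimal diagnostic sketch (the URL is hypothetical and the HTML parsing is deliberately simplified): it checks whether a URL is disallowed for Googlebot in robots.txt and whether the page declares a rel="canonical" pointing somewhere else.

```python
import re
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import requests  # third-party; pip install requests

URL = "https://www.example.com/products/widget?sort=price"  # hypothetical URL

# 1. Is the URL disallowed for Googlebot in robots.txt?
parts = urlparse(URL)
robots = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
robots.read()
blocked = not robots.can_fetch("Googlebot", URL)

# 2. Does the page declare a canonical pointing at a different URL? (simplified parsing)
html = requests.get(URL, timeout=10).text
canonical = None
for tag in re.findall(r"<link\b[^>]*>", html, flags=re.I):
    if re.search(r"rel=[\"']?canonical", tag, flags=re.I):
        href = re.search(r"href=[\"']([^\"']+)[\"']", tag, flags=re.I)
        if href:
            canonical = urljoin(URL, href.group(1))
        break

print(f"Disallowed in robots.txt: {blocked}")
print(f"Canonical: {canonical or 'none declared'}")
if blocked:
    print("Google won't crawl this URL, so it can't see a canonical (or noindex) on it.")
elif canonical and canonical != URL:
    print("Google has to crawl the page first, but can consolidate onto the canonical over time.")
```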
APIs & Crawl Budget: Don’t block API requests if they load important content
An attendee asked whether a website should disallow subdomains that serve API requests, as these seemed to be taking up a lot of crawl budget. They also asked how API endpoints are discovered or used by Google.
John first clarified that API endpoints are normally used by JavaScript on a website. When Google renders the page, it will try to load the content served by the API and use it for rendering the page. It might be hard for Google to cache the API results, depending on your API and JavaScript set-up — which means Google may crawl a lot of the API requests to get a rendered version of your page for indexing.
You could help avoid crawl budget issues here by making sure the API results are cached well and don’t contain timestamps in the URL. If you don’t care about the content being returned to Google, you could block the API subdomains from being crawled, but you should test this out first to make sure it doesn’t stop critical content from being rendered.
John suggested making a test page that doesn’t call the API (or uses a broken URL for it) and seeing how the page renders, both in the browser and for Google.
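To make the two checks above concrete, here is a small sketch, assuming you can list the API URLs your pages call (the endpoints and the "ts" parameter below are hypothetical): it looks at the Cache-Control header of each response and flags timestamp-style query parameters that would make every request look like a new URL.

```python
import re

import requests  # third-party; pip install requests

API_URLS = [
    "https://api.example.com/v1/products?page=1",                # hypothetical endpoint
    "https://api.example.com/v1/products?page=1&ts=1718000000",  # volatile timestamp parameter
]

for url in API_URLS:
    response = requests.get(url, timeout=10)
    cache_control = response.headers.get("Cache-Control", "(none)")
    volatile = bool(re.search(r"[?&](ts|timestamp|date|time)=", url))
    print(url)
    print(f"  Cache-Control: {cache_control}")
    if volatile:
        print("  Warning: volatile query parameter; each render may trigger a fresh crawl of the API")
    if "no-store" in cache_control or "no-cache" in cache_control:
        print("  Warning: response is marked uncacheable")
```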
There are several possible reasons a page may be crawled but not indexed
John explains that pages appearing as ‘crawled, not indexed’ in GSC should be relatively infrequent. The most common scenarios are when a page is crawled and Google then sees an error code, or when a noindex tag is found after the page is crawled. Alternatively, Google might choose not to index content after crawling it if it finds a duplicate of the page elsewhere. Content quality may also play a role, but Google is more likely to avoid crawling pages altogether if it believes there is a clear quality issue on the site.
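For the first two scenarios, a quick triage is possible outside of GSC. The sketch below (with a hypothetical URL and deliberately simplified checks) fetches a page and reports its status code and any noindex directives in the HTTP headers or meta robots tag; duplication is a judgement Google makes on its side and can’t be checked this way.

```python
import re

import requests  # third-party; pip install requests

URL = "https://www.example.com/blog/old-post"  # hypothetical URL

response = requests.get(URL, timeout=10)
print(f"Status code: {response.status_code}")  # 4xx/5xx errors prevent indexing

header_noindex = "noindex" in response.headers.get("X-Robots-Tag", "").lower()
meta_noindex = bool(
    re.search(r"<meta[^>]+name=[\"']robots[\"'][^>]+noindex", response.text, flags=re.I)
)
print(f"noindex via X-Robots-Tag header: {header_noindex}")
print(f"noindex via meta robots tag: {meta_noindex}")
```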
How does Google handle infinite scrolling? Well, it depends…
One user asked whether Googlebot is advanced enough yet to handle infinite scrolling. John explains that Google renders pages using a fairly tall viewport, which usually means some amount of the infinite scrolling is triggered. However, it all depends on the implementation. The best way to check is to run the page through the URL Inspection tool to get a clear view of what Google is actually picking up.
Robots.txt file size doesn’t impact SEO, but smaller files are recommended
John confirmed that the size of a website’s robots.txt file has no direct impact on SEO. He does, however, point out that larger files can be more difficult to maintain, which may in turn make it harder to spot errors when they arise.
Keeping your robots.txt file to a manageable size is therefore recommended where possible. John also stated that there’s no SEO benefit to linking to sitemaps from robots.txt. As long as Google can find them, it’s perfectly fine to just submit your sitemaps to GSC (although we should caveat that linking to sitemaps from robots.txt is a good way to ensure that other search engines and crawlers can find them).
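A small monitoring sketch along these lines (the domain is hypothetical) reports how large a robots.txt file is and whether it references any sitemaps, which is the part that helps crawlers other than Google discover them:

```python
import requests  # third-party; pip install requests

ROBOTS_URL = "https://www.example.com/robots.txt"  # hypothetical domain

body = requests.get(ROBOTS_URL, timeout=10).text
lines = body.splitlines()
sitemaps = [line.split(":", 1)[1].strip() for line in lines if line.lower().startswith("sitemap:")]

print(f"robots.txt size: {len(body.encode('utf-8'))} bytes across {len(lines)} lines")
if sitemaps:
    print("Sitemap references:", ", ".join(sitemaps))
else:
    print("No Sitemap: lines found (fine for Google if submitted in GSC, but other crawlers may not find them)")
```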
Regularly changing image URLs can impact Image Search
A question was asked about whether query strings for cache validation at the end of image URLs would impact SEO. John replied that it wouldn’t affect SEO but explained that it’s not ideal to regularly change image URLs as images are recrawled and reprocessed less frequently than normal HTML pages.
Regularly changing the image URLs means it would take Google longer to re-find them and put them back in the image index. He specifically mentioned avoiding very frequent changes, such as appending a session ID or today’s date to the URL: in those cases the URLs would likely change more often than Google reprocesses them, so the images would not be indexed. If Image Search is important for your website, regular image URL changes should be avoided where possible.
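One way to reconcile cache validation with stable image URLs is to derive the query string from the file’s contents rather than from a date or session, so it only changes when the image itself changes. This is a sketch of that idea, not something prescribed in the Office Hours answer; the file path and URL are hypothetical.

```python
import datetime
import hashlib

IMAGE_PATH = "static/img/hero.jpg"                  # hypothetical local file
IMAGE_URL = "https://www.example.com/img/hero.jpg"  # hypothetical public URL

with open(IMAGE_PATH, "rb") as f:
    content_hash = hashlib.md5(f.read()).hexdigest()[:8]

stable_url = f"{IMAGE_URL}?v={content_hash}"                         # changes only when the file changes
volatile_url = f"{IMAGE_URL}?d={datetime.date.today().isoformat()}"  # changes every day

print("Stable:  ", stable_url)
print("Volatile:", volatile_url, "(likely to change faster than Google reprocesses images)")
```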
Crawl rate is not affected by a large number of 304 responses
A question was asked about whether a large number of 304 (Not Modified) responses could affect crawling. John replied that when Googlebot receives a 304, it can reuse the previously crawled version of that page and spend that crawl capacity elsewhere on the website, so it would not affect the crawl budget. If most pages on a website return a 304, the crawl rate wouldn’t be reduced; the focus would simply shift to the pages of the website where Google sees updates happening.
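For reference, a 304 comes out of a conditional request: the client sends back the validators it stored (ETag and/or Last-Modified), and the server answers 304 Not Modified with no body if nothing changed. A minimal sketch with a hypothetical URL:

```python
import requests  # third-party; pip install requests

URL = "https://www.example.com/category/shoes"  # hypothetical URL

first = requests.get(URL, timeout=10)
validators = {}
if "ETag" in first.headers:
    validators["If-None-Match"] = first.headers["ETag"]
if "Last-Modified" in first.headers:
    validators["If-Modified-Since"] = first.headers["Last-Modified"]

second = requests.get(URL, headers=validators, timeout=10)
print(second.status_code)  # 304 if unchanged, so the crawler can reuse its stored copy
```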
Blocking Googlebot in robots.txt does not affect AdsBot
A participant found that Googlebot was crawling their ad landing pages more than their normal pages. They asked if they could block Googlebot via robots.txt and whether doing so would impact their ad pages. John responded that blocking the ad landing pages for Googlebot is fine, but to make sure not to block AdsBot, as it is used to perform quality checks on those landing pages. He clarified that AdsBot doesn’t follow the generic robots.txt directives; to block it, its specific user agents would need to be named explicitly in the robots.txt file. Therefore, by just blocking Googlebot as suggested, AdsBot would still have access to those landing pages.
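The behaviour described above can be sketched with the standard library’s robots.txt parser: a group addressed to Googlebot doesn’t apply to AdsBot-Google, which has to be named explicitly to be blocked. (AdsBot’s additional habit of ignoring the generic User-agent: * group is Google-specific and not modelled by urllib.robotparser; the rules below are hypothetical.)

```python
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /landing/
"""  # hypothetical rules

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://www.example.com/landing/spring-sale"
print("Googlebot allowed:    ", parser.can_fetch("Googlebot", url))      # False
print("AdsBot-Google allowed:", parser.can_fetch("AdsBot-Google", url))  # True, until it is named explicitly
```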
Having a high ratio of ‘noindex’ vs indexable URLs could affect website crawlability
Having noindex URLs normally does not affect how Google crawls the rest of your website—unless you have a large number of noindexed pages that need to be crawled in order to reach a small number of indexable pages.
John gave the example of a website that has millions of pages with 90% of them noindexed. As Google needs to crawl a page first in order to see the noindex, it could get bogged down crawling millions of noindexed pages just to find the comparatively small number of indexable ones. If you have a normal ratio of indexable to noindexed URLs and the indexable ones can be discovered quickly, he doesn’t see this as an issue for crawlability. This is not due to quality reasons; it is more a technical issue caused by the high number of URLs that need to be crawled to see what is there.
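If you want to see where your own site sits on this spectrum, a rough sketch is to compute the ratio from a crawl export; the CSV file name and the "indexable" column below are hypothetical, so adjust them to whatever your crawler produces.

```python
import csv

indexable = 0
total = 0
with open("crawl_export.csv", newline="") as f:  # hypothetical export file
    for row in csv.DictReader(f):
        total += 1
        indexable += row["indexable"].strip().lower() == "true"  # hypothetical column

print(f"{indexable}/{total} crawled URLs are indexable ({100 * indexable / max(total, 1):.1f}%)")
```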
It can take years for crawling of migrated domains to stop completely
John confirmed that it takes a very long time (even years) for Google’s systems to completely stop crawling a domain, even after it has been redirected.