Crawling
Before a page can be indexed (and therefore appear within search results), it must first be crawled by search engine crawlers like Googlebot. There are many things to consider in order to get pages crawled and to ensure they adhere to the correct guidelines. These are covered within our SEO Office Hours notes, along with further research and recommendations.
For more SEO knowledge on crawling and to optimize your site’s crawlability, check out Lumar’s additional resources:
Use a Crawling Tool to Assess & Compare a Site Before & After a Migration
When planning a site migration, John recommends using a crawling tool to get a full picture of your site's status and signals (such as internal linking and canonicals) both before and after the migration, so the two states can be compared.
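As a rough illustration of that before/after comparison, the sketch below assumes you have two CSV exports from your crawling tool (the url, canonical_url and indexable column names are placeholders) and flags URLs whose signals changed after the migration:

```python
import csv

def load_crawl(path):
    """Load a crawl export as a dict keyed by URL (column names are assumptions)."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["url"]: row for row in csv.DictReader(f)}

# Placeholder filenames; adjust column names to match your crawling tool's export.
before = load_crawl("crawl_before_migration.csv")
after = load_crawl("crawl_after_migration.csv")

for url, old in before.items():
    new = after.get(url)
    if new is None:
        print(f"Missing after migration: {url}")
    elif new.get("canonical_url") != old.get("canonical_url"):
        print(f"Canonical changed: {url}")
    elif new.get("indexable") != old.get("indexable"):
        print(f"Indexability changed: {url}")
```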
Videos Blocking Googlebot May Still be Crawled and Indexed
Even if you block Googlebot from crawling a video, a video snippet may still appear in search if the video file is embedded from a different location, if some Google datacentres haven't yet seen the updated version, or if the video URL has parameters attached.
Internal Linking Causes Google to Crawl Canonicalised Pages
If you see Google crawling canonicalised pages, check your internal linking, as internal links pointing at those URLs can cause Google to keep crawling them.
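One hedged way to spot this is to fetch a page's internal links and check whether any destination declares a different canonical URL. The sketch below uses the requests and BeautifulSoup libraries; the start URL is a placeholder and only a single page is audited:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

START_URL = "https://www.example.com/"  # placeholder page to audit

host = urlparse(START_URL).netloc
html = requests.get(START_URL, timeout=10).text
links = {urljoin(START_URL, a["href"])
         for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)}

for link in sorted(links):
    if urlparse(link).netloc != host:
        continue  # only audit internal links
    page = BeautifulSoup(requests.get(link, timeout=10).text, "html.parser")
    canonical = page.find("link", rel="canonical")
    if canonical and canonical.get("href") and urljoin(link, canonical["href"]) != link:
        print(f"Internal link to canonicalised page: {link} -> {canonical['href']}")
```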
It’s Normal to See Fluctuations in GSC ‘Time Spent Downloading a Page’ Report
Seeing fluctuations in the ‘Time Spent Downloading a Page’ report in GSC is perfectly normal, as Googlebot sometimes discovers new areas of a site to crawl and can decide to crawl more URLs.
Googlebot Doesn’t Replay Cookies
If you provide cookies to Googlebot, it won't send them back when it returns to crawl your site. Bear this in mind when using cookies to group users for A/B testing, and make sure Googlebot is always put in the same group.
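Because cookies won't persist between Googlebot's visits, one approach (a sketch only, not something prescribed in the session) is to detect Googlebot by user agent and always serve it the same variant, while regular users keep a cookie-based bucket:

```python
import random

def ab_variant(user_agent: str, cookies: dict) -> str:
    """Pick an A/B test variant, pinning Googlebot to a fixed group.

    Googlebot does not replay cookies between crawls, so a cookie-based
    bucket would reshuffle it on every visit.
    """
    if "Googlebot" in user_agent:
        return "A"  # always the same group for Googlebot
    if "ab_variant" in cookies:
        return cookies["ab_variant"]  # returning users keep their bucket
    return random.choice(["A", "B"])  # new users get a bucket; persist it in a cookie
```

In production you would also want to verify Googlebot (for example via reverse DNS lookup) rather than trusting the user-agent string alone.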
Use ‘Validate’ Option In The New Search Console to Get Pages Recrawled
You can get Google to recrawl your pages by going to the indexing report in the new Search Console and requesting validation of the issues; Google will then recrawl these pages to check that they have been fixed.
Google Will Crawl Sitemaps That Have Been Removed from GSC
Removing an old sitemap file from GSC isn't enough to prevent it from being crawled; you also need to remove it from the server so Google can't find and crawl it. John recommends fixing the sitemap file if possible, though.
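To confirm the old sitemap really is gone from the server, its URL should return a 404 or 410 rather than the stale file. A minimal check with the requests library (the sitemap URL is a placeholder):

```python
import requests

OLD_SITEMAP_URL = "https://www.example.com/old-sitemap.xml"  # placeholder

resp = requests.get(OLD_SITEMAP_URL, timeout=10)
if resp.status_code in (404, 410):
    print("Old sitemap no longer exists on the server.")
else:
    print(f"Still reachable (HTTP {resp.status_code}); Google can keep finding and crawling it.")
```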
Ensure All Product Pages Can be Crawled With Considered Use of Noindex
eCommerce sites with faceted navigation should be careful about which pages are noindexed, because this can make it difficult for Googlebot to crawl individual product pages (e.g. noindexing all category pages). Webmasters might instead consider noindexing specific facets, or deciding that everything after a certain number of pages in a paginated set should be noindexed.
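As one way the "noindex everything after a certain number of pages" idea might be implemented, the sketch below chooses a robots meta tag based on the page number within a paginated facet; the threshold of 5 pages is an arbitrary placeholder:

```python
MAX_INDEXABLE_PAGES = 5  # arbitrary placeholder threshold

def robots_meta_for_listing(page_number: int) -> str:
    """Return a robots meta tag for a faceted/paginated listing page.

    Early pages stay indexable so product pages remain reachable through
    them; deeper pages in the paginated set are noindexed.
    """
    if page_number > MAX_INDEXABLE_PAGES:
        return '<meta name="robots" content="noindex, follow">'
    return '<meta name="robots" content="index, follow">'
```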
Small to Medium-Sized Sites Don’t Have to Worry About Crawl Budget
Sites with 'a couple hundred thousand pages' or fewer don't need to worry about crawl budget; Google will be able to crawl them just fine.
Google Will Remember & Recrawl Noindexed Pages
Google will remember noindexed pages and continue to recrawl them, so they should be removed from your sitemaps.
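A hedged way to audit a sitemap for noindexed URLs (so they can be removed) is sketched below, assuming a standard sitemap XML; the sitemap URL is a placeholder and the meta-tag check is deliberately crude:

```python
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)
for loc in root.findall(".//sm:loc", NS):
    url = loc.text.strip()
    resp = requests.get(url, timeout=10)
    noindex_header = "noindex" in resp.headers.get("X-Robots-Tag", "").lower()
    noindex_meta = 'name="robots"' in resp.text.lower() and "noindex" in resp.text.lower()
    if noindex_header or noindex_meta:
        print(f"Noindexed URL listed in the sitemap (consider removing): {url}")
```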