Crawling
Before a page can be indexed (and therefore appear in search results), it must first be crawled by search engine crawlers like Googlebot. There are many things to consider in order to get pages crawled and to ensure they adhere to search engine guidelines. These are covered in our SEO Office Hours notes, along with further research and recommendations.
For more SEO knowledge on crawling and to optimize your site’s crawlability, check out Lumar’s additional resources.
Combine Separate CSS, JS & Tracking URLs to Increase Googlebot Requests to Your Server
To improve site speed and allow Googlebot to send requests to your server more frequently, reduce the number of separate resource URLs that need to be loaded. For example, combine your CSS files into a single URL, or into as few URLs as possible.
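As a rough sketch (the filenames here are hypothetical), the idea is simply to reference one combined file where several were referenced before:

```html
<!-- Before: three separate stylesheet requests for every page load -->
<link rel="stylesheet" href="/css/base.css">
<link rel="stylesheet" href="/css/layout.css">
<link rel="stylesheet" href="/css/theme.css">

<!-- After: one combined stylesheet, so fewer sub-resource URLs need to be fetched -->
<link rel="stylesheet" href="/css/bundle.css">
```

The same principle applies to JavaScript and tracking scripts: fewer distinct URLs per page means fewer requests spent on sub-resources.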
Test How Search Engines Can Crawl Internal Linking Using Crawling Tools
John recommends using tools like Lumar to test how your internal linking is set up and whether there are any technical issues which would prevent Googlebot from crawling certain pages on your website.
‘Discovered – Currently not indexed’ GSC Report Pages Have No Value for Crawling & Indexing
Google knows about pages in the ‘Discovered – currently not indexed’ report in Google Search Console but hasn’t prioritised them for crawling and indexing. This is usually due to internal linking and content duplication issues.
Update Last Modified Date in Sitemap & Use Validate Fix in GSC to Get Pages Crawled Sooner
If technical issues cause pages to show incorrectly (e.g. serving a blank page), you can get Googlebot to recrawl these pages sooner by submitting sitemap files with the last modification date set to when the affected pages were restored. You can also click ‘Validate Fix’ on pages with errors in Search Console to get Googlebot to recrawl them faster.
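A minimal sitemap sketch (the URL and date are hypothetical) showing the <lastmod> element you would update for each restored page:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- Hypothetical page that was serving a blank response and has since been fixed -->
    <loc>https://www.example.com/affected-page/</loc>
    <!-- Set to the date the page was restored, signalling it is worth recrawling -->
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>
```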
Googlebot is Limited to Crawling a Couple of Hundred MB for Each HTML Page
Most sites shouldn’t need to worry about their pages being too large for Google to crawl, as John explained that the cut-off size for each page’s HTML is a couple of hundred MB.
Googlebot Doesn’t Use Sites’ Internal Search Features to Find Pages
Googlebot doesn’t know what to search for on a site, so it doesn’t use a site’s internal search for content discovery. The rare exception is when a site isn’t otherwise crawlable and pages can only be discovered through internal search.
For Mobile-first, Ranking Fluctuations Are Caused by Google Recrawling and Reprocessing a Site
If a site experiences ranking fluctuations after being switched to mobile-first indexing, this is because Google will need to recrawl and reprocess the site to update the index.
Block Ads From Being Crawled to Avoid Ranking For Unintended Queries
Ads which are inline with the main text of a page can be picked up by Google as part of the content of that page. This could cause the page to rank for queries related to the text in the ad. John recommends preventing the ads from passing PageRank and serving them via JavaScript that is blocked in robots.txt, so the ad content isn’t crawled as part of the page.
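A minimal robots.txt sketch of this approach, assuming (hypothetically) that the ads are injected by scripts served from an /assets/ads/ directory on your own domain:

```txt
# Disallow the directory that serves the ad JavaScript, so Googlebot cannot fetch
# the ad content and fold its text into the page's indexed content.
User-agent: *
Disallow: /assets/ads/
```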
Google Indirectly Interprets Charts & Graphs to Understand Context
Google doesn’t interpret charts and graphs to check whether the numbers or information they show are useful and correct. However, it collects indirect signals (such as the text on the page, titles, descriptions, alt text, etc.) to understand the context of the page.
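For example (a hypothetical snippet), descriptive alt text and surrounding copy are the kind of indirect signals Google can use in place of reading the chart itself:

```html
<!-- Hypothetical chart image: the alt text describes what the chart shows,
     giving Google context it cannot extract from the graphic directly -->
<img src="/images/organic-traffic-2023.png"
     alt="Line chart of monthly organic traffic in 2023, rising from 10,000 to 25,000 sessions">
```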
Google Crawls Using Local IP Addresses For Countries Where US IPs Are Frequently Blocked
Google will crawl with local IP addresses, particularly for countries where US IP addresses are frequently blocked, e.g. South Korea.