Crawling
Before a page can be indexed (and therefore appear in search results), it must first be crawled by search engine crawlers like Googlebot. There are many factors to consider in order to get pages crawled and to ensure your site adheres to the relevant guidelines. These are covered in our SEO Office Hours notes, along with further research and recommendations.
For more SEO knowledge on crawling and to optimize your site’s crawlability, check out Lumar’s additional resources.
Google Only Needs to Crawl Facet Pages That Include Otherwise Unlinked Products
For ecommerce sites, if Google can access and crawl all of your products through the main category page, then it won’t need to crawl any of the facet pages. However, facets should be made crawlable if they contain products that aren’t linked to from anywhere else on the site.
Use Crawlers Like Lumar to Understand Which Pages Can be Crawled
John recommends using crawlers like Lumar or Screaming Frog to understand which product pages Google can crawl on an ecommerce site.
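As a rough illustration of what these tools do, the sketch below follows internal links from a starting category URL and records every page it can reach. It assumes the requests and beautifulsoup4 packages are installed, and the start URL and page cap are placeholder values; dedicated crawlers like Lumar handle rendering, robots.txt, and scale far more thoroughly.

```python
# Minimal link-following sketch: breadth-first crawl of internal links only.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def reachable_urls(start_url, max_pages=500):
    """Follow internal links from start_url and return every URL reached."""
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Example use: compare the crawled set against your full product URL list to
# find products that a link-following crawler (and therefore Googlebot) misses.
```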
Personalization is Fine For Google But US Version Will be Indexed
It is fine to personalize content for your users, but it is important to be aware that Googlebot crawls from the US and will index the content it sees on the US version of the page. John recommends keeping a sizeable amount of content consistent across all versions of the page if possible.
Google Caches CSS & JS Files so It Doesn’t Need to Continuously Fetch Them
Google caches resources such as CSS and JavaScript files so that it doesn’t have to fetch them again in the future. Combining multiple CSS files into one can help Googlebot with this, as can minifying JavaScript.
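As a loose illustration of the “combine your CSS” advice, here is a minimal build-step sketch that concatenates stylesheets into a single bundle. The directory and file names are placeholders; real projects typically use a dedicated build tool that also minifies the output.

```python
# Sketch of a simple build step that merges several CSS files into one bundle,
# so visitors and crawlers fetch (and cache) a single stylesheet.
from pathlib import Path

def bundle_css(source_dir="static/css", bundle_name="bundle.css"):
    parts = []
    for css_file in sorted(Path(source_dir).glob("*.css")):
        if css_file.name == bundle_name:
            continue  # skip a previously generated bundle
        parts.append(f"/* {css_file.name} */\n{css_file.read_text()}")
    Path(source_dir, bundle_name).write_text("\n".join(parts))

if __name__ == "__main__":
    bundle_css()
```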
A Sitemap File Won’t Replace Normal Crawling
A sitemap will help Google crawl a website, but it won’t replace normal crawling, such as URL discovery through internal linking. Sitemaps are most useful for letting Google know about changes to the pages they contain.
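To illustrate how a sitemap communicates changes, here is a small sketch that generates a sitemap.xml with <lastmod> dates using Python’s standard library. The URLs and dates are placeholder examples.

```python
# Sketch: build a sitemap.xml whose <lastmod> values signal page changes.
import xml.etree.ElementTree as ET

PAGES = [
    ("https://www.example.com/", "2023-01-15"),
    ("https://www.example.com/category/widgets", "2023-01-10"),
]

def build_sitemap(pages, path="sitemap.xml"):
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

build_sitemap(PAGES)
```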
Pages Blocking US Access Also Need to Block Googlebot to Avoid Cloaking
If you need to block content from being accessed in the US or California, then you will need to block Googlebot as well; otherwise, Google may see this as cloaking. One option is to provide some general information that can be seen by visitors in the US.
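One hedged sketch of how that could be wired up, assuming a Flask application behind a CDN that supplies a visitor-country header (Cloudflare’s CF-IPCountry is used here as an example): US visitors and Googlebot are served the same general summary, so what Google indexes matches what US users can actually see.

```python
# Sketch (Flask assumed): serve the same general-information page to US
# visitors and to Googlebot, so the crawler never sees content US users can't.
from flask import Flask, request

app = Flask(__name__)

def is_googlebot(user_agent: str) -> bool:
    return "googlebot" in (user_agent or "").lower()

@app.route("/restricted-article")
def restricted_article():
    country = request.headers.get("CF-IPCountry", "")  # CDN-supplied header
    if country == "US" or is_googlebot(request.headers.get("User-Agent")):
        # US visitors and Googlebot see identical general content, so what
        # Google indexes matches what US users can access (no cloaking).
        return "General summary available to everyone, including US visitors."
    return "Full article content for regions where it can be shown."
```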
Block Staging Sites From Being Crawled by Google
You should block Google from crawling and indexing your staging site, as staging URLs appearing in search results can cause problems. You can block access based on Googlebot’s user agent, or by using robots.txt.
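A minimal sketch of both approaches, assuming a Flask application and an APP_ENV environment variable to identify the staging environment (the names are illustrative): the staging site serves a disallow-all robots.txt and also refuses requests identifying as Googlebot.

```python
# Sketch (Flask assumed): keep a staging site out of Google by serving a
# disallow-all robots.txt and refusing requests whose user agent is Googlebot.
import os
from flask import Flask, Response, request, abort

app = Flask(__name__)
IS_STAGING = os.environ.get("APP_ENV") == "staging"

@app.before_request
def block_googlebot_on_staging():
    user_agent = (request.headers.get("User-Agent") or "").lower()
    if IS_STAGING and "googlebot" in user_agent:
        abort(403)

@app.route("/robots.txt")
def robots():
    body = "User-agent: *\nDisallow: /\n" if IS_STAGING else "User-agent: *\nAllow: /\n"
    return Response(body, mimetype="text/plain")
```

Real setups often go further and put staging behind HTTP authentication or an IP allowlist, which keeps both crawlers and the public out.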
Crawling But Not Indexing Pages is Normal for Pages with Content Already on Other Indexed Pages
It’s normal for Google to crawl URLs but not index them if they aren’t considered useful for search, such as index and archive pages whose content is already indexed on other pages. This has been the case for a long time, but these pages have become more visible recently due to the ‘Crawled – currently not indexed’ report in Search Console.
Blocking Proxy IP Addresses is Fine for Google
Choosing to block proxy IP addresses from crawling or accessing a website won’t cause any problems for SEO as long as Googlebot can crawl the site, but you may lose out on additional users discovering your website.
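If you do maintain an IP blocklist, it can help to confirm that an address is not genuine Googlebot before blocking it. The sketch below uses the reverse-then-forward DNS check that Google documents for verifying Googlebot; the surrounding blocklist logic is left as a placeholder.

```python
# Sketch: verify whether an IP really belongs to Googlebot before blocking it,
# using a reverse DNS lookup followed by a confirming forward lookup.
import socket

def is_verified_googlebot(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]             # reverse DNS lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]  # forward DNS must match
    except socket.error:
        return False

# Example: only block addresses that fail verification.
# if not is_verified_googlebot(candidate_ip):
#     add_to_blocklist(candidate_ip)  # hypothetical helper in your own code
```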
Pages With Long Download Times Reduce Googlebot’s Crawl Budget
If a page takes a long time to download, this uses up Googlebot’s crawl budget, meaning it has less time to crawl other pages on your site. Look at the ‘time spent downloading a page’ report in Google Search Console to spot these issues.
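A quick way to spot-check this outside of Search Console is to time a sample of fetches yourself. The sketch below assumes the requests package; the URL list and the 1.5-second threshold are arbitrary examples, not Google thresholds.

```python
# Sketch: time page downloads for a handful of URLs and flag the slow ones.
import time
import requests

URLS = [
    "https://www.example.com/",
    "https://www.example.com/category/widgets",
]

SLOW_THRESHOLD_SECONDS = 1.5  # arbitrary example cut-off

for url in URLS:
    start = time.monotonic()
    response = requests.get(url, timeout=30)
    elapsed = time.monotonic() - start
    flag = "SLOW" if elapsed > SLOW_THRESHOLD_SECONDS else "ok"
    print(f"{flag:4} {elapsed:.2f}s {len(response.content):>8} bytes {url}")
```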