Crawling
Before a page can be indexed (and therefore appear within search results), it must first be crawled by search engine crawlers like Googlebot. There are many factors to consider in getting pages crawled and ensuring they adhere to the correct guidelines. These are covered in our SEO Office Hours notes, along with further research and recommendations.
For more SEO knowledge on crawling and to optimize your site’s crawlability, check out Lumar’s additional resources:
404 Pages Crawled Less Than Noindex
For expired/removed content, John says that Google prefers a 404, as it results in less crawling than a noindex.
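As a rough illustration of the difference (the responses below are generic placeholders), a 404 tells Google the page is gone, while a noindex page still returns 200 and must keep being re-crawled for Google to see the directive:

    # Preferred for expired/removed content: the server answers with a 404
    HTTP/1.1 404 Not Found

    # Alternative: the page answers 200 OK but asks not to be indexed;
    # Google has to keep re-crawling it to see this directive
    HTTP/1.1 200 OK
    Content-Type: text/html

    <meta name="robots" content="noindex">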
Fetch & Render Shows Results for a Googlebot and Browser User Agent
The Fetch and Render tool shows you two different renders: one for Googlebot, fetched with the Googlebot user agent, and one for users, fetched with a browser user agent. If JS/CSS is disallowed for Googlebot, it may not be able to render all of the content in the same way.
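As a sketch of how this situation arises (the paths here are hypothetical), a robots.txt rule like the following stops Googlebot from fetching the scripts and styles a page depends on, so the Googlebot render can differ from the browser render:

    User-agent: Googlebot
    Disallow: /assets/js/
    Disallow: /assets/css/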
Clean HTML and Structured Data Helps Google Understand Content
Clean HTML and structured markup help Google better understand context.
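For example, structured data is commonly added as schema.org JSON-LD in the page head; this snippet is a generic illustration rather than a required format:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "Example Article Title",
      "author": { "@type": "Person", "name": "Jane Doe" },
      "datePublished": "2020-01-01"
    }
    </script>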
URLs in JavaScript May Be Crawled
JavaScript variables which look like URLs may be crawled, which can generate server errors. These errors can be safely ignored, or the paths can be blocked with robots.txt.
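For instance, a script may contain a string that merely looks like a crawlable path (the variable and path below are made up); if the resulting requests generate server errors, the path can be disallowed:

    // In page JavaScript: a URL-like string Googlebot may try to fetch
    var endpoint = "/api/v1/track";

    # In robots.txt: stop those crawl attempts
    User-agent: *
    Disallow: /api/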
HTML Crawling Faster Than JavaScript for Page Discovery
JavaScript processing takes longer than pure HTML crawling, so it isn’t suitable for fast discovery of pages. John says ‘it takes another cycle or two longer to process’.
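To illustrate the difference, a plain HTML link can be discovered on the first crawl pass, while a link injected by JavaScript is only found after the page is rendered, a cycle or two later:

    <!-- Discovered immediately from the raw HTML -->
    <a href="/new-article">New article</a>

    <!-- Only discovered after rendering -->
    <script>
      document.body.insertAdjacentHTML(
        "beforeend",
        '<a href="/new-article">New article</a>'
      );
    </script>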
Image Re-Crawling Takes Longer After a URL Change
Images are not crawled very frequently, so when you migrate them to new URLs/domains, re-crawling will take a lot longer than for pages, perhaps months.
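When migrating images, permanent redirects from the old URLs are the standard way to help Google pick up the change; a minimal Apache example (the domain and path are placeholders):

    Redirect 301 /images/product-photo.jpg https://cdn.example.com/images/product-photo.jpg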
Wildcard Subdomain Configuration Causes Crawl Issues
Using wildcard subdomains can make a site difficult to crawl, as every hostname resolves and crawlers can discover an effectively unbounded set of subdomains.
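A wildcard DNS record is what creates this situation: any subdomain a crawler encounters will resolve and return a page. A BIND-style example (the address is a documentation placeholder):

    ; shop.example.com, blog.example.com, and any typo all resolve here
    *.example.com.   3600   IN   A   192.0.2.1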
CSS and JS Crawling Is Important for Mobile Compatibility
Allowing your CSS and JavaScript files to be crawled does affect desktop pages, but it is more important for mobile pages, as Google needs to render them to test for mobile compatibility.
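One way to make sure CSS and JavaScript stay crawlable is to allow them explicitly in robots.txt; Google supports the * and $ wildcards used in this sketch:

    User-agent: Googlebot
    Allow: /*.css$
    Allow: /*.js$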
Noindex Pages Can’t Accumulate PageRank
Noindex pages can’t accumulate PageRank for the site, even though the pages can be crawled, so this isn’t an advantage over disallowing them.
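For reference, this is the directive in question; the page remains fetchable, but it is kept out of the index and, per John’s point, accumulates no PageRank for the site:

    <meta name="robots" content="noindex">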
Disallowed URLs Don’t Pass PageRank
If a URL is disallowed in robots.txt, it won’t be crawled, and therefore can’t pass any PageRank.
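In contrast to noindex, a disallow rule stops the fetch itself (the path below is hypothetical), so Google never sees the page or the links on it:

    User-agent: *
    Disallow: /private-section/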