Crawling
Before a page can be indexed (and therefore appear in search results), it must first be crawled by search engine crawlers like Googlebot. There are many things to consider in order to get pages crawled and to ensure they adhere to the correct guidelines. These are covered in our SEO Office Hours notes below, along with further research and recommendations.
For more SEO knowledge on crawling and to optimize your site’s crawlability, check out Lumar’s additional resources.
URLs in JavaScript May be Crawled
Google won’t see content that is only loaded via an onclick event, but it will find URLs inside the JavaScript code itself and try to crawl them. Content has to be loaded onto the page by default, without requiring an onclick, for Google to see it.
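As a rough sketch (the element IDs and the /fragments/more-info URL are purely illustrative), content fetched only inside a click handler is never seen by Googlebot, although the URL string embedded in the script may still be discovered and crawled, whereas content loaded by default on page load can be rendered and indexed.

```typescript
// Illustrative only: element IDs and the /fragments/more-info URL are placeholders.

// Content injected only after a click: Googlebot does not click, so this content
// is never part of the rendered page it sees. The "/fragments/more-info" string
// itself, however, may be found in the script and crawled as a URL.
document.getElementById("read-more")?.addEventListener("click", async () => {
  const res = await fetch("/fragments/more-info");
  document.getElementById("details")!.innerHTML = await res.text();
});

// Loading the same content by default (on page load, or better still server-side)
// means it is present in the rendered page and can be seen by Google.
window.addEventListener("DOMContentLoaded", async () => {
  const res = await fetch("/fragments/more-info");
  document.getElementById("details")!.innerHTML = await res.text();
});
```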
Google Identifies Boilerplate Content
John Mueller discusses how Google tries to understand the structure of pages in order to identify the standard boilerplate elements on a page.
Hidden Content Gets Less Weight
Google tries to detect content which isn’t visible when the page is rendered and gives it less weight than content which is visible.
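For example (a hypothetical tabbed product page; the element IDs are placeholders), content that exists in the DOM but stays hidden until the user interacts with the page is the kind of content this note refers to.

```typescript
// Hypothetical tab behaviour: the specs panel exists in the HTML but is hidden
// when the page is rendered, so Google may give its text less weight than the
// content that is visible on load.
const specsPanel = document.getElementById("specs-panel") as HTMLElement;
specsPanel.hidden = true; // not visible in the initially rendered page

document.getElementById("specs-tab")?.addEventListener("click", () => {
  specsPanel.hidden = false; // only becomes visible after user interaction
});
```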
Google Queues Large Volumes of New URLs
If Google discovers a section of your site with a large number of new URLs, it may queue those URLs and report an error in Search Console, but it will continue to crawl the queued URLs over an extended period.
Important URLs are Crawled Before Unimportant URLs
Google doesn’t start crawling unimportant URLs until it thinks it has crawled the important pages.
Google Ignores Content on 404 Pages But Recrawls Them
Google ignores all content on pages which return a 404 status, but will continue to crawl them periodically.
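One practical implication is that error pages should return a genuine 404 status rather than a 200 (a “soft 404”). A minimal sketch using Express (the framework, route, and page copy are assumptions, not part of the note):

```typescript
import express from "express";

const app = express();

// Catch-all for unknown URLs: return a real 404 status. Google ignores the
// body of this page for indexing, but may recrawl the URL from time to time
// to check whether it has come back.
app.use((req, res) => {
  res.status(404).send("<h1>Page not found</h1><p>Try our <a href='/'>homepage</a>.</p>");
});

app.listen(3000);
```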
HTML Sitemaps Help Indexing and Crawling
If you have a complicated website, providing an HTML sitemap of your category pages can help Google find pages and understand your site’s structure.
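A sketch of what such a mapping might look like (the categories and URLs below are invented): a plain HTML page of standard anchor links to your category pages, which crawlers can follow without executing any script.

```typescript
// Invented category data; in practice this would come from your CMS or database.
const categories = [
  { name: "Mens Shoes", url: "/mens/shoes/" },
  { name: "Womens Shoes", url: "/womens/shoes/" },
  { name: "Sale", url: "/sale/" },
];

// Render a simple HTML sitemap page: plain <a href> links that expose the
// site's category structure to crawlers.
const htmlSitemap = `<ul>
${categories.map((c) => `  <li><a href="${c.url}">${c.name}</a></li>`).join("\n")}
</ul>`;

console.log(htmlSitemap);
```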
Duplicate Content Makes Large Sites Harder to Crawl
On large websites, duplicate content makes the site harder to crawl, as crawl budget is spent fetching duplicate URLs rather than unique content.
Boilerplate Content Makes it Harder to Find Relevant Content
If your navigation is very large, it can add a lot of text to the page, which can make it harder for Google to identify the parts of the page that are relevant. Google tries to identify boilerplate elements that it can ignore, but the harder this is, the more likely it is that genuine content will not be classified as relevant.
Googlebot Doesn’t Support HTTP/2
Googlebot doesn’t support HTTP/2-only crawling, so your website still needs to work over HTTP/1.1.
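A minimal sketch of serving both protocols with Node’s http2 module (the certificate paths are placeholders): allowHTTP1 keeps the site reachable for clients that only speak HTTP/1.x, which, per this note, includes Googlebot.

```typescript
import { createSecureServer } from "node:http2";
import { readFileSync } from "node:fs";

// Placeholder certificate paths. allowHTTP1 lets the same server answer
// HTTP/1.1 requests as well as HTTP/2, so crawlers that don't negotiate
// HTTP/2 can still fetch the site.
const server = createSecureServer({
  key: readFileSync("./certs/example.com.key"),
  cert: readFileSync("./certs/example.com.crt"),
  allowHTTP1: true,
});

server.on("request", (req, res) => {
  res.writeHead(200, { "content-type": "text/html" });
  res.end("<h1>Hello</h1>");
});

server.listen(443);
```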