Crawling
Before a page can be indexed (and therefore appear in search results), it must first be crawled by search engine crawlers such as Googlebot. There is a lot to consider when getting pages crawled and ensuring they adhere to the relevant guidelines. These topics are covered in our SEO Office Hours notes, along with further research and recommendations.
For more SEO knowledge on crawling and to optimize your site’s crawlability, check out Lumar’s additional resources:
Use Mobile Friendly Test to Check if Googlebot Can Access Page
Use the mobile-friendly test as an easy check to see if Googlebot can access a page. This will fetch the page with a Googlebot user agent and show you a screenshot of what was found.
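If you want a rough local approximation of that check, you can request the page with a Googlebot user-agent string yourself. The sketch below assumes Python with the requests library; unlike the official tool, it only catches user-agent-based blocking and will not reveal rules that target Google's real crawl IP ranges.

```python
# Rough local check: fetch a page with a Googlebot user-agent string.
# This only detects user-agent-based blocking; IP-based rules aimed at
# Google's real crawl ranges won't show up here, so the official tool
# remains the more reliable check.
import requests

GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

response = requests.get(
    "https://www.example.com/",
    headers={"User-Agent": GOOGLEBOT_UA},
    timeout=10,
)
print(response.status_code)
print(response.text[:500])  # first part of the HTML served to this user agent
```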
Show Paywalled Content to Googlebot Based on User Agent & IP Lookup
It’s OK to show Googlebot the full content of paywalled pages, using the appropriate class names and schema markup, based on its user agent. You can also combine the user agent check with an IP lookup to verify that a request really comes from Googlebot rather than another crawler impersonating it.
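As a rough illustration of the IP lookup mentioned above, the sketch below (assuming Python on the server, and using Google's documented reverse-DNS verification method) checks that a request claiming to be Googlebot resolves back to a Google-owned hostname. The function and variable names are illustrative, not from the source.

```python
# Hypothetical sketch of verifying a Googlebot request by IP, assuming the
# user-agent check has already matched "Googlebot".
import socket

def is_verified_googlebot(ip_address: str) -> bool:
    """Reverse-DNS the IP, confirm the host belongs to Google, then forward-resolve it back."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip_address)  # e.g. crawl-66-249-66-1.googlebot.com
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # The forward lookup must return the original IP to rule out spoofed PTR records.
        return socket.gethostbyname(hostname) == ip_address
    except OSError:
        return False

# Example usage: only serve the full paywalled article when the request is verified.
# if user_agent_claims_googlebot and is_verified_googlebot(remote_ip):
#     serve_full_content()
```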
Google Mainly Uses GET Requests For Normal Crawling & Indexing
Google almost exclusively uses GET requests for normal crawling and indexing. That doesn’t mean you will never see POST or HEAD requests from Googlebot in your server logs, but they are much rarer.
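If you want to see which HTTP methods Googlebot is actually using on your site, a quick log check along these lines can help. This is a rough sketch that assumes an access log in the common/combined format and a hypothetical file path of access.log.

```python
# Count HTTP methods used by requests whose user agent mentions Googlebot.
# Assumes a combined-format access log; adjust the path and parsing as needed.
from collections import Counter

method_counts = Counter()
with open("access.log") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        # In the combined format the request line is the first quoted field,
        # e.g. "GET /page.html HTTP/1.1"
        try:
            request_line = line.split('"')[1]
            method_counts[request_line.split()[0]] += 1
        except IndexError:
            continue

print(method_counts)  # e.g. Counter({'GET': 10234, 'HEAD': 12, 'POST': 3})
```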
Google Does Some Scrolling on Pages
Google does some scrolling on a page to make sure that nothing is missed that would otherwise be overlooked.
Blocking US IPs is Likely to Block Googlebot; Having at Least Some US-Accessible Content is Recommended
Google crawls primarily from the US and only from a handful of other countries. If you block US IP addresses, you are probably blocking Googlebot; you can test this with Fetch & Render or by checking your log files. John recommends keeping at least some content accessible from the US so that both Googlebot and US users can reach your site.
Include a Shared Content Block For Pages That Vary Depending on Location
If the content you serve varies depending on the visitor’s location, John recommends including a shared content block across all variations, as Google primarily crawls from IP addresses geolocated to San Francisco.
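As a rough illustration of the idea (not an implementation from the source), the hypothetical sketch below always renders a shared content block and only layers location-specific content on top of it, so a crawler geolocated to the US still sees the core content.

```python
# Hypothetical sketch: render a shared content block for every visitor and add
# region-specific content on top. Function names and the geolocation step are
# illustrative placeholders, not a real library API.

def build_page(country_code: str) -> str:
    shared_block = render_shared_content()             # identical for all locations, always present
    local_block = render_local_content(country_code)   # offers, pricing, etc. for that region
    return shared_block + local_block

def render_shared_content() -> str:
    return "<section id='core'>Product description, specs and reviews</section>"

def render_local_content(country_code: str) -> str:
    blocks = {
        "US": "<section id='local'>US pricing and shipping</section>",
        "DE": "<section id='local'>EU pricing and shipping</section>",
    }
    return blocks.get(country_code, blocks["US"])  # fall back to the US version
```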
Google May Implement HTTP/2 Crawling as Sites Start Adopting Functionality
Google doesn’t currently crawl with HTTP/2; Googlebot isn’t like a browser, so it wouldn’t see the same speed benefits, although it would be able to cache things differently. Google’s engineers may decide to implement HTTP/2 for Googlebot as more sites start adopting HTTP/2 functionality, such as server push.
Panda is Continuous But Doesn’t Run On Crawl
Panda runs continuously rather than to a set timetable, but it does take some time to collect the relevant signals. John assumes you would see its effects once Google reprocesses the bulk of a website, and how often that happens varies from site to site.
Only Change URLs When Absolutely Necessary, as This Can Cause a Drop in SERPs
John recommends against removing old-fashioned URL suffixes, such as .html, as Google will treat the changed URLs as new ones and will have to recrawl and reindex them while learning the new structure. This can lead to a significant dip in SERPs for a period of time until the URLs have been recrawled and reindexed.
For A/B Testing, Show Googlebot the Version Most Users Will See
When A/B testing, Google recommends showing Googlebot the version that most users see. For a 50/50 test, it is up to webmasters which version to show Googlebot, but Google recommends against randomly varying the displayed version on each request, as this makes it difficult for Google to index the page.
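One way to avoid serving a randomly changing version on every request is to make the variant assignment sticky per visitor. The hypothetical sketch below hashes a visitor ID into a stable bucket and falls back to the default (majority) variant when no ID is available, which is what a stateless crawler like Googlebot would typically receive; the function and parameter names are illustrative.

```python
# Hypothetical sketch of sticky A/B variant assignment. A visitor with a stable
# ID always gets the same variant; a visitor without one (for example a crawler
# that doesn't keep cookies) gets the default variant shown to most users.
import hashlib

def choose_variant(visitor_id: str | None, test_ratio: float = 0.5) -> str:
    if visitor_id is None:
        return "A"  # default / majority variant, also what Googlebot would see
    # Hash the visitor ID into a stable value in [0, 1) so the choice never
    # changes between requests from the same visitor.
    bucket = int(hashlib.sha256(visitor_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    return "B" if bucket < test_ratio else "A"
```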