Crawling
Before a page can be indexed (and therefore appear within search results), it must first be crawled by search engine crawlers like Googlebot. There are many factors to consider in getting pages crawled and ensuring they adhere to the correct guidelines. These are covered in our SEO Office Hours notes, along with further research and recommendations.
For more SEO knowledge on crawling and to optimize your site’s crawlability, check out Lumar’s additional resources:
404 or 410 Status Codes Will Not Impact a Website’s Rankings
If Google identifies 404 or 410 pages on a site, it will continue to crawl them in case anything changes, but will gradually reduce the crawl frequency to concentrate more on the pages that return 200 status codes.
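As a rough illustration of returning the appropriate code for removed content, here is a minimal Flask sketch; the routes and the list of permanently removed URLs are placeholders, not a prescribed setup.

```python
# Minimal sketch: return 410 ("Gone") for pages removed for good, 404 otherwise.
# The removed-URL list and routes are illustrative placeholders.
from flask import Flask, abort

app = Flask(__name__)

PERMANENTLY_REMOVED = {"/old-campaign", "/discontinued-product"}

@app.route("/<path:page>")
def serve_page(page):
    path = "/" + page
    if path in PERMANENTLY_REMOVED:
        abort(410)  # removal is permanent
    abort(404)      # URL not known

if __name__ == "__main__":
    app.run()
```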
Last Modification Dates Important For Recrawling Changed Pages on Large Sites
Including last modification dates on large sites can be important because it helps Google prioritize the crawling of a changed page that might otherwise take much longer to be recrawled.
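One common place to surface last modification dates is the &lt;lastmod&gt; field of an XML sitemap; the sketch below, using only the Python standard library, assumes that approach, and the URLs and dates are illustrative.

```python
# Write an XML sitemap with <lastmod> dates using only the standard library.
import xml.etree.ElementTree as ET

PAGES = [
    ("https://www.example.com/", "2024-05-01"),
    ("https://www.example.com/blog/latest-post", "2024-05-20"),
]

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)
for loc, lastmod in PAGES:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    # Only update lastmod when the content genuinely changes.
    ET.SubElement(url, "lastmod").text = lastmod

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```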
Google Has a Separate User Agent For Crawling Sitemaps & For GSC Verification
Google has a separate user agent that fetches the sitemap file, as well as one that crawls for GSC verification. John recommends making sure you are not blocking either of these.
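A quick way to sanity-check this is to test robots.txt against the relevant user-agent tokens. The tokens in the sketch below are placeholders; swap in the exact names from Google's crawler documentation for sitemap fetching and Search Console verification.

```python
# Check that robots.txt does not block particular user-agent tokens.
# The tokens and URLs here are placeholders.
from urllib import robotparser

AGENTS_TO_CHECK = ["Googlebot", "Google-Site-Verification"]  # placeholder tokens
URLS_TO_CHECK = ["https://www.example.com/sitemap.xml", "https://www.example.com/"]

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

for agent in AGENTS_TO_CHECK:
    for url in URLS_TO_CHECK:
        allowed = rp.can_fetch(agent, url)
        print(f"{agent} -> {url}: {'allowed' if allowed else 'BLOCKED'}")
```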
Blocking Googlebot’s IP is the Best Way to Prevent Google From Crawling Your Site While Allowing Other Tools to Access It
If you want to block Googlebot from crawling a staging site but still allow other crawling tools access, John recommends whitelisting the IPs of the users and tools that need to view the site and disallowing Googlebot. This is because Google may crawl pages it finds on a site even if they have a noindex tag, or index pages without crawling them even if they are blocked in robots.txt.
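A minimal Flask sketch of the idea, with placeholder IP addresses standing in for your team and the tools you use: only listed addresses can reach the staging site, and everyone else, including Googlebot, receives a 403.

```python
# IP allowlisting for a staging site: anyone not on the list gets a 403.
from flask import Flask, request, abort

app = Flask(__name__)

ALLOWED_IPS = {"203.0.113.10", "203.0.113.11"}  # placeholder office/tool IPs

@app.before_request
def restrict_to_allowlist():
    # Behind a proxy or load balancer, read the client IP from a trusted
    # forwarded header instead of request.remote_addr.
    if request.remote_addr not in ALLOWED_IPS:
        abort(403)

@app.route("/")
def home():
    return "Staging site"

if __name__ == "__main__":
    app.run()
```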
Ensure Google is Able to Crawl All Pages Involved Within Infinite Scroll
When implementing infinite scroll, ensure Google is able to reach all of the pages involved. John recommends linking to each page individually through a pagination setup as the best way to ensure every page can be crawled.
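A rough Flask sketch of paginated listing pages that can sit alongside an infinite-scroll UI: every page of results is reachable through a plain link, so a crawler that does not scroll can still discover them. The routes, page size, and item data are illustrative.

```python
# Paginated listing pages with crawlable previous/next links.
from flask import Flask

app = Flask(__name__)

ITEMS = [f"Item {i}" for i in range(1, 101)]
PAGE_SIZE = 10

@app.route("/products/")
@app.route("/products/page/<int:page>")
def product_listing(page=1):
    start = (page - 1) * PAGE_SIZE
    items = ITEMS[start:start + PAGE_SIZE]
    links = []
    if page > 1:
        links.append(f'<a href="/products/page/{page - 1}">Previous</a>')
    if start + PAGE_SIZE < len(ITEMS):
        links.append(f'<a href="/products/page/{page + 1}">Next</a>')
    body = "".join(f"<li>{item}</li>" for item in items)
    return f"<ul>{body}</ul><nav>{' '.join(links)}</nav>"

if __name__ == "__main__":
    app.run()
```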
Google is Unable to Crawl User-triggered Events
Googlebot cannot trigger user events, such as content that loads once a user scrolls. John recommends using dynamic rendering so this content can be crawled, and ensuring the content is reachable via a link rather than an interaction.
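A simplified sketch of dynamic rendering: requests from known bot user agents are served a pre-rendered HTML snapshot, while regular users get the JavaScript-driven page. The bot list and inline HTML are placeholders; in practice a prerendering service or headless browser would produce the snapshots.

```python
# Serve pre-rendered HTML to bots, the client-side app to everyone else.
from flask import Flask, request

app = Flask(__name__)

BOT_TOKENS = ("googlebot", "bingbot")  # placeholder crawler tokens

def is_bot(user_agent: str) -> bool:
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)

@app.route("/article/<slug>")
def article(slug):
    if is_bot(request.headers.get("User-Agent", "")):
        # Bots get the full content already in the markup.
        return f"<html><body><h1>{slug}</h1><p>Full pre-rendered content here.</p></body></html>"
    # Regular users get the client-side app, which loads content on interaction.
    return f"<html><body><div id='app' data-slug='{slug}'></div><script src='/app.js'></script></body></html>"

if __name__ == "__main__":
    app.run()
```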
Prevent Search Engines From Crawling Low Quality UGC
When working with user-generated content, John recommends filtering it so that search engines see the high-quality pages rather than the lower-quality content.
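One hedged way to sketch this: pages below a quality threshold get a noindex robots meta tag so search engines focus on the better content (noindex controls indexing; keeping crawlers away entirely would instead be handled in robots.txt). The data store, scoring heuristic, and threshold below are entirely hypothetical.

```python
# Mark low-quality UGC pages noindex based on a hypothetical quality score.
from flask import Flask

app = Flask(__name__)

# Hypothetical UGC store; real systems would use their own moderation signals.
THREADS = {
    1: "A long, detailed answer explaining the fix step by step...",
    2: "buy cheap stuff here http://spam.example http://spam.example",
}

def quality_score(text: str) -> float:
    # Toy heuristic: longer, link-free contributions score higher.
    return min(len(text) / 80.0, 1.0) - 0.5 * text.lower().count("http")

@app.route("/forum/<int:thread_id>")
def thread(thread_id):
    text = THREADS.get(thread_id, "")
    robots = "" if quality_score(text) >= 0.4 else '<meta name="robots" content="noindex">'
    return f"<html><head>{robots}</head><body><p>{text}</p></body></html>"

if __name__ == "__main__":
    app.run()
```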
Google Can Periodically Try to Recrawl 5xx Error Pages
If a page returns a server error for as long as a week, Google can treat it in a similar way to a 404: it will reduce crawling of that page and remove it from the index, but will still revisit the page every now and again to see whether the content is available again. If it is, the page will be reindexed.
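A small stdlib-only monitoring sketch that flags URLs returning 5xx errors, so persistent server errors can be fixed before they linger for a week; the URL list is illustrative.

```python
# Flag important URLs that return 5xx status codes.
from urllib import request, error

URLS = ["https://www.example.com/", "https://www.example.com/category/widgets"]

for url in URLS:
    try:
        with request.urlopen(url, timeout=10) as resp:
            status = resp.status
    except error.HTTPError as exc:
        status = exc.code
    except error.URLError as exc:
        print(f"{url}: request failed ({exc.reason})")
        continue
    if 500 <= status < 600:
        print(f"{url}: server error {status} -- investigate before it lingers")
    else:
        print(f"{url}: {status}")
```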
Google Can Crawl Different Parts of a Website at Different Speeds
Google is able to detect how frequently the different sections of a site are updated and crawl them at different speeds, so that the frequently changing pages are crawled more regularly.
Google Determines if Pages Need to be Rendered by Comparing Content Found in Initial HTML & Rendered DOM
Google compares the content of a page’s raw HTML from the initial crawl with the rendered DOM to see whether rendering produces new content, and uses this to determine whether the page needs to be rendered going forward.
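A rough diagnostic sketch of the same comparison for your own pages: extract the visible text from the raw HTML and from the rendered DOM and see what rendering adds. Here both documents are supplied as strings; in practice the rendered HTML would come from a headless browser.

```python
# Compare text in raw HTML vs rendered DOM to see what rendering adds.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def visible_text(html: str) -> set:
    parser = TextExtractor()
    parser.feed(html)
    return set(parser.chunks)

raw_html = "<html><body><h1>Product</h1><div id='reviews'></div></body></html>"
rendered_html = "<html><body><h1>Product</h1><div id='reviews'><p>Great!</p></div></body></html>"

added = visible_text(rendered_html) - visible_text(raw_html)
print("Content only present after rendering:", added or "none")
```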