What happens once a search engine has finished crawling a page? Let’s take a look at the indexing process that search engines use to store information about web pages, enabling them to quickly return relevant, high-quality results.
Why do search engines need to index pages?
Remember the days before the internet when you’d have to consult an encyclopedia to learn about the world and dig through the Yellow Pages to find a plumber? Even in the early days of the web, before search engines, we had to search through directories to retrieve information. What a time-consuming process. How did we ever have the patience?
Search engines have revolutionized information retrieval to the extent that users expect near-instantaneous responses to their search queries.
What is search engine indexing?
Indexing is the process by which search engines organize information before a search to enable super-fast responses to queries.
Scanning every individual page for keywords and topics at query time would be far too slow a way for search engines to identify relevant information. Instead, search engines (including Google) use an inverted index, also known as a reverse index.
What is an inverted index?
An inverted index is a system in which a database of text elements is compiled along with pointers to the documents that contain those elements. Search engines use a process called tokenization to break text into individual terms, which are typically normalized to a base form (for example, by lowercasing or stemming), reducing the resources needed to store and retrieve data. This is a much faster approach than listing all known documents against all relevant keywords and characters.
An example of inverted indexing
Below is a very basic example that illustrates the concept of inverted indexing. In the example, you can see that each keyword (or token) is associated with a row of documents in which that element was identified.
Keyword | Document Path 1 | Document Path 2 | Document Path 3 |
SEO | example.com/seo-tips | moz.com | … |
HTTPS | deepcrawl.co.uk/https-speed | example.com/https-future | … |
This example uses URLs, but these might instead be document IDs, depending on how the search engine is structured.
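The idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine implements its index: the page text is made up, the tokenizer only lowercases and splits on non-alphanumeric characters, and the function names (`tokenize`, `build_inverted_index`, `search`) are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical page text, keyed by URL (a document ID would work the same way).
documents = {
    "example.com/seo-tips": "SEO tips for beginners",
    "example.com/https-future": "The future of HTTPS and SEO",
    "deepcrawl.co.uk/https-speed": "HTTPS speed improvements",
}

index = build_inverted_index(documents)
print(sorted(index["https"]))
print(sorted(search(index, "HTTPS SEO")))
```

Because the lookup goes from token to documents, answering a query only touches the entries for the query's own tokens rather than the text of every page.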
The cached version of a page
In addition to indexing pages, search engines may also store a highly compressed text-only version of a document including all HTML and metadata.
The cached document is the latest snapshot of the page that the search engine has seen.
The cached version of a page can be accessed (in Google) by clicking the little green arrow next to each search result’s URL and selecting the cached option. Alternatively, you can use the ‘cache:’ Google search operator to view the cached version of the page.
Bing offers the same facility to view the cached version of a page via a green down arrow next to each search result but doesn’t currently support the ‘cache:’ search operator.
What is PageRank?
“PageRank” is a Google algorithm named after Google co-founder Larry Page (yes, really!). It assigns each page a value, calculated from the links pointing at that page, in order to determine the page’s value relative to every other page on the internet. The value passed by each individual link is based on the number and value of links that point to the linking page.
PageRank is just one of the many signals used within the large Google ranking algorithm.
An approximation of PageRank values was initially provided by Google, but these values are no longer publicly visible.
While PageRank is a Google term, all commercial search engines calculate and use an equivalent link equity metric. Some SEO tools try to estimate PageRank using their own logic and calculations, for example, Page Authority in Moz tools, Trust Flow in Majestic, or URL Rating in Ahrefs. Lumar has a metric called DeepRank to measure the value of pages based on the internal links within a website.
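Because each page's value depends on the value of the pages linking to it, a link-based score like this is usually computed by repeated passes over the link graph. The sketch below follows the simplified textbook formulation of PageRank and is not Google's production algorithm. The damping factor of 0.85, the fixed iteration count, the even spreading of rank from pages with no outgoing links, and the four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}.

    damping and iterations are illustrative defaults, not known production values.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each pass with a small baseline share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank equally across its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical four-page site: "c" receives links from "a", "b", and "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most inbound links ends up with the highest score, and the page that nothing links to keeps only the baseline share.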
How PageRank flows through pages
Pages pass PageRank, or link equity, through to other pages via links. When a page links to content elsewhere it is seen as a vote of confidence and trust, in that the content being linked to is being recommended as relevant and useful for users. The count of these links — and the measure of how authoritative the linking website is — determines the relative PageRank of the linked-to page.
PageRank is divided equally across all discovered links on the page. For example, if your page has five links, each link would pass 20% of the page’s PageRank to its target page. Links that use the rel="nofollow" attribute do not pass PageRank.
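The arithmetic of that split can be sketched as follows. The function name `link_shares` and the URLs are hypothetical. The sketch also assumes that a nofollow link still counts toward the number of links the value is divided by, so its share is simply not passed on; the text above does not specify this, and real search engines may handle it differently.

```python
def link_shares(page_rank, links):
    """Return {target: value passed} for a page's outgoing links.

    links is a list of (target_url, is_nofollow) pairs.
    Assumption: nofollow links count toward the divisor but pass nothing.
    """
    if not links:
        return {}
    per_link = page_rank / len(links)
    shares = {}
    for target, is_nofollow in links:
        if is_nofollow:
            continue  # this link's share is not passed to the target
        shares[target] = shares.get(target, 0.0) + per_link
    return shares


# Five followed links: each passes 20% of the page's value.
five_links = [(f"example.com/page-{i}", False) for i in range(1, 6)]
print(link_shares(1.0, five_links))

# Same page, but one of the five links is nofollow.
mixed_links = five_links[:4] + [("example.com/page-5", True)]
print(link_shares(1.0, mixed_links))
```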
The importance of backlinks
Backlinks are a cornerstone of how search engines understand the importance of a page. There have been many studies and tests performed to identify the correlation between backlinks and rankings.
Research into backlinks by Moz found that, across the results for the top 50 Google search queries (~15,000 search results), 99.2% had at least one external backlink. On top of this, SEOs consistently rate backlinks as one of the most important ranking factors in surveys.