In this next section, we’re going to look into content duplication, which is an issue that impacts the vast majority of websites. We need to understand how search engines treat duplicate content and how we can provide them with signals to ensure that the main canonical (preferred) version of a page is indexed.
What is duplicate content?
Duplicate content occurs when the same, or appreciably similar, content appears at more than one unique location (URL). Duplication comes in varying degrees: two pages can be exact duplicates, where all of the content is the same, or partial duplicates, where only some of it is.
Why is content duplication an issue for SEO?
If a website features exact duplicates, search engines will commonly crawl each of them but will typically choose only one version to index.
If a website features partially duplicated content, all of these pages may be indexed by search engines and end up competing for the same search queries. Duplication is also problematic for inbound links: people may link to different duplicate versions rather than to one main page, spreading link equity across the set instead of consolidating it.
How do search engines treat duplicate content pages?
In general, search engines treat duplicate content as a natural part of the web. Google doesn’t penalise websites for duplicate content unless the duplication appears deliberately deceptive or a site consists almost entirely of copied content. However, it is the job of SEOs and webmasters to provide signals to search engines that indicate the preferred page in a duplicate set.
Duplicate pages are confusing for search engines: they don’t know which version of the content to index and rank, and because crawlers cannot tell where to direct link equity, that equity ends up diluted across every page in the set, and across the site overall.
Common causes of duplication
Duplicate content can arise within one website and across different websites in a number of ways. Some of the most common causes of content duplication include:
- Session IDs in URLs, where a system falls back to appending a session ID to each URL instead of storing session data in a cookie. Because every session ID is unique, each internal link on the page generates a fresh duplicate URL for every visitor.
- Tracking and sorting parameters that do not change the content of the page are a major source of duplication. Parameter URLs multiply further depending on their ordering and on how filters can be selected on top of one another, creating many different URL combinations, e.g. https://www.example.com/brand?colour=red&price=0%20TO%2050&sort_by=price (see the normalisation sketch after this list).
- Printer-friendly versions of pages can also cause duplicate content, e.g. URLs with a /print/ variant.
- When both “www” and “non-www” versions of your website are accessible and indexable by crawlers, this causes sitewide duplication. The same is true for “HTTP” and “HTTPS” variants.
- Scraped & syndicated content, where versions of your content are being re-published elsewhere.
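Several of these causes can be tackled at the source by normalising URLs before they are ever linked to. As a minimal sketch (in Python, using only the standard library), the function below strips session and tracking parameters, sorts whatever parameters remain so that different orderings collapse to one URL, and forces the preferred scheme and host. The parameter names and the preferred origin are assumptions for illustration; adjust them to whatever your platform actually generates.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that never change page content (assumed names; adjust to
# whatever your platform actually appends).
STRIP_PARAMS = {"sessionid", "sid", "sort_by",
                "utm_source", "utm_medium", "utm_campaign"}

PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"  # assumption: HTTPS + www is preferred

def canonicalise(url: str) -> str:
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Collapse HTTP/HTTPS and www/non-www variants onto one origin.
    if netloc.lower() in ("example.com", "www.example.com"):
        scheme, netloc = PREFERRED_SCHEME, PREFERRED_HOST
    # Drop session/tracking parameters, then sort the rest so that
    # ?colour=red&sort_by=price and ?sort_by=price&colour=red
    # collapse to a single URL.
    params = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
              if k.lower() not in STRIP_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(sorted(params)), ""))

print(canonicalise(
    "http://example.com/brand?sort_by=price&colour=red&sessionid=abc123"
))
# -> https://www.example.com/brand?colour=red
```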
How to address duplication issues
To address these causes of duplication, here are some actions you can take to signal the canonical version of a page to search engines.
- Utilise 301 (permanent) redirects to send duplicate URLs to the preferred version of the page, as shown in the first sketch after this list.
- Use a rel="canonical" link element in the <head> of duplicate pages, pointing at the preferred URL. This tells search engines which version of the content to index and to consolidate equity and ranking signals on (see the second sketch after this list).
- Ensure all internal linking is consistent by always linking to the preferred version of each URL. Setting your preferred domain in Google Search Console and minimising excessive boilerplate content also help.
- Search engines recommend against blocking crawler access to duplicated content: without being able to crawl every version, they cannot determine which one is the main page or consolidate signals to it.
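At the application level, a sitewide 301 redirect can be implemented in a request hook. Here is a minimal sketch using Flask; the preferred origin is an assumption, and in practice the same rule is often better expressed in your web server or CDN configuration.

```python
from flask import Flask, redirect, request

app = Flask(__name__)

PREFERRED_ORIGIN = "https://www.example.com"  # assumption

@app.before_request
def enforce_preferred_origin():
    # If the request arrived on a duplicate origin (plain HTTP, or the
    # non-www host), send a single permanent 301 redirect to the same
    # path and query string on the preferred origin.
    origin = f"{request.scheme}://{request.host}"
    if origin != PREFERRED_ORIGIN:
        return redirect(request.url.replace(origin, PREFERRED_ORIGIN, 1),
                        code=301)
```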
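And here is a minimal sketch of emitting the rel="canonical" link element, again using Flask with a hypothetical route and template. It declares the parameter-free URL as canonical, so filtered and sorted variants of the page all point back at one preferred URL.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Hypothetical page template: the <link rel="canonical"> in the <head>
# points every parameterised variant back at one preferred URL.
PAGE = """<!doctype html>
<html>
  <head>
    <link rel="canonical" href="{{ canonical }}">
    <title>Brand listing</title>
  </head>
  <body>...</body>
</html>"""

@app.route("/brand")
def brand_listing():
    # request.base_url is the current URL without its query string, so
    # /brand?colour=red&sort_by=price declares /brand as its canonical.
    return render_template_string(PAGE, canonical=request.base_url)
```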