Finding & Fixing Duplicate Content

By Pablo Villalpando, SF SEO Specialist

Originally published at SEMPO.org. Republished and updated here.

SEO specialists are no strangers to duplicate content.

Those two dreaded words together are aggravating not just for digital marketers, but also for Google. Some common issues can be avoided before they happen; others may occur without you even knowing.

There are generally two different types of duplicate content – intentional and unintentional.

Let’s take a look.

Intentional Duplicate Content

Intentional duplicate content may be too strong an indictment. Perhaps a better term is “I didn’t realize this would be problematic” content, or “this is what my competitors are doing”, or even “I used to do this 5 years ago”. A real-life example: a car rental company that runs a separate website for every major city and displays the same content, verbatim, on all of them.

Unintentional Duplicate Content

Unintentional duplicate content is a whole different beast.

In these instances, there is no desire to clone content. It happens unknowingly because of technical issues inherent in a content management system (CMS), or even the misconfiguration of certain settings associated with your CMS.

What are the main ways that duplicate content issues happen?

Websites exist with and without the “www” prefix

Search engines get confused when they see a website that exists both with and without the “www”. When no directives are provided, Google can have a hard time understanding which version of your site is preferred. For example, a website might exist in two different forms:

  • http://example.com/
  • http://www.example.com/
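
When one version should win, the standard fix is a site-wide 301 redirect. A minimal sketch for Apache, assuming mod_rewrite is enabled and that www is your preferred version (swap the hostnames to prefer the bare domain):

  # Send every non-www request to the www host with a permanent redirect
  RewriteEngine On
  RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
  RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]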

Development / staging sites

Development sites also have the potential to create duplicate content – for example, when the site also exists under a URL like http://dev.example.com. These sites must carry the noindex meta tag in the head, as shown below.
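
A minimal sketch of that tag, placed in the head of every page on the staging host (dev.example.com here is just the hypothetical staging URL from the example above):

  <head>
    <!-- Tell search engines not to index this staging page or follow its links -->
    <meta name="robots" content="noindex, nofollow">
  </head>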

URL Parameters

Faceted search and other URL parameters can create thousands upon thousands of URL combinations. Some are CMS-related, like Magento’s parameters that filter and sort products; others, like affiliate URL parameters, can duplicate pages exponentially. Examples:

  • http://example.com/shop/?dir=asc&order=price/
  • http://example.com/shop/?dir=desc&order=price/

Pagination

Pagination is especially prevalent on eCommerce sites with a large volume of SKUs. The problem comes when the content on page one and every subsequent page stays relatively similar, diluting the authority of the product category. SEO for eCommerce relies heavily on the proper use of rel=canonical and of rel=prev and rel=next (see the sketches in the cleanup section below). Examples:

  • http://example.com/blog/page/2
  • http://example.com/blog/page/123

Archive Pages

Category pages, tag pages, calendar pages, author pages – anything that “archives” website content – all have the potential to create duplicated content, even more so if the content management system displays posts in their entirety instead of an excerpt that links to the full content. Examples:

  • http://example.com/blog/tag/buying/
  • http://example.com/blog/author/admin/

Why is Unintentional Duplicate Content Such a Big Deal?

To satisfy searchers, search engines want to display the most relevant and trustworthy version of a page – and the version they pick may not be the one that serves your interests, or the searcher’s.

In cases of duplicate content, search engines lack clarity as to which version is the desired URL, and they can ultimately index the incorrect one, multiple ones, or all of them.

This devalues the website for four major reasons:

  1. Search engines don’t know which version to index, which to direct link value to, or which to display to the searcher. This leads to authority dilution.
  2. Googlebot has a certain amount of resources, or crawl budget, allocated to your website. Google will tell you (in the GSC Crawl Stats report) how many pages it crawls on a daily basis, based on the value of your content, demand from searchers, and other factors. If Google sees minimal value in your pages, it may decide to lower your crawl budget.
  3. Inbound links are one of the most important ranking signals. With duplicate content, link value often gets distributed across the duplicated URLs, diminishing the authority that should flow to one single page.
  4. If your site generates three additional variations of every page you have, you’re essentially competing against yourself three times over before your page even gains traction in search and begins competing with other pages.

Google offers a great deal of duplicate content support documentation for these same reasons.

Google is the Big Brother

Website pages can be relegated to Google’s supplemental index. If Google notices a high percentage of duplicate content on your site, it might start to consider the site untrustworthy, visit less frequently (reduced crawl budget), and move your pages to the supplemental index. You will know this is happening when you run a site: search (e.g. site:example.com) and receive this message at the bottom of the SERPs:

In order to show you the most relevant results, we have omitted some entries very similar to the # already displayed. If you like, you can repeat the search with the omitted results included.

Keep Your Site One Step Ahead

SEOs have tools that enable them to stay one step ahead of duplicate content issues and, if you do get behind, to identify what needs to be addressed. Some tools are available for free, while more sophisticated ones are offered for a fee or on a subscription basis.

Some free duplicate content checkers include:

  • Siteliner
  • Copyscape
  • Small SEO Tools Plagiarism Checker
  • Screaming Frog

Google also offers free SEO tools of its own:

  • Google Search Console: HTML Improvements
  • Google Search Console: URL Parameters
  • Google Search Operators
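
As an illustration of the last item, two operator queries that help surface duplicates (the domain and phrases are hypothetical placeholders):

  site:example.com intitle:"buying guide"
  site:example.com "an exact sentence copied from one of your pages"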

How to Clean Up a Duplicate Content Disaster

Search engines need to be told the original source of the content. They may do an excellent job handling URL parameters, but even a 1% margin of error can still mean several thousand pages of duplicated content. You can offer better control and direction by canonicalizing your pages and setting up robots.txt directives.

Rel=canonical

Canonicalizing a page tells search engines that it uses duplicate or near-duplicate content from another page or website, and establishes which URL is the primary one.
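
A minimal sketch, reusing the sorted shop URLs from the parameters example earlier: each filtered variation declares the clean category URL as canonical in its head.

  <head>
    <!-- Both ?dir=asc and ?dir=desc variations consolidate to the clean URL -->
    <link rel="canonical" href="http://example.com/shop/">
  </head>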

301 Redirects

Establish 301 (permanent) redirects to send users and search engines from the duplicate page to the one containing the original content.
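
On Apache, a one-line sketch using mod_alias (both paths are hypothetical placeholders):

  # Permanently redirect the duplicate path to the original page
  Redirect 301 /duplicate-page/ http://www.example.com/original-page/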

No-Index

Using noindex meta tags tells search engines not to index a page, so it does not appear in the search engine results – even if there are links to that page from within the website or from external sites.
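
The tag itself is the same one shown in the staging-site example earlier. For non-HTML resources such as PDFs, the same directive can be sent as an HTTP response header instead; an Apache sketch, assuming mod_headers is enabled:

  <FilesMatch "\.pdf$">
    # Serve the noindex directive as an HTTP header for PDF files
    Header set X-Robots-Tag "noindex"
  </FilesMatch>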

Robots.txt

The robots.txt file can be used to block crawling of entire directories, as opposed to just individual pages. Note that if the pages have already been indexed, blocking them in robots.txt will not remove them from the index.
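
A short sketch, reusing the tag archives and the Magento-style sort parameter from the earlier examples (Googlebot honors the * wildcard, though not every crawler does):

  User-agent: *
  # Keep crawlers out of tag archives and sorted parameter URLs
  Disallow: /blog/tag/
  Disallow: /*?dir=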

Hreflang

Hreflang is a signal that tells search engines certain pages have alternate URLs in different languages or targeting different regions. This can help avoid duplicate content if you have a multilingual or multi-regional site with similar content across versions.
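
A minimal sketch for a page that exists in US English and in Spanish (the /es/ path is a hypothetical placeholder); each version carries the full set of alternates in its head:

  <link rel="alternate" hreflang="en-us" href="http://example.com/">
  <link rel="alternate" hreflang="es" href="http://example.com/es/">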

Rel Prev/Next

Rel=prev and rel=next tags are used on paginated pages and tell Google that the pages belong to a sequence, identifying how that sequence flows.
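
A sketch for page 2 of the paginated blog from the earlier examples, placed in that page’s head (assuming /blog/ serves as page one):

  <link rel="prev" href="http://example.com/blog/">
  <link rel="next" href="http://example.com/blog/page/3">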

Categorize URL Parameters

GSC’s URL Parameters tool lets you tell Google how to handle pages created by parameters in URLs. It is primarily used for parameters that display the same content, just reordered.

Preferred Domain

GSC’s Preferred Domain setting is where you choose either the www or non-www version of your site for the http or https property. It helps Googlebot determine which URLs take priority in cases of duplicate content.

Fix it Without Creating Duplicate Content Angst

If your brand creates a compelling ocean of great content, the last thing you want is an undertow of duplicates building up, with devastating repercussions.

Address duplicate content issues in a strategic and orderly way. Perform an SEO technical and content audit, and be diligent about how you go about it. Get familiar with the tools listed above, your site, and the content management system you use. Cross-reference the results that they pull in to ensure accuracy.

If you get confused or need additional input and advice, seek clarity from others in the SEO community and from web development professionals with the appropriate expertise.

It’s very likely you’ll experience duplicate content issues at some point – and it’s almost guaranteed if you manage an eCommerce shop. Once you learn how to avoid these issues, you will minimize the negative results they bring, and you can then start focusing on beating your competitors.