Pages disappearing from search results. Wasted crawl budgets. Falling rankings and substandard user experiences. The problems caused by duplicate content are a big deal for SEO marketers.
While the advice is simple—don’t reuse text across webpages—the reality of avoiding duplicate content is a little more complicated.
What Is Duplicate Content?
According to Google’s Webmaster definition, “Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.”
The most obvious place to find duplicate content is on-page; however, repeated titles and meta descriptions also count as duplicate content and can be harder for search marketers to spot and fix without a duplicate content checker.
Between 25 and 30% of the web’s content falls into the duplicated category, says ex-Google engineer Matt Cutts. It’s easy to see how this happens: generic product descriptions, boilerplate text, or brand messages are often repeated across domains and pages, without malicious intent. Search engines understand that duplicate content happens; that’s why, despite claims to the contrary, duplicate content doesn’t incur a penalty from Google.
Something that does incur a Google penalty is copied content. Copied content happens when spammers scrape content from an original source and put it on their own site. Like duplicate content, copied content results in two web pages with identical chunks of content; unlike duplicate content, however, copied content happens intentionally, fails to add value to the reader, and often involves a sub-quality website.
Search engines view scraped content seriously and can hit the scraping site with a penalty. It’s good practice to let crawlers know that your site’s content is not scraped from other sources. We’ll get into that below.
Does Duplicate Content Matter for SEO?
If duplicate content doesn’t cause Google penalties, can you happily leave it to run wild on your site? No. Duplicate content can still have a negative effect on your page rankings and organic traffic, without any actual penalty hitting your site.
First, search engines avoid returning duplicate entries on their results pages. This makes sense for searchers; after all, a results page with 10 identical results hosted on different pages is less helpful than a page with 10 varied, original results.
Search engines have to decide which version of duplicate content is most relevant. To do this, they consider domain authority and which page appears to be the original, most authoritative source of the content. Crawlers then filter out duplicates from results pages:
- If you’re featuring content that also appears on a more authoritative site, your URL will be filtered out of results pages in favor of the site with higher authority.
- If you have duplicate content across several pages of your website, the majority of these pages will be filtered out of the search engine results pages (SERPs). Overall site visibility will suffer.
Second, duplicate content pages can dilute link equity and page authority. If your site hosts two different URLs with identical content, sites linking to your content will have to choose between the two versions. This spreads inbound links thinner than necessary, negatively affecting ranking signals for the pages in question.
How to Find Duplicate Content Issues
Duplicate content is often visible to the naked eye, but sometimes it’s hidden in the code of a website. That’s why it’s best to use software to check for duplicate content.
On-site Duplicate Content
Alexa’s SEO Audit tool contains a duplicate content checker that finds different URLs with the same content and advises how to fix them. The tool also surfaces general duplicate content SEO tips.
The Site Audit tool identifies duplicate content across meta descriptions and titles, producing an exportable list of URLs to make finding and fixing the problem easier.
Fixing these technical errors will help you improve meta-tag SEO, which results in higher click-through rates from search engine results pages (SERPs).
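As a rough illustration of what such a check does, here is a minimal Python sketch that groups URLs sharing the same title or meta description. It is not how any particular audit tool works; the `pages` dictionary stands in for hypothetical crawl output, and `find_duplicates` is a name made up for this example.

```python
from collections import defaultdict

def find_duplicates(pages, field):
    """Group URLs whose `field` (e.g. "title") is identical after normalization."""
    groups = defaultdict(list)
    for url, meta in pages.items():
        # Collapse whitespace and case so trivial differences do not hide duplicates.
        key = " ".join(meta.get(field, "").lower().split())
        if key:
            groups[key].append(url)
    # Keep only values that are shared by more than one URL.
    return {key: sorted(urls) for key, urls in groups.items() if len(urls) > 1}

# Hypothetical crawl output: URL -> extracted metadata.
pages = {
    "https://example.com/red-shoes": {"title": "Shoes | Example", "description": "Red shoes."},
    "https://example.com/blue-shoes": {"title": "Shoes | Example", "description": "Blue shoes."},
    "https://example.com/about": {"title": "About Us | Example", "description": "Who we are."},
}
duplicate_titles = find_duplicates(pages, "title")
```

Here `duplicate_titles` lists the two shoe pages under one shared title, which is the kind of URL list you would then work through and rewrite.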
Off-site Duplicate Content
Off-site duplicate content—identical content that exists on different websites—can be harder to spot.
To make sure you aren’t posting content that already exists on another site, run your blog content through a plagiarism tool before publishing. This is particularly important if you’re working with outsourced writers or new team members who may be unaware of the importance of original content.
You can also use a plagiarism tool to see if other sites are copying your content. Paid tools like Copyscape scan the web to find instances of content copied from your site. This type of off-site duplicate content is harder to fix, although you can try contacting the manager of the site and asking them to remove it. If that doesn’t work, read on for another way to deal with copied content.
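To give a sense of how "appreciably similar" text can be measured, the following Python sketch compares two passages by the overlap of their three-word sequences (word shingles, scored with Jaccard similarity). This is one common approach chosen for illustration, not the method used by Copyscape or by any search engine, and the function names are invented here.

```python
def shingles(text, size=3):
    """Return the set of overlapping `size`-word sequences in `text`."""
    words = text.lower().split()
    # Texts shorter than `size` words still yield a single shingle.
    return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def similarity(a, b):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

original = "our handmade leather boots are built to last a lifetime"
copied = "our handmade leather boots are built to last for years"
score = similarity(original, copied)
```

A score near 1.0 suggests copied text, while a score near 0.0 suggests unrelated text. Any threshold you pick for flagging a page is a judgment call.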
8 Common Duplicate Content Issues and How to Fix Them
There isn’t a one-size-fits-all solution to duplicate content. But there are common fixes that help tackle the most common problems and their consequences:
1: Printer-Friendly Versions of Pages
Solution: Using a canonical tag will prevent printer-friendly and mobile page versions from becoming duplicate content issues. The canonical tag sets the main version of a page, and sends all ranking signals to that main version.
To set up a rel=canonical URL, place a <link rel="canonical"> element in the <head> section of each duplicate version of the page, setting its href to the URL on your site that holds the original piece of content.
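For sites that build pages from templates, the tag can be generated rather than typed by hand. The Python sketch below shows the shape of the tag; `canonical_tag` is a hypothetical helper and the URL is a placeholder.

```python
from html import escape

def canonical_tag(url):
    """Build the <link> element that belongs in the page's <head>."""
    # Escape the URL so characters like & stay valid inside the attribute.
    return f'<link rel="canonical" href="{escape(url, quote=True)}" />'

# The printer-friendly page would carry a tag pointing at the main version.
tag = canonical_tag("https://www.example.com/article")
```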
2: http/https or Subdomain issues
Changing over from HTTP to HTTPS should have a positive effect on your site’s rankings because Google sees HTTPS as a positive ranking factor. But the changeover can sometimes cause duplicate content issues because crawlers see two identical versions of your site.
The same thing arises with versions of the same site with and without the www. prefix. Bots have to choose between versions of the site, using up crawl budget and needlessly splitting link equity.
Solution: Setting a preferred domain in your site’s Search Console lets crawlers know which version of your domain they should focus on. To set a preferred domain, go to the Site Settings in Search Console, and select the option you want in the Preferred Domain section.
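Note: This option existed only in the old version of Search Console, and Google has since retired it. Canonical tags and 301 redirects, both covered in this article, remain available for signaling your preferred version.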
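However the preference is declared, the underlying idea is that every protocol and www variant maps to a single version of the URL. The Python sketch below illustrates that mapping only; it is not a Search Console feature, and the host names and constants are assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"                  # hypothetical preferred domain
VARIANT_HOSTS = {"example.com", "www.example.com"}  # hosts serving the same site

def preferred_url(url):
    """Map any http/https or www/non-www variant to the preferred version."""
    parts = urlsplit(url)
    if parts.hostname not in VARIANT_HOSTS:
        return url  # Not one of our variants, so leave it untouched.
    return urlunsplit(
        (PREFERRED_SCHEME, PREFERRED_HOST, parts.path or "/", parts.query, parts.fragment)
    )
```

A server would typically answer requests for the non-preferred variants with a 301 redirect to the URL this function returns.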
3: UTM Parameters and Session IDs
Using parameters to track information and session IDs is a great idea for accurate web marketing metrics. But search engines interpret each version as a different URL with duplicate content. Once again, the multiple versions will confuse crawlers and dilute ranking factors.
Solution: The rel=canonical tag allows you to set your preferred version of the URL. It tells bots which URL to treat as the main one, so that version receives the SEO benefits brought about by backlinks and site visits.
Note: the rel=canonical tag should only be used if the content is the same on each page.
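One way to work out which URL the canonical tag should point to is to strip the tracking parameters from the requested URL. The Python sketch below assumes that any parameter starting with utm_ is for tracking and that session IDs use one of the names in `TRACKING_PARAMS`; both are assumptions to adapt to your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust to whatever your platform uses.
TRACKING_PARAMS = {"sessionid", "sid", "phpsessid"}

def strip_tracking(url):
    """Return the URL without UTM parameters, session IDs, or fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_") and key.lower() not in TRACKING_PARAMS
    ]
    # Parameters that change the content (e.g. color=red) are preserved.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

clean = strip_tracking("https://example.com/shoes?utm_source=news&color=red&sid=abc")
```

In this example, `clean` keeps only the color parameter, so every tracked variant of the page resolves to the same canonical URL.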
4: Pagination
Search engines can fail to recognize paginated pages and interpret them as duplicated content. Several types of pagination lead to duplicate content: for example, gallery pagination, when every item in a gallery has its own page, and category pagination, when product listings span several pages. Whatever the technicalities of the problem, they can all result in duplicate content issues.
Solution: Pagination problems are often solved by using the rel="prev" and rel="next" tags. These tell crawlers the exact relationship between the component URLs of a pagination series.
Note: In March 2019, Google announced that it had retired these tags, suggesting that users love single-page content. Paginated content can still include the rel="prev" and rel="next" tags.
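If you do keep these tags, each page in a series points to its neighbors. The Python sketch below generates them for one page; the `?page=N` URL pattern and the function name are assumptions made for illustration.

```python
def pagination_links(base_url, page, total_pages):
    """Return the rel="prev"/rel="next" tags for one page of a series."""
    def page_url(n):
        # Assumed URL scheme: page 1 is the base URL, later pages use ?page=N.
        return base_url if n == 1 else f"{base_url}?page={n}"

    tags = []
    if page > 1:
        tags.append(f'<link rel="prev" href="{page_url(page - 1)}" />')
    if page < total_pages:
        tags.append(f'<link rel="next" href="{page_url(page + 1)}" />')
    return tags

middle_page_tags = pagination_links("https://example.com/shoes", 2, 3)
```

The first page gets only a next tag, the last page only a prev tag, and a single-page series gets neither.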
5: Country/Language Versions of the Same Page
Sites often have country-specific domains with the same content on each, for example www.yoursite.com and www.yoursite.com.au, serving the US and Australia, respectively. It’s possible that almost all content on these sites will be duplicated, but webmasters still need to make sure that both appear in SERPs.
Solution: There are two options to help guarantee each domain’s visibility: top-level domains and the hreflang tag.
- Top-level domains appear at the end of a domain name and include familiar forms such as .com, .org, .edu, .net, .gov, as well as country-level domains. Google recommends using these top-level structures to send clear signals that content is serving different geographies. That means http://www.example.de is easier to understand from the perspective of a search engine than http://de.example.com, which is not a top-level format.
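- The hreflang tag helps bots show users the correct version of a site for their location. Adding the following code to the <head> section of your site will show users in Spain the version of your domain intended for them, for example (the value en-es marks English-language content for users in Spain; Spanish-language content for Spain would use es-es):
<link rel="alternate" href="http://example.com" hreflang="en-es" />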
Crawlers won’t identify translated versions of a site as duplicate content, thanks to the hreflang.
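Each regional version normally lists every alternate, including itself, so the tags are easiest to generate from one shared mapping. The Python sketch below assumes a hypothetical `versions` dictionary of hreflang codes to URLs, using the US and Australian domains from the example above.

```python
def hreflang_tags(versions):
    """Build one <link rel="alternate"> tag per language/region version."""
    return [
        f'<link rel="alternate" href="{url}" hreflang="{code}" />'
        for code, url in sorted(versions.items())
    ]

# Hypothetical mapping of hreflang codes to the matching regional URLs.
versions = {
    "en-us": "https://www.yoursite.com/",
    "en-au": "https://www.yoursite.com.au/",
}
tags = hreflang_tags(versions)
```

The same list of tags would go in the <head> of both the US and the Australian page.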
6: Copied Content
Spammy sites stealing your content is a reality of life. Unfortunately, this kind of activity can negatively affect the original site. That’s why you need to act against copied content and protect your site’s authority.
Solution: First, try getting in touch with the offending site and asking them to remove the content. If they don’t, you can report the copyright infringement to Google.
7: Syndicated Content
Sharing your content with high-ranking partner sites can be an awesome way to drive referral traffic and get valuable backlinks. But if you take this route, you need to make sure crawlers understand this is not duplicate content. Failure to do so might cause the site you share to appear in SERPs and your own site to be filtered out, even though you made the content.
Solution: Before you agree to let a blog syndicate your content, ask them to include a rel=canonical tag pointing to your original URL in the <head> element of each page featuring your content. This is part of effective SEO content planning.
8: Boilerplate Content
Boilerplate content is text repeated across domains, but non-maliciously. For example, you’ll often see boilerplate content on ecommerce domains when suppliers provide standard text to be used when selling their products. Retailers then reuse this text to save time; the downside is that crawlers treat it as duplicate content.
Solution: Ecommerce retailers should rewrite product descriptions when possible. This requires a lot of sweat equity, but it avoids duplicate content and improves ecommerce SEO. If you have boilerplate content on your blog or other SEO content, try to make sure that the pages containing the boilerplate content also have enough additional content to differentiate them for both users and search engines.
Best Practices to Prevent Duplicate Content
Discourage sites from stealing your content and mitigate the impact of duplicate content on site rankings by following these preventive measures:
- Stop spammy scraper sites from taking credit for your content by using a self-referential rel=canonical link on your site’s pages. This chunk of code in the original page’s <head> section points to itself as the canonical reference for a page. If any sites copy the URL’s content, search engines can identify your page as the ultimate source of truth.
- Link to the canonical versions of your site’s URLs at all times. For example, if you have a page with both a mobile and a desktop version, choose which is canonical, and then point all internal links to that page only. If you build external links to that URL, make sure all go to the canonical link as well. This will send clear signals to crawlers about which link you want to appear in SERPs.
- Use a 301 redirect where appropriate to minimize duplicate content by consolidating similar pages into one powerful page. You may have built up several similar landing pages over time, all of which contain similar information and are trying to rank for the same keyword. A 301 redirect will prevent these pages from competing, and send stronger ranking signals to the preferred page.
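The 301 consolidation in the last point comes down to a lookup from retired URLs to the page you want to keep. In practice this lives in your web server or CMS configuration; the Python sketch below only illustrates the logic, with hypothetical paths and a made-up `resolve` function.

```python
# Hypothetical map of retired landing pages to the single page that replaces them.
REDIRECTS = {
    "/landing-old": "/landing",
    "/landing-2019": "/landing",
}

def resolve(path):
    """Return the (status code, location) a server would answer with."""
    target = REDIRECTS.get(path)
    if target is not None:
        return 301, target  # Permanent redirect passes signals to the preferred page.
    return 200, path        # No redirect configured, so serve the page as requested.
```

Pointing every old path directly at the final page, rather than chaining redirects, keeps the signal to crawlers as clear as possible.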
Protecting your site from duplicate content is best practice. However, duplicate content issues can still arise.
After investing sweat equity in keyword research, content strategy, and marketing plans, you don’t want to lose out to competitors because of avoidable duplicate content issues. Monitoring and fixing issues like these should be part of ongoing SEO hygiene.