Shaun Oakes

Duplicate Content and the Hot Sibling Conundrum

by Shaun Oakes

2008/11/18

Image via Flickr, by Zabowski under CC

Tracy Martin was probably one of the most beautiful girls in my Matric (senior) year. Tall, slender and very athletic, she was the type of girl that most guys would gladly give up an eye - or at least some form of non-essential body organ - for a chance to be with. It was strange then, that she was largely ignored by the cooler kids and instead spent her days socialising with the Computer Club members and the hairy girl with the weird hump on her back.

You see, Tracy had an equally gorgeous twin sister called Tamsyn, who was massively popular and loved by all and sundry. With Tamsyn around, Tracy never really got the credit her looks and personality may have deserved, as she was seen as nothing more than an imitation of her more illustrious sibling.

So why am I telling you this?
Well, partly because I am bitter about having to settle for snogging the hairy girl with the weird hump on her back. More importantly, however, it also serves as an accurate analogy for the duplicate content issue in the search engines.
Let's take a quick look at some of the key points and common questions regarding duplicate content:

Er... What is Duplicate Content?

We will start things off by explaining what the term "duplicate content" actually means. Google Webmaster Central Blog describes it as follows - "Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar." So there you go, done and dusted.

Now if you have one or two paragraphs of text pulled from another source (e.g. you're showing an extract) but the majority of your page content is unique, this won't be considered duplicate content. If it were, most news portals and personal bloggers would be screwed, so rest easy on that one. Coders can also relax - if your code is identical on every page, this has no bearing either and will not be considered duplicate content - it's the actual content that users can see which will be evaluated. In terms of the acceptable duplicate / unique content ratio on a page, there is no real magic number. As a rule of thumb then, just make sure that the majority of your page content is unique and original and you should be fine.
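To get a rough feel for how "appreciably similar" two pieces of copy are, you could compare them yourself. The sketch below is only an illustration and not how any search engine actually measures it: it assumes a simple Jaccard score over three-word "shingles", and the function names shingles and similarity are made up for this example.

```python
import re

def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(a, b, size=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

if __name__ == "__main__":
    original = "the remora fish attaches itself to sharks"
    extract = "the remora fish attaches itself to whales"
    print(round(similarity(original, extract), 2))
```

A score near 1.0 means the visible copy is almost identical, and a score near 0.0 means it is mostly unique. Since there is no magic number, treat the result as a hint rather than a verdict.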

When and Where Does it Occur?

Duplicate content issues can occur on your own website and are usually accidental, especially when you're running a CMS on your blog that generates more than one URL for the same post. Basically this means that you could have a post listed under www.mysite.com/category/widgets/ as well as www.mysite.com/2008/10/. It may also occur when you're using syndicated content or allowing something you have written to be syndicated. For example, the article you wrote about the mating habits of the Remora fish could get picked up and appear on other aquatic-related websites. It can also occur when other sites "scrape" your content, pulling what you've written and republishing it on their sites without crediting you.
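If you suspect your CMS is serving the same post under several URLs, a quick check is to group your pages by a fingerprint of their visible text. This is a minimal sketch under stated assumptions: the pages dictionary (URL mapped to page text) is hypothetical sample data you would have to collect yourself, find_duplicates is an illustrative name, and only exact matches (ignoring case and whitespace) are caught.

```python
import hashlib
from collections import defaultdict

def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def find_duplicates(pages):
    """Group URLs whose normalised text is identical.

    pages maps each URL to its visible text. Returns a list of URL groups,
    each containing two or more URLs that share the same copy.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(normalise(body).encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]

if __name__ == "__main__":
    # Hypothetical sample data: the same post reachable under two URLs.
    pages = {
        "www.mysite.com/category/widgets/": "All about widgets.",
        "www.mysite.com/2008/10/": "All  about   WIDGETS.",
        "www.mysite.com/about/": "Who we are.",
    }
    print(find_duplicates(pages))
```

Any group this returns is a candidate for consolidating onto a single URL.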

Okay, So What will Happen to My Site if I Have Duplicate Content?
Here's the thing: I've heard many people say that your site will be penalised and removed from the search engines if it's found to have duplicate content. This is not really true though. If a search engine spider finds two pages on your site with identical copy, it will just choose the one it feels is the most relevant. The problem with duplicate content on your site, then, is that the spiders will often choose the "wrong" page to display, for example the printer-friendly version of a page. This isn't really what you want appearing on the SERPs, is it? No, of course not.

Having two or more pages with duplicate content also dilutes the "search engine juice" your site may carry for that content. For instance, having three pages with the same Remora fish article on your site, each with five different internal links pointing to it, doesn't make much sense. It would be smarter to have one page on the Remora fish, with all 15 internal links pointing to that specific URL. Make sense?

If you have permission from a blog or other web resource to use their content and are worried about getting penalised for any duplicate content issues, the best thing to do would be to add the robots NoIndex meta tag to your page (<meta name="robots" content="noindex, follow">). This tells the spiders not to index the page while still following the links on it. Since the page never enters the index, it can't compete with the original as a duplicate, and it exists solely for the web user's benefit.
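Once the tag is in place, it's worth confirming that it actually appears in the HTML your pages serve. Here is a small sketch using Python's standard html.parser module. The names RobotsMetaParser and is_noindexed are invented for this example, and it only inspects the robots meta tag, not HTTP headers or robots.txt.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

def is_noindexed(html):
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(is_noindexed(page))
```

Running a check like this over your syndicated pages is a cheap way to catch a template that forgot the tag.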

Okay, But What Happens if Someone Else is Duplicating my Content?

If you're allowing your content to be syndicated, always insist that the republishing site links back to your article. This tells the search engine spiders that your work is the original source and that theirs is the duplicate. As your website becomes more popular, the chances of someone else stealing or copying your content without crediting your work become greater. Most of the time, the pages these dodgy "scrapers" publish will immediately be regarded as duplicate content and will either be banished to the supplemental results or won’t appear on the search engine results pages (SERPs) at all.

If, for whatever reason, they are being indexed on the SERPs and are actually ranking better than you (because they're either a massive or popular site themselves, or your site is relatively new) you have a few options. You can either send someone over to break their legs - which can be mildly effective, although they would still be able to use a computer - or you can file a website infringement notification which will lead to an investigation and will likely see their site removed from the SERPs if your complaint is found to be valid.

Great, Can You Sum This All Up Then?
Going back to my analogy, Tracy Martin represented the original source, Tamsyn Martin served as duplicate content and all the cool kids from high school represented the search engine spiders (the hairy girl with the humped back is just something that keeps me up on cold, lonely nights). To summarise then, if you're duplicating your own content on your site, you will not be penalised, but it will put your site at a disadvantage on the SERPs as it decreases the chances of your pages ranking well. If you are duplicating content from other sources, play it safe and "NoIndex" these pages so that you won't fall foul of Google.

For more info, SEOmoz also has a lovely graphic illustrating the process – check it out here.

Comments

Well first things first.

Where are Tracy and Tamsyn?

Second thing, and it's one that not a lot of blog owners think about, is that if you are running WordPress, you need to know some things. WordPress inherently duplicates content, and there is a solution:
http://www.freakitude.com/2007/06/16/wordpress-duplicate-content-issues-solutions/

Posted by Smith on 2008/11/18

Hi Smith,

The twins are now but a distant high school memory.

Bang on the money about Wordpress, which is what I had in mind when mentioning the blog issue.

The http://www.seologs.com/wordpress-duplicate-content-cure/ looks to be our answer.

Posted by Shaun on 2008/11/18

...which, as fate would have it, turns out to be a broken link :(

Posted by Shaun on 2008/11/18

Just a side note, the URL you supplied in your comment Shaun refers to a page that does not exist.

We run a WordPress blog, and I found the best solution was to simply customise your robots.txt, as is mentioned in the post relating to dup content problems.

Most people don't realise that making use of the custom permalinks option in WordPress does not replace those initial "?p=" IDs for posts, as they are still accessible.

Posted by Brett Pringle on 2008/11/18

Yeah, I picked that up 2 seconds after hitting the "Submit" button.

Well spotted, Brett.

Posted by Shaun on 2008/11/18
