Darren Ravens

The AOL Data blunder, the SEO goldmine and the case for tin-foil hats

by Darren Ravens

2006/08/17

When AOL released what it thought was anonymised Internet search data a little while ago, its heart was in the right place. It wanted to provide a resource for information retrieval researchers: a dataset of actual search queries that could be analysed for user trends and used to infer all kinds of things about searchers’ behaviour.

In fact, Google had earlier announced that they’d “processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.” They would be providing all of this on six DVDs, available to order.

AOL’s data, containing the search queries of 650 000 or so of its users, is a time-stamped record of about 20 million actual searches, and it was made publicly available for download. Of course, by the time AOL realised this was a mistake, the data was all over the net, being fiddled with by all kinds of people.

A whole host of sites have sprung up around it:

http://www.aolsearchdatabase.com

http://www.dontdelete.com/

http://www.aolpsycho.com/

http://data.aolsearchlogs.com/search/index.cgi

So once the privacy violation hoohah has subsided (more on the privacy concerns later on), and the novelty of trying to identify which users have embarrassing personal problems or a dodgy personal life has worn off (like AOL user ID 2797847, who appears to have searched for, among other things, “ordering street drugs online”, “herpes”, “3rd degree felony expungement ohio” and “canadian call girls”), we in the search industry can start really looking at the value this kind of data has for SEO.

An SEO goldmine

What makes this data so special is that it represents actual searches, all time-stamped and including clickthrough details. Not only can we see what terms users are searching for, we can analyse how they’re searching and what they’re doing with the displayed results.

It provides a means to analyse the way users search, giving an insight into the mindset of the searcher.

A considerable percentage of searchers actually type a question along the lines of what, where, why, who or how. That confirms the belief that many users turn to the search engines to find answers or solutions to their problems. It also suggests that when it comes to selecting keyphrases to optimise for, it wouldn’t hurt to create landing pages that ask the questions they are intended to answer.
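If you want to check this on your own copy of the data, here is a minimal Python sketch. It assumes that a “question” is any query whose first word is what, where, why, who or how, which is my simplification rather than anyone’s official definition. The function names and the sample queries are made up for illustration.

```python
# Assumption: a query counts as a "question" if it starts with one of these words.
QUESTION_WORDS = {"what", "where", "why", "who", "how"}


def is_question(query: str) -> bool:
    """Return True if the first word of the query is a question word."""
    words = query.lower().split()
    return bool(words) and words[0] in QUESTION_WORDS


def question_share(queries) -> float:
    """Return the fraction of queries that start with a question word."""
    queries = list(queries)
    if not queries:
        return 0.0
    return sum(is_question(q) for q in queries) / len(queries)


# Made-up sample queries, not taken from the AOL data.
sample = [
    "how to remove coffee stains",
    "www.google.com",
    "Where is ohio",
    "cheap flights",
]
print(f"{question_share(sample):.0%} of the sample queries are questions")
```

A first-word check will miss questions phrased differently (“can I ...”, “is it ...”), so treat the result as a lower bound.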

Another cool way to use the data would be to estimate actual traffic across the various search engines, based on each engine’s market share and the click-through rates of queries in the AOL data, as some have already done.

http://seoblackhat.com/clicks-by-search-rank.html

http://www.bad-neighborhood.com/suggest.php
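As a rough idea of how such an estimate might work, here is a back-of-the-envelope sketch in Python. It is not the method used by the sites above. The function name and every number in the example call are placeholders I made up, so plug in your own market share and click-through figures.

```python
def estimate_clicks(aol_searches: int, aol_share: float,
                    engine_share: float, ctr: float) -> float:
    """Scale an AOL query count to another engine and apply a click-through rate.

    aol_searches: how often the keyphrase was searched on AOL
    aol_share:    AOL's share of all searches (0-1), an assumed figure
    engine_share: the target engine's share of all searches (0-1), an assumed figure
    ctr:          click-through rate for the ranking position you hold (0-1)
    """
    if not 0 < aol_share <= 1:
        raise ValueError("aol_share must be greater than 0 and at most 1")
    total_searches = aol_searches / aol_share        # estimated searches on all engines
    engine_searches = total_searches * engine_share  # estimated searches on one engine
    return engine_searches * ctr                     # estimated clicks at that position


# Placeholder inputs only: 100 AOL searches, AOL at 5% of the market,
# target engine at 50%, and a 40% click-through rate.
clicks = estimate_clicks(aol_searches=100, aol_share=0.05, engine_share=0.5, ctr=0.40)
print(f"Estimated clicks: {clicks:.0f}")
```

The sketch assumes searchers behave the same way on every engine, which is a big assumption when the sample comes from AOL users only.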

Further analysis of the data reveals that click-through rates for the first position on the SERPs exceed 40%, the second position hovers around 12%, and things get worse from there. So now we have solid confirmation of the benefits of ranking highly in the search engines.
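For anyone who wants to reproduce this kind of breakdown, here is a minimal Python sketch. It assumes the log is tab-separated with a header row and the columns AnonID, Query, QueryTime, ItemRank and ClickURL, and that rows without a click leave ItemRank empty. It also measures each position’s share of all recorded clicks, which is only one possible definition of click-through rate. The sample rows are invented.

```python
import csv
import io
from collections import Counter


def click_share_by_rank(lines):
    """Return {rank: share of all clicks} from tab-separated log lines with a header row."""
    # QUOTE_NONE stops stray quote characters in queries from confusing the parser.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    clicks = Counter()
    for row in reader:
        rank = (row.get("ItemRank") or "").strip()
        if rank.isdigit():  # assumed: rows without a click have an empty rank
            clicks[int(rank)] += 1
    total = sum(clicks.values())
    if not total:
        return {}
    return {rank: count / total for rank, count in sorted(clicks.items())}


# Invented rows in the assumed layout, for illustration only.
sample = io.StringIO(
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\tcoffee stains\t2006-03-01 10:00:00\t1\thttp://example.com\n"
    "1\tcoffee stains\t2006-03-01 10:01:00\t2\thttp://example.org\n"
    "2\ttin foil hat\t2006-03-02 11:00:00\t\t\n"
    "3\tseo\t2006-03-03 12:00:00\t1\thttp://example.net\n"
)
for rank, share in click_share_by_rank(sample).items():
    print(f"position {rank}: {share:.1%} of clicks")
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` sample.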

We can also isolate the most frequently used terms for searches, giving some insight into the best modifier terms to use for keyword optimisation. It’s also interesting to note how many people enter URLs in the search box to find a specific website.
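Both ideas are easy to try out. The Python sketch below counts the most common words across queries and flags queries that look like URLs. The URL pattern is a crude guess of mine (it only knows `http://`, `https://`, `www.` and a handful of endings), and the function names and sample queries are made up.

```python
import re
from collections import Counter

# Assumption: a query "looks like a URL" if it is a single token that starts
# with http://, https:// or www., or ends in .com, .net or .org.
URL_LIKE = re.compile(r"^(https?://|www\.)\S+$|^\S+\.(com|net|org)$", re.IGNORECASE)


def looks_like_url(query: str) -> bool:
    """Return True if the whole query resembles a web address."""
    return bool(URL_LIKE.match(query.strip()))


def top_terms(queries, n=10):
    """Return the n most common words across all queries that are not URLs."""
    counts = Counter()
    for query in queries:
        if not looks_like_url(query):
            counts.update(query.lower().split())
    return counts.most_common(n)


# Made-up sample queries, not taken from the AOL data.
sample = ["cheap flights", "cheap hotels", "www.google.com", "google.com"]
url_queries = [q for q in sample if looks_like_url(q)]
print(f"{len(url_queries)} of {len(sample)} queries look like URLs")
print(top_terms(sample, n=3))
```

You would probably want to strip out stop words such as “the” and “of” before reading anything into the top terms.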

What you need to consider here, though, is that while AOL uses Google’s search results, the average AOL user is, to my mind, less search savvy than users of other engines. I know, I know, I’m being a search snob, but I’ll stand by my theory and welcome your opinion on the matter... these people agree with me.

Bearing that in mind, the sample set of data will be biased, and any extrapolation of results across the search spectrum will not be completely accurate. But then, until the day Google decides to release all its search data (HA! Pencil it into your diary for the day after hell freezes over), we’re all stuck having to make do with what we’ve got, which will invariably include imperfect sample sets of data, statistical manipulation, guesstimation and a healthy bit of common sense.

So there you have it. AOL’s blunder is a boon for those of us in the search industry and, provided AOL’s lawyers don’t force everyone to remove all online references to the data, it could provide a means of SEO research for some time to come.

The case for tin-foil hats

Yes, there are privacy issues. Yes, there are people on the Internet who will look to mine the data for unscrupulous purposes. And if you really want to get an idea of how scary this all could be, consider that these 20 million queries represent roughly a tenth of the daily search volume on Google, so just imagine how much personal data Google has stored in a deep dark cave somewhere. If you use a Gmail account, there’s a whole bunch of personal info there that they’ve got access to. Plus there’s Google Earth, so if they trace your IP they can practically watch you from space, psychoanalyse you based on your search queries and monitor your interests for signs of deviance. At least, that’s what the people at http://www.google-watch.org claim.

Tin-foil hat, anyone?

Comments

I think the idea of creating pages around questions is a great one!

Posted by Rob on 2006/08/17
