Saturday, January 12, 2008

Would You Like Some Data?

I have always been fascinated by data. More to the point, by consuming, analyzing and mining data for results, trends and predictions. In my opinion, using facts collected in the past to predict the future is one of the greatest tasks of human existence.

Scientists do that every day: they use past experiments to cure present and future illnesses, and theorems proved hundreds of years ago to solve problems our ancestors would never have dreamed about.

In this post, I'd like to look at 3 samples of data collection and analysis I've encountered this past week. I'll then discuss what attracted me to those samples and what they represent (to me).

1. Proprietary data my @$$
If you ask Facebook, or any other social network, for demographic data (number of members, their nationalities, their sex and age) you'll get a puzzled look. This is "proprietary data". We don't share it with users for "privacy reasons" (which doesn't prevent us from discussing those numbers with our advertisers - a trustworthy bunch, I'm sure you'd all agree). In short, you join one of those networks without knowing who your peers are.

Well, one Facebook user found a chink in the armor. Using a simple method, he managed to deduce the following (click the image to enlarge):

How did he do it? Did he hack the system? Did he use social engineering skills to get one of Facebook's employees to surrender the precious, proprietary data? None of the above.
All he did was go to the site's dating application and search for women, specifying nothing but the country (no age, no city, no other preference). Repeat for men. Repeat for every country on the list. Throw the counts into an Excel spreadsheet - and there you go: instant data mining.
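
The procedure above amounts to a nested loop over countries and sexes. Here is a minimal sketch in Python. It assumes a hypothetical `search_count(country, sex)` function that returns the number of matches the search page reports; that function is my own stand-in for the manual lookup, not a real Facebook API, and the CSV output simply replaces the spreadsheet step.

```python
import csv
import io

def tabulate(countries, search_count, sexes=("female", "male")):
    """Run one search per (country, sex) pair and collect the counts.

    `search_count` is a hypothetical callable supplied by the caller;
    it stands in for reading the result count off the search page.
    """
    table = {}
    for country in countries:
        table[country] = {sex: search_count(country, sex) for sex in sexes}
    return table

def to_csv(table, sexes=("female", "male")):
    """Render the collected counts as CSV text, one row per country."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["country", *sexes, "total"])
    for country, counts in table.items():
        row = [counts[sex] for sex in sexes]
        writer.writerow([country, *row, sum(row)])
    return buf.getvalue()
```

Any function with that signature will do, whether it reads counts you typed in by hand or fetches them some other way. The loop itself is the whole "method".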

Now, on how many sites can this be repeated, I wonder, before they force you to specify more criteria and limit your search? And even then, it just means more steps will be needed to collect the same data.

Add to that the fact that most social networks now offer one API or another (Google's OpenSocial, supported by several such networks, or Facebook's own Platform) and you can see how easily data on those sites can be mined. If you force me to take more steps to get at your data, I'll just write a small program that uses your API. A computer program is extremely efficient at iterating through tedious steps and accumulating numbers.

2. So, what browser are YOU using?
I use 2 services to tally the visitor data on this site. Both are free and unobtrusive. From time to time, I compare the data I get from both and nod my head.
Here's last week's breakdown of the browsers used by you, my dear visitors:
1. Google Analytics data


2. SiteMeter data

According to Google, 71% of you are using Firefox and about 26.5% are using IE.
SiteMeter breaks it down by version and paints a different picture: Firefox (1+2+3+Mozilla) 38.3%, IE (4+5+6) 57.5%.
Since both services usually agree on the total number of visitors (to within 5%), I find the difference in results confusing, to say the least.

A visitor's browser is determined by analyzing the User-Agent string, which your browser sends in a header of every HTTP request. It looks like this: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1), meaning IE 6 on Windows XP. So how can 2 services analyze the same number of strings and arrive at such disparate results? And which one do I trust here?
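
One possible explanation is that the same string can be labeled differently depending on which part of it a counter looks at first. The sketch below shows two made-up classifiers. Both are my own illustrations and say nothing about how Google Analytics or SiteMeter actually parse the string. The first trusts the leading "Mozilla/" token, while the second looks for more specific tokens first.

```python
def classify_by_prefix(user_agent):
    """Naive rule: trust only the leading product token."""
    if user_agent.startswith("Mozilla/"):
        return "Mozilla family"
    return "Other"

def classify_by_tokens(user_agent):
    """Check the more specific tokens before falling back to the prefix."""
    if "MSIE" in user_agent:
        return "Internet Explorer"
    if "Firefox/" in user_agent:
        return "Firefox"
    if user_agent.startswith("Mozilla/"):
        return "Mozilla family"
    return "Other"

# The example string from the post: IE 6 on Windows XP.
ie6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
print(classify_by_prefix(ie6))
print(classify_by_tokens(ie6))
```

The IE 6 string begins with "Mozilla/4.0", so a counter that stops at the prefix files it under the Mozilla family, while one that digs into the parenthesized part calls it Internet Explorer. Whether anything like this explains my numbers, I can't say.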

3. Where's the best place to live?
I stumbled upon the last sample in the latest PC Magazine. Using 3 different sites and correlating job offers from dice.com, house prices from trulia.com and domestic data from homefair.com, they managed to present this interesting table:
Evidently, it's best to work in San Jose, have a wife in New York City and buy a house (or several) in Philadelphia. Clearly, cross-referencing information from several sources yields some unexpected trends.
[BTW, I'm missing a row here with data from Raleigh, NC - another booming hi-tech community. If any of my friends there cares to find that data for me, I'll be glad to publish it].
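
Cross-referencing of this kind boils down to joining several tables on a shared key, here the city name. Below is a minimal sketch in Python. The function name and the field names are my own, and it is not the magazine's method. Each source is assumed to be a plain dictionary keyed by city.

```python
def cross_reference(**sources):
    """Join several {city: value} mappings on the city name.

    Only cities present in every source are kept; each keyword
    argument name becomes a column in the combined row.
    """
    if not sources:
        return {}
    common = set.intersection(*(set(src) for src in sources.values()))
    return {
        city: {name: src[city] for name, src in sources.items()}
        for city in sorted(common)
    }
```

Called as `cross_reference(jobs=..., house_price=..., cost_of_living=...)`, it returns one row per city found in all three mappings. A city missing from any single source drops out of the result, which is exactly how a row can go missing from a table like this.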

So, today we have seen that you can get at raw data no matter how well it's protected; that you can cross-reference data from different sources to find hidden trends; but that you can't always trust data that's offered to you for free.

And if you take one thing from this post, it's this: with data, it's more important to know which questions to ask than to look at answers (and a tip of the hat to Douglas Adams, who reached the same conclusion years ago. In his 'Hitchhiker's Guide to the Galaxy', 42 is the answer to Life, the Universe and Everything, but no one knows what the question is).

[To read some more about my fascination with data, start with Set your dark data free! and Lab Tour. More can be found throughout the blog.]

Update 1/12/2008:
I just noticed some strange behavior in this post. Whenever you click to enlarge one of the images in Firefox, you may get a blank screen. In IE, you get the right image. This has something to do with the way Google hosts those images, as I can't view them in FF even outside the blog.
Strange... could this be related to the post's content?
