OK Cupid, an online dating site, has caused a bit of a stir recently by performing experiments on their users. But even without the ethical questions there’s reason to be skeptical about what their data can actually tell us.
Big Data, the book by Viktor Mayer-Schönberger and Kenneth Cukier, talks about two phenomena the authors believe will drive a big data revolution: ‘data exhaust’ and ‘N equals all’. The first refers to the trail of information we leave behind when using the internet, the residue of our clicks and typing.
The second idea is that modern technology allows researchers to analyse the whole population rather than just inspecting a smaller sample. With big data, researchers have access to datasets so vast that N, the number of observations in the sample, is equal to all.
But in this case it isn’t. OK Cupid may be able to look at the entire trail unconsciously left by their users, but their users aren’t the world: they’re a sample, and a self-selecting one at that.
Take this chart from an OK Cupid blog showing that men have a marked preference for younger women and that they lie about this on their dating profiles. The chart shows that they are messaging significantly younger women than the youngest allowable match listed on their profile.
While the insight that some older men are attracted to younger women isn’t exactly new (it’s the basis for one of the Canterbury Tales), in this case the fact is given a scientific spin by the use of data.
However, the pattern is likely to be partially driven by the way in which the sample is gathered and the fact that, unlike the ‘big data’ ideal, N doesn’t equal all.
Another graph on the same blog shows why. OK Cupid’s user profile skews young, so older men may more often message younger women simply because there are more of them around. There are more than six times as many 22-year-old women on OK Cupid as there are 48-year-olds.
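To see how much of the pattern the age skew alone could produce, here is a rough back-of-the-envelope sketch. Only the six-to-one ratio comes from the blog; the age range of 18 to 60, the smooth geometric decline and the function names are assumptions made purely for illustration, not OK Cupid’s actual figures.

```python
# Hypothetical site population: women aged 18 to 60 (an assumed range).
AGES = range(18, 61)

# Assumed shape: numbers fall off geometrically with age, tuned so that
# there are six times as many 22-year-olds as 48-year-olds.
DECAY = 6 ** (-1 / (48 - 22))


def women_at_age(age):
    """Illustrative count of women of a given age on the site."""
    return 1000 * DECAY ** (age - 22)


def share_younger(man_age, count_fn):
    """Share of messages that go to younger women if a man picks
    recipients at random, with no age preference whatsoever."""
    total = sum(count_fn(a) for a in AGES)
    younger = sum(count_fn(a) for a in AGES if a < man_age)
    return younger / total


# A 48-year-old man on the skewed site versus a site with equal numbers
# of women at every age.
skewed = share_younger(48, women_at_age)
flat = share_younger(48, lambda age: 1.0)
print(f"skewed site: {skewed:.0%}, flat site: {flat:.0%}")
```

Under these assumptions a 48-year-old man with no age preference at all still sends a larger share of his messages to younger women on the skewed site than he would on a site with an even spread of ages. That says nothing about how much of the real pattern is preference and how much is availability, only that the raw message counts can’t separate the two.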
And these older men are self-selecting: they have chosen to register with, and keep using, a free dating service that skews younger. It’s fair to say there are reasons to think that the older men who use OK Cupid may have different tastes to the average man.
Maybe this doesn’t matter too much; the insights in this case come from a slightly tongue-in-cheek blog looking at why men should be keener to date older women, not a peer-reviewed article in a journal of psychology. Maybe worries about sampling bias just aren’t the point.
But it illustrates the pitfalls of relying on ‘data exhaust’ and assuming that N=all. A huge amount of data is input by users of social networks like Facebook and Twitter or internet companies like Amazon and Google, and plenty of people are eager to explore the insights it might reveal. But it’s important to remember that, at every stage, who is using those services and what data are left behind will be determined by the nature of the platform itself.