About a decade ago, a hacker said to me, flatly, “Assume every card in your wallet is compromised, and proceed accordingly.” He was right. Consumers have adapted to a steady thrum of data breach notifications, random credit card charges, and out-of-the-blue card replacements. A privacy-industrial complex has sprung up from this — technology, services, and policies all aimed at trying to protect data while allowing it to flow freely enough to keep the modern electronic bazaar thriving. A key strategy in this has been to “scrub” data, which means removing personally identifiable information (PII) so that even if someone did access it, they couldn’t connect it to an individual.
So much for all that.
In a paper published in Science last week, MIT scientist Yves-Alexandre de Montjoye shows that anonymized credit card data can be reverse-engineered to identify individuals’ transactions. The finding calls into question many of the policies developed to protect consumers, and it forces data scientists to reconsider the policies and ethics that guide how they use large datasets.
de Montjoye and his colleagues examined three months of credit card transactions for 1.1 million people, all scrubbed of PII. Even so, 90% of the time they could identify an individual in the dataset using the date and location of just four of that person’s transactions. Adding the price of the transactions increased “reidentification” (the academic term for spotting an individual in anonymized data) to 94%. Women were easier to reidentify than men, and reidentification became easier as the consumer’s income rose.
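To make the idea concrete, here is a minimal sketch of that kind of check, not the authors’ code: given a few (date, shop) points known to belong to one person, count how many people in the anonymized dataset match all of them; if exactly one does, those points single that person out. The field names, toy data, and helper functions below are all hypothetical.

```python
import random
from collections import defaultdict

# Toy anonymized dataset: (user_id, day, shop_id). Prices are omitted for simplicity.
transactions = [
    ("u1", 3, "shopA"), ("u1", 5, "shopB"), ("u1", 9, "shopC"), ("u1", 12, "shopA"),
    ("u2", 3, "shopA"), ("u2", 6, "shopD"), ("u2", 9, "shopC"), ("u2", 20, "shopE"),
    ("u3", 1, "shopF"), ("u3", 5, "shopB"), ("u3", 9, "shopC"), ("u3", 30, "shopA"),
]

def group_by_user(transactions):
    """Collect each user's (day, shop) points."""
    by_user = defaultdict(set)
    for user, day, shop in transactions:
        by_user[user].add((day, shop))
    return by_user

def is_unique(known_points, transactions):
    """True if exactly one user's trace contains every known (day, shop) point."""
    by_user = group_by_user(transactions)
    matches = [u for u, points in by_user.items() if set(known_points) <= points]
    return len(matches) == 1

def estimate_unicity(transactions, k=4, trials=1000):
    """Fraction of trials in which k random points from one user single that user out."""
    by_user = {u: list(pts) for u, pts in group_by_user(transactions).items()}
    users = [u for u, pts in by_user.items() if len(pts) >= k]
    hits = 0
    for _ in range(trials):
        user = random.choice(users)
        known = random.sample(by_user[user], k)
        hits += is_unique(known, transactions)
    return hits / trials

print(estimate_unicity(transactions, k=2, trials=200))
```

On a real dataset of millions of traces this kind of check is what the reported percentages summarize; on the toy data above it simply shows that some pairs of points (say, two shops two people both visited on the same days) are not enough to pin down one person, while others are.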
To be clear: Reidentification means that the researchers could isolate all the transactions belonging to a single individual; de Montjoye didn’t attempt to say who that individual is. If he wanted to know my transactions, for example, he’d need to take additional steps to cross-reference something he already knew about me with his data. If I posted on Facebook about a trip to a restaurant, that post could provide the key to connecting me to an entire portfolio of anonymous transactions. “We didn’t try to put names on it,” de Montjoye says, “but we know basically what you need to do that.”
What’s more, de Montjoye showed that even “coarse” data provides “little anonymity.” He lowered the “resolution” of his data by looking only at the areas where purchases happened rather than the specific shops, and at the 15-day windows in which they happened rather than the specific dates. He also widened the price ranges, so that transactions previously categorized as between $5 and $16 were put in a bin more than twice as large, running from $5 to $34. Even with low-res data like this, he could pluck out four transactions and reidentify individuals 15% of the time. With 10 such data points, he could, remarkably, reidentify individuals 80% of the time.
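A rough sketch of what that coarsening step might look like, before rerunning the same kind of uniqueness check as above: the bin widths, the shop-to-area mapping, and the function name are assumptions for illustration, not the paper’s exact scheme.

```python
def coarsen(day, shop, price, shop_to_area):
    """Map a precise record to a low-resolution one: an area instead of a shop,
    a 15-day window instead of a date, and a wide price bin instead of an exact price."""
    area = shop_to_area[shop]          # e.g. a neighborhood rather than a specific store
    window = day // 15                 # index of the 15-day time frame
    price_bin = 0 if price < 5 else (1 if price < 34 else 2)  # widened price bins (assumed)
    return (window, area, price_bin)

# Hypothetical mapping from shops to areas.
shop_to_area = {"shopA": "area1", "shopB": "area1", "shopC": "area2"}

print(coarsen(day=9, shop="shopB", price=12.50, shop_to_area=shop_to_area))
# -> (0, 'area1', 1): the record now says only "area1, first 15-day window, $5-$34 bin"
```

The surprising result is that even after this deliberate blurring, a handful of such low-resolution points still narrows the field to one person a meaningful fraction of the time.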
It’s not the first time de Montjoye has played the part of privacy killjoy. In previous work he pulled off a similar trick, reidentifying individuals using anonymous mobile phone location data. (Others have performed similar parlor tricks with other datasets.) And while he hasn’t yet tested other types of large datasets, such as browsing histories, he believes that “it seems likely” that they, too, are susceptible to reidentification.
The implications of de Montjoye’s work are profound. Broadly, it means that anonymity doesn’t ensure privacy, which could render toothless many of the world’s laws and regulations around consumer privacy. Guaranteeing anonymity (that is, the removal of PII) in exchange for being able to freely collect and use data — a bread-and-butter marketing policy for everyone from app makers to credit card companies — might no longer be tenable.
via hbr.org