Ethics and Best Practices
Can data be deanonymized?
In 2006, AOL Research released a dataset of 20 million searches from over 600 thousand accounts. It seemed like a good idea at the time: the data had been anonymized, after all! The problem is that anonymization does not quite work like that at scale. If you look at your own search history, you probably have plenty of queries that uniquely identify you: say, directions to your home address. And with data at this scale, it is easy to combine the search logs with other data sources to learn even more about individual people. For example, the New York Times identified people in the dataset by combining their search history with the phone book, a record of names and phone numbers that was common at the time (Barbaro and Jr 2006).
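To make the linkage idea concrete, here is a minimal sketch in Python with pandas, using entirely made-up records and column names: an “anonymized” search log is joined to an external directory on a shared quasi-identifier (here, a home ZIP code), which re-attaches names to supposedly anonymous IDs.

```python
import pandas as pd

# Made-up "anonymized" search log: account IDs were replaced with random numbers,
# but the queries still leak a quasi-identifier (a home ZIP code inferred from
# direction searches).
searches = pd.DataFrame({
    "anon_id":  [1001, 1001, 2002],
    "query":    ["directions to 123 main st", "numb thumb", "cheap flights"],
    "home_zip": ["30047", "30047", "10001"],
})

# Made-up external directory (the "phone book"): names tied to the same quasi-identifier.
directory = pd.DataFrame({
    "name":     ["A. Resident", "B. Someone"],
    "home_zip": ["30047", "10001"],
})

# A plain join on the shared quasi-identifier re-attaches a name to each "anonymous" ID.
linked = searches.merge(directory, on="home_zip")
print(linked[["anon_id", "name", "query"]])
```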
In fact, many people can be deanonymized using only simple demographic data. Rocher, Hendrickx, and de Montjoye (2019) built a model showing that “99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes”. Using the embed below, you can estimate the chance that a dataset entry matching your zip code, birthday, and gender actually refers to you (no data is shared with any server by this website).
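The intuition behind that kind of calculation can be captured with a back-of-the-envelope sketch. This is only an illustration, not the actual model from Rocher, Hendrickx, and de Montjoye (2019), and it assumes dates of birth and gender are uniformly and independently distributed within a zip code, which real populations are not.

```python
def p_unique(zip_population, life_expectancy_years=79):
    """Rough chance that a (zip code, date of birth, gender) combination
    picks out exactly one person, assuming dates of birth and gender are
    uniformly and independently distributed within the zip code."""
    # Number of equally likely (date of birth, gender) cells a resident can fall into.
    cells = life_expectancy_years * 365.25 * 2
    # Probability that none of the other residents shares your cell.
    return (1 - 1 / cells) ** (zip_population - 1)

print(f"{p_unique(10_000):.0%}")   # roughly 84% in a zip code of 10,000 people
print(f"{p_unique(100_000):.0%}")  # roughly 18% in a zip code of 100,000 people
```

Even this crude model shows why the combination is so identifying: in a typical zip code there are far more (birthday, gender) cells than residents, so most people sit in a cell of their own.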
Common Ethical Principles
Ethical issues in research are not a new problem. In response to the Tuskegee Syphilis Study, “The Belmont Report” (1979) established a set of ethical principles and guidelines to protect human subjects:
- Respect for persons
- Beneficence
- Justice
These principles remain influential in many fields. For research on Information and Communications Technologies (ICT), Kenneally and Dittrich (2012) added a fourth principle: respect for law and public interest.
Beware of Pitfalls
Most practitioners of computational social science have good intentions at heart; however, it is important to be cautious, if not defensive, about the potential negative effects of research. Lazer et al. (2014) illustrate the pitfall of “big data hubris”, where one might be tempted to ignore foundational issues just because one has access to a lot of data. Google Flu Trends (GFT) tried to predict flu outbreaks from search data: the idea was that an increase in searches related to flu symptoms would signal an upcoming outbreak. While GFT was initially seen as rather successful, it consistently overestimated the prevalence of the flu. In fact, simply using data from three weeks earlier turned out to be a better predictor than GFT!
Lazer et al. (2014) suggest a few things that could have gone wrong here. First, Google is not a static entity: the search algorithm is constantly changing, which they argue highlights the need for greater transparency and replicability in research. Second, they suggest there was little value in improving on the existing, simpler lagged model built from CDC data; just because you can does not always mean you should. Finally, they caution that having more data behind a model does not guarantee it is better.
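To see how low the bar was for the simpler alternative, here is a sketch of the kind of comparison Lazer et al. (2014) describe, using entirely made-up weekly numbers: a naive baseline that reuses the prevalence observed three weeks earlier, evaluated against a hypothetical model that over-estimates the way GFT did.

```python
import pandas as pd

# Made-up weekly flu prevalence and the predictions of a hypothetical model that,
# like GFT, systematically over-estimates (illustration only, not real data).
flu = pd.DataFrame({
    "week":       range(1, 11),
    "observed":   [1.2, 1.4, 1.9, 2.6, 3.4, 4.0, 3.7, 3.1, 2.4, 1.8],
    "model_pred": [2.6, 3.0, 4.1, 5.5, 7.1, 8.3, 7.6, 6.4, 5.0, 3.8],
})

# Naive baseline: predict this week's prevalence with the value observed three weeks ago.
flu["lag3_pred"] = flu["observed"].shift(3)

evaluable = flu.dropna()  # keep only the weeks where the lagged baseline exists
mae_model = (evaluable["model_pred"] - evaluable["observed"]).abs().mean()
mae_lag3  = (evaluable["lag3_pred"] - evaluable["observed"]).abs().mean()
print(f"hypothetical model MAE: {mae_model:.2f}, lag-3 baseline MAE: {mae_lag3:.2f}")
```

In this fabricated example the trivial lag-3 baseline already beats the over-estimating model, which is the point: a model is only worth its complexity if it outperforms the simplest alternative you could have used instead.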