Ethics and Best Practices

What are the pitfalls and potential ethical issues in computational social science research?

Readings

Can data be deanonymized?

In 2006, AOL Research released a dataset with 20 million searches from over 600 thousand accounts. It seemed like a good idea at the time: the data had been anonymized, after all! The problem is that anonymization does not quite work like that at scale. If you look at your own search history, it probably contains plenty of queries that uniquely identify you: directions to your home address, say, or other things specific to you. And with data at this scale, it is easy to combine the released data with other sources to learn even more about individual people. For example, the New York Times identified people in the dataset by combining their search histories with the phone book (a public record of names and phone numbers, common at the time) (Barbaro and Zeller 2006).
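This kind of linkage works like a database join: any attribute shared between the "anonymous" records and a public source can serve as a key. A toy sketch of the idea, with invented records and field names (not the actual AOL data or the Times' method):

```python
# Hypothetical linkage-attack sketch: match "anonymized" search logs
# against a public directory using clues embedded in the queries.
# All records below are made up for illustration.

searches = [
    {"user_id": 4417749, "query": "landscapers in lilburn ga"},
    {"user_id": 4417749, "query": "dog that urinates on everything"},
]

phone_book = [
    {"name": "T. Arnold", "town": "Lilburn, GA"},
]

# A query mentioning a small town narrows the candidate set; a handful
# of such clues can pin an account down to a single person.
matches = []
for entry in searches:
    for person in phone_book:
        town = person["town"].split(",")[0].lower()
        if town in entry["query"]:
            matches.append((entry["user_id"], person["name"]))

print(matches)
```

With real data the join keys are noisier (misspellings, partial addresses), but the principle is the same: "anonymous" identifiers do not help when the content itself is identifying.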

In fact, many people can be re-identified using simple demographic data alone. Rocher, Hendrickx, and de Montjoye (2019) created a model showing that “99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes”. In the embed below, you can estimate the chance that a dataset entry with your zip code, birthday, and gender refers uniquely to you (no data is shared with any server from this page).
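The intuition behind that estimate can be sketched with a back-of-the-envelope model (my own simplification, not the generative model from Rocher et al.): assume birth dates are uniformly distributed and independent of zip code, and ask how likely it is that no one else in your zip code shares your birth date and gender.

```python
# Back-of-the-envelope uniqueness estimate (a simplified illustration,
# not the Rocher et al. 2019 model). Assumes birth dates are uniform
# over `distinct_dates` values and independent of zip code.

def p_unique(zip_population, distinct_dates=365 * 80, sexes=2):
    """Probability that no other resident of the zip code shares
    your exact (birth date, sex) combination."""
    cells = distinct_dates * sexes       # possible (birth date, sex) combos
    p_share = 1 / cells                  # chance one resident matches you
    return (1 - p_share) ** (zip_population - 1)

# Example: a zip code with 20,000 residents.
print(p_unique(20_000))
```

Even under these crude assumptions, the combination is unique well over half the time in a mid-sized zip code, which is why three mundane attributes make such an effective fingerprint.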

Common Ethical Principles

Ethical issues in research are not a new problem. In response to the Tuskegee Syphilis Study, the Belmont Report (1979) established a set of ethical principles and guidelines to protect human subjects:

  • Respect for persons
  • Beneficence
  • Justice

These principles remain influential in many fields. For research involving Information and Communications Technology (ICT), the Menlo Report (Kenneally and Dittrich 2012) added a fourth principle: respect for law and public interest.

Beware of Pitfalls

Most practitioners of computational social science have good intentions at heart; nevertheless, it is important to be cautious, if not defensive, about the potential negative effects of research. Lazer et al. (2014) illustrate the pitfall of “big data hubris”: the temptation to ignore foundational issues of measurement and validity just because you have access to a lot of data. Google Flu Trends (GFT) tried to predict flu outbreaks from search data, on the theory that an increase in searches related to flu symptoms would signal an upcoming outbreak. While GFT was initially seen as rather successful, it went on to consistently overestimate the prevalence of the flu. In fact, simply using data from three weeks earlier turned out to be a better predictor than GFT!

Lazer et al. (2014) suggest a few things that could have gone wrong. First, Google is not a static entity: its search algorithm is constantly changing, which they argue highlights the need for greater transparency and replicability in research. Second, there was little value in marginally improving on the existing, simpler lagged model built from CDC data; just because you can does not always mean you should. Finally, they caution that having more data behind a model does not guarantee the model is better.
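Part of the point is how trivial the lagged baseline is to build. A minimal sketch, using invented weekly counts rather than real CDC data: predict each week's flu prevalence as simply the value observed three weeks earlier, then score the predictions.

```python
# Sketch of a naive lagged baseline (invented data, not CDC numbers):
# predict this week's flu prevalence as the count from LAG weeks ago.

cases = [10, 12, 15, 20, 28, 35, 40, 38, 30, 22]  # weekly case counts
LAG = 3

predictions = [cases[t - LAG] for t in range(LAG, len(cases))]
actuals = cases[LAG:]

# Mean absolute error of the baseline over the evaluable weeks.
mae = sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)
print(f"lag-{LAG} baseline MAE: {mae:.1f}")
```

Any more elaborate model has to beat this one-line predictor to justify its complexity; GFT, for all its data, ultimately did not.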

Best Practices

References

Barbaro, Michael, and Tom Zeller Jr. 2006. “A Face Is Exposed for AOL Searcher No. 4417749.” The New York Times, August.
Jee, Charlotte. 2019. “You’re Very Easy to Track Down, Even When Your Data Has Been Anonymized.” MIT Technology Review. https://www.technologyreview.com/2019/07/23/134090/youre-very-easy-to-track-down-even-when-your-data-has-been-anonymized/.
Salganik, Matthew J. 2018. “Ethics.” In Bit by Bit: Social Research in the Digital Age, 281–354. Princeton: Princeton University Press.
Kenneally, Erin, and David Dittrich. 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2445102.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–5. https://doi.org/10.1126/science.1248506.
Rocher, Luc, Julien M. Hendrickx, and Yves-Alexandre de Montjoye. 2019. “Estimating the Success of Re-Identifications in Incomplete Datasets Using Generative Models.” Nature Communications 10 (1): 3069. https://doi.org/10.1038/s41467-019-10933-3.
“The Belmont Report.” 1979, April.
Zook, Matthew, Solon Barocas, danah boyd, Kate Crawford, Emily Keller, Seeta Peña Gangadharan, Alyssa Goodman, et al. 2017. “Ten Simple Rules for Responsible Big Data Research.” PLOS Computational Biology 13 (3): e1005399. https://doi.org/10.1371/journal.pcbi.1005399.