Introduction from BITSS: Open data has many benefits. It can foster collaboration, facilitate more complete meta-analysis, and improve the visibility of related research outputs. At the same time, we know that re-identification can pose real risks to study participants and that balancing openness against such risks is a delicate and often difficult task.
This is why research ethics and de-identification for data sharing have typically been discussed as inseparable at our training events. It’s also why it’s critical that institutions working with human subjects and looking to adopt open science policies consider the implications of ethics for transparency practices (see a Technical Note we developed for the Inter-American Development Bank on the subject).
With that in mind, we’re excited to see J-PAL’s new Guide to Publishing Research Data and the accompanying Guide to De-identifying Data, both written by Sarah Kopper, Anja Sautmann, and James Turitto. We hope these will be useful for students, researchers, and data stewards. Find them on the J-PAL website or in the BITSS Resource Library, which we recently redesigned and made searchable by the topic, type, and discipline of each item.
This blog post, written by the authors of the guides, was originally posted on the J-PAL blog.
We are pleased to announce the publication of two new methods guides to de-identifying and publishing research data. These guides draw on J-PAL’s experience of publishing research data on randomized evaluations in the social sciences for more than a decade. They provide practical advice for students, researchers, and anyone else publishing their own or others’ data.
Researchers who plan to publish their data should make every effort to minimize the risk of re-identification of their study participants, as is commonly required by ethical standards, IRB protocols, and legal requirements. This is done through a process known as de-identification, in which variables that could be used to identify individuals are either masked through techniques such as aggregation or encoding, or removed from the dataset altogether.
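To make the three techniques named above concrete, here is a minimal Python sketch of removal, encoding, and aggregation applied to a single participant record. It is an illustration only and is not taken from the guides: the field names (full_name, phone_number, participant_id, birth_date), the deidentify function, and the choice of a keyed hash for encoding are all assumptions, and a real project should follow its own IRB protocol and the guides themselves.

```python
import hashlib
import hmac

# Hypothetical field names, chosen only for this illustration.
DIRECT_IDENTIFIERS = {"full_name", "phone_number"}


def deidentify(record, secret_key):
    """Return a de-identified copy of one participant record (a dict).

    Assumes the record has a 'participant_id' and an ISO-formatted
    'birth_date' string such as '1987-05-14'.
    """
    # Removal: drop direct identifiers from the record altogether.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Encoding: replace the original ID with a keyed hash, so that rows can
    # still be linked across files without exposing the original ID.
    # The key must be stored securely and never published with the data.
    digest = hmac.new(
        secret_key,
        str(record["participant_id"]).encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    out["participant_id"] = digest[:16]

    # Aggregation: keep only the year of birth instead of the full date.
    out["birth_year"] = int(record["birth_date"][:4])
    del out["birth_date"]

    return out


# Example usage with made-up values.
record = {
    "participant_id": 1042,
    "full_name": "Jane Example",
    "phone_number": "555-0100",
    "birth_date": "1987-05-14",
    "treatment": 1,
}
clean = deidentify(record, secret_key=b"replace-with-a-private-key")
print(sorted(clean))
```

Steps like these reduce the risk of re-identification but do not eliminate it, since combinations of the remaining variables can still single out individuals in small samples.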
About the guides
The Guide to Publishing Research Data includes:
- A list of considerations to make before publishing data, such as what information was provided to study participants and the IRB, the sensitivity of the data collected, and legal requirements
- Sample consent form language that will allow future publication of de-identified data
- A checklist for preparing data for publication
- And more
The accompanying Guide to De-Identifying Data approaches de-identification as a process that reduces the risk of identifying individuals. It includes:
- An overview of personally identifiable information (PII) and the responsibility of data users not to use data to try to identify human subjects
- Recommendations for handling direct identifiers (such as full name, social security number, or phone number), as well as indirect identifiers (such as month/year of birth, nationality, or gender)
- Guidance on de-identification steps to take throughout the research process, such as encrypting all data containing identifying information as soon as possible
- A list of common identifiers, including those labeled by the United States’ Health Insurance Portability and Accountability Act (HIPAA) guidelines as direct identifiers
- And more
Why publish de-identified research data
Increasing the availability of research data benefits researchers, policy partners who supported the studies, students who learn from using the data, and, importantly, the people from whom the data was collected. Data sharing can provide many benefits and opportunities to the research community, including:
- Allowing for re-use of the data by researchers, policymakers, students, and teachers around the world
- Providing opportunities for new research, such as meta-analyses and questions on external validity and generalizability of results
- Enabling the replication and confirmation of published results as well as sensitivity or complementary analyses
J-PAL has been committed to making research more transparent for over a decade and supports the publication of de-identified research data in a digital repository such as J-PAL and IPA’s Datahub for Field Experiments in Economics and Public Policy, the Harvard Institute for Quantitative Social Sciences Dataverse, the Inter-university Consortium for Political and Social Research at the University of Michigan, or the Yale Institution for Social and Policy Studies Data Archive.