Open data makes for more credible science: two articles on disappearing data

Why is scientific misconduct such a problem? One major reason is that, until a few years ago, there weren’t many effective mechanisms for keeping researchers accountable. Even if someone wanted to replicate a study, obtaining data from published authors could be extremely difficult. In 2014, zoologist Timothy Vines and his colleagues wrote a short piece describing just how big an issue disappearing data was. Less than a year earlier, political scientist Rick Wilson became the new editor of the American Journal of Political Science (AJPS) and began strictly enforcing the journal’s requirement that published authors post their data. Allan Dafoe, another political scientist, saw this as an opportunity to examine how well the enforcement of data sharing requirements served as a way to prevent research fraud. Watch the video to learn what he found.


In recent years, many policies have been put in place that require research to be accessible to anyone through public archives. Political scientist Allan Dafoe, author of “Science Deserves Better: The Imperative to Share Complete Replication Files,” advocates for replication transparency, writing that “good research involves publishing complete replication files, making every step of research as explicit and reproducible as is practical.”

Unfortunately, many researchers still do a poor job of preserving their data, and too often it is lost. Dafoe’s paper argues that, with transparency and publication, “political science will become more refutable, open, cumulative, and accessible.” Without transparency, fraud threatens to reduce the public’s trust in science.

In “The Availability of Research Data Declines Rapidly with Article Age,” Timothy Vines et al. also defend the importance of data transparency by analyzing how article age affects data availability. The study formally investigated the relationship between a published paper’s age and each of four probabilities:

  1. the probability of finding at least one working e-mail for a first, last, or corresponding author in order to request data;
  2. the conditional probability of a response, given that at least one e-mail appeared to work;
  3. the conditional probability of getting a response that indicated the status of the data, given that a response was received; and
  4. the conditional probability that the data were extant, given that an informative response was received.
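
Because each of these probabilities is conditional on the previous step succeeding, they chain together: the chance that a data request ends with the data confirmed as extant is the product of all four. The sketch below is only an illustration of that arithmetic. The function name and the four input values are assumptions chosen for the example and are not figures from the paper.

```python
from math import prod


def chance_data_confirmed_extant(p_email, p_response, p_informative, p_extant):
    """Multiply the four stage probabilities (chain rule for conditional probabilities)."""
    stages = [p_email, p_response, p_informative, p_extant]
    for p in stages:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return prod(stages)


# Hypothetical stage probabilities, NOT taken from Vines et al.
example = chance_data_confirmed_extant(
    p_email=0.8,        # at least one working e-mail found
    p_response=0.5,     # a response, given a working e-mail
    p_informative=0.9,  # data status stated, given a response
    p_extant=0.7,       # data extant, given an informative response
)
print(f"{example:.3f}")
```

Even when every individual stage looks fairly likely to succeed, the product can be small, which is why a decline at any one stage matters.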

The authors found a negative relationship between the age of the paper and the probability of finding at least one apparently working e-mail, either through the journal or by searching online. For each additional year, the odds of finding a working e-mail fell by 7%. Additionally, there was a “negative relationship between age of the paper and the probability of the data set being extant (‘shared’ or ‘exists but unwilling to share’).” With each additional year after publication, the odds of the data being extant decreased by 17%. Finally, they found a slightly positive effect of article age on working e-mails found via web searches. Data from older studies tended to be unavailable mainly because the data sets had been lost or were stored on inaccessible media such as Zip or floppy disks. Recovering such data with modern hardware would therefore take an excessive amount of time.
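
A yearly decline in odds is not the same as a yearly decline in probability, so it can help to see how the two relate. The sketch below converts a starting probability to odds, applies a constant yearly odds ratio (0.83 corresponds to a 17% fall per year), and converts back. The function name and the 90% starting probability are assumptions for illustration only, not values reported in the study.

```python
def prob_after_years(baseline_prob, yearly_odds_ratio, years):
    """Apply a constant per-year odds ratio to a baseline probability."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    odds = baseline_prob / (1.0 - baseline_prob)  # probability -> odds
    odds *= yearly_odds_ratio ** years            # compound the yearly change
    return odds / (1.0 + odds)                    # odds -> probability


# Hypothetical baseline: data 90% likely to be extant at publication.
for years in (0, 5, 10, 20):
    p = prob_after_years(0.9, 0.83, years)
    print(f"after {years:2d} years: {p:.2f}")
```

Because the change compounds on the odds scale, the probability drops slowly at first when the baseline is high and then much faster as the odds approach even.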

Because of data’s potential usefulness in studies performed long after collection, the authors advocate for data preservation in public archives where it cannot be lost or withheld by authors.

These articles demonstrate how imperative data availability is for maintaining scientific credibility, both within the research community itself and in the public eye. In an effort to facilitate a transition toward more open data, Allan Dafoe makes various recommendations for how to produce good replication files:

For Statistical Studies:

  1. Do all data preparation and analysis in code.
  2. Adopt best practices for coding, including clarity in code, testing, and running code all the way through.
  3. Build all analysis from primary data files.
  4. Fully describe variables.
  5. Document every empirical claim.
  6. Archive your files.
  7. Encourage co-authors to adopt these standards.
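
As a rough illustration of recommendations 1, 3, and 5, the following sketch starts from an unmodified primary data file, does every cleaning step in code, and ties a reported number to the line that produces it. The data set, variable names, and the claim in the comment are all invented for this example. A real replication file would read the primary data from disk instead of from an embedded string.

```python
import csv
import io

# Stand-in for a primary data file (hypothetical survey data).
RAW_CSV = """respondent_id,age,turnout
1,34,1
2,,0
3,51,1
4,29,0
"""


def load_primary(text):
    """Read the primary data exactly as delivered, without manual edits."""
    return list(csv.DictReader(io.StringIO(text)))


def prepare(rows):
    """Drop rows with a missing age and convert types; every exclusion is in code."""
    cleaned = []
    for row in rows:
        if row["age"] == "":
            continue  # documented exclusion: age not recorded
        cleaned.append({"age": int(row["age"]), "turnout": int(row["turnout"])})
    return cleaned


def turnout_rate(rows):
    """Share of respondents who turned out."""
    return sum(r["turnout"] for r in rows) / len(rows)


rows = prepare(load_primary(RAW_CSV))
# Supports the (hypothetical) claim: "turnout among respondents of known age".
rate = turnout_rate(rows)
print(f"n = {len(rows)}, turnout rate = {rate:.2f}")
```

Running the script from top to bottom regenerates the reported figure from the primary data, which is what makes the claim checkable by someone else.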

For Journals:

  1. Require complete replication files before acceptance.
  2. Encourage high standards for replication files.
  3. Implement replication audits.
  4. Retract publications with non-replicable analyses.

Data sharing and transparency are scientific public goods, benefiting many and lowering the barrier to entry for students and junior researchers. Open science provides tools, incentivizes caution in study designs, and can produce much more credible research.

Why do you think scientists and researchers don’t make more of an effort to preserve their data after publication?

If you want to dive deeper into the material, you can read the entirety of both papers by clicking on the links in the SEE ALSO section at the bottom of this page.


References

Dafoe, Allan. 2014. “Science Deserves Better: The Imperative to Share Complete Replication Files.” PS: Political Science & Politics 47 (1): 60–66. doi:10.1017/S104909651300173X.

Vines, Timothy H., Arianne Y. K. Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, Jean-Sébastien Moore, Sébastien Renaut, and Diana J. Rennison. 2014. “The Availability of Research Data Declines Rapidly with Article Age.” Current Biology 24 (1): 94–97.