Internal replication: another tool for the reproducibility toolkit

By Jade Benjamin-Chung (University of California, Berkeley) and Benjamin F. Arnold (University of California, San Francisco)

Introduction from BITSS: Internal replication is a new tool in the reproducibility toolkit with which original study investigators replicate findings prior to submission to a peer-reviewed journal. Jade Benjamin-Chung (UC Berkeley) and Benjamin Arnold (UCSF) describe this process and explain how it can reduce errors and bias, improve the reproducibility of science, and contribute to more rapid scientific advances. Learn more in their recent paper and enjoy this read!

Scientists have recently adopted new strategies to increase the reproducibility of published literature. Some journals (e.g., American Journal of Political Science, journals of the American Economic Association) have instituted policies to review analytic code before publication to ensure results can be reproduced, and many more require sharing underlying data (e.g., PLOS journals). A growing number of scientists are attempting to replicate published study findings when original study data are publicly available, but it takes a new team a lot of time to replicate a published analysis.

Two recent retractions of high-profile articles on COVID-19 in The Lancet and the New England Journal of Medicine illustrate the potential value of internal, pre-publication replication. Both articles analyzed data from the Surgisphere corporation, but the data and code were not shared due to privacy agreements. The Lancet study included an observational analysis that documented higher mortality among hospitalized patients who received hydroxychloroquine. Hundreds of scientists signed an open letter raising concerns about the provenance and quality of the Surgisphere dataset and the analytic methods used. Both journals issued expressions of concern stating that they had requested additional information from the authors about the reliability of the data, and both articles were subsequently retracted. Yet these retractions came only after multiple large-scale, randomized therapeutic trials for COVID-19 had paused their hydroxychloroquine arms, delaying those trials' findings. Had the study authors scrutinized the analysis more closely internally, they could have preempted the wide distribution of misleading results that slowed scientific progress in the pandemic response.

These retractions may in part reflect the unusual pressure for scientists to rapidly publish their work during the COVID-19 pandemic. However, even under typical circumstances, there is a great need for stronger internal procedures to ensure the quality of data and statistical analyses prior to publication. Marc Lipsitch, Professor of Epidemiology at Harvard University, recently tweeted that these retractions remind us of the importance of the 4-layer safety net “that keeps science functioning”: 1) ethical data handling (i.e., no falsification); 2) internal efforts to identify and resolve errors and limitations before publication, and clear description of limitations in publications; 3) sharing data and replication code when possible; and 4) rigorous peer review. 

“As scientists, we should accept that errors and cognitive biases are human tendencies and use strategies to minimize them.”

Our recent article in Gates Open Research proposes a tool to assist scientists in ensuring layer 2 of the safety net: “internal replication” by the original study investigators before submission to a peer-reviewed journal. We had heard about this practice in private industry (e.g., pharmaceutical industry trials or litigation consulting firms), but we hadn’t seen clear explanations in the academic literature, so we developed this process for the analysis of two large cluster-randomized trials with multiple intervention arms and numerous outcomes (Luby et al., 2018; Null et al., 2018). Here, we describe this process and explain how it should reduce errors and bias, particularly when coupled with pre-analysis plans and masking to experimental group assignments. Internal replication should contribute to more rapid scientific advances by improving the quality of published evidence. 

What’s wrong with the status quo?

In disciplines whose research includes computational workflows—epidemiology, economics, social science, and even biological and physical science—it is common for a single analyst to perform all computation and error checking without independent replication before submitting a manuscript for publication. The status quo approach to computational analyses introduces at least two potential threats to reproducibility: 

  • Errors: Since a typical computational workflow may involve thousands of lines of analytic code, small coding errors are inevitable, but may result in vastly different findings and policy implications. 
  • Cognitive biases: Even when researchers adhere to a pre-analysis plan, they must still make many judgment calls when manipulating and analyzing data. Researchers tend to confirm their own beliefs—consciously or not—when making these decisions, and sometimes this bias means they are more likely to obtain a statistically significant finding. In the absence of a pre-analysis plan, these threats to validity are even greater. Researchers may also be more likely to thoroughly check results that depart from their expectations than those that confirm their expectations, resulting in “disconfirmation bias” (Nuzzo, 2015). 

As scientists, we should accept that errors and cognitive biases are human tendencies and use strategies to minimize them. We make the case that practicing internal replication should increase the reproducibility of published evidence and the efficiency of the scientific process. 

What is internal replication, and how does it work? 

Internal replication is a process through which investigators from an original study team independently replicate a computational workflow in order to identify and resolve errors, helping to thwart biases that occur during computational analyses prior to publication. In many ways, this practice is akin to replication of experiments in different laboratories. 

Here’s how analysts can apply it: 

  1. Once data collection is complete, review the pre-analysis plan (if one exists) and agree with your co-investigators upon any changes to it. 
  2. Choose a tolerance level, or the maximum difference between the results that will be allowed in order to consider them replicated. 
  3. Prepare computational datasets independently without sharing any code with your co-investigators. Typically this involves merging raw datasets and creating variables for analysis. 
  4. Compare features of your computational datasets to ensure they are functionally the same (e.g., each variable has the same range and mean). If you find discrepancies, repeat steps 3 and 4 until you resolve them. 
  5. Conduct analyses independently from your co-investigators without sharing any code. 
  6. Compare results, repeating steps 5 and 6 until the difference between your results is smaller than the tolerance level. While performing internal replication, we found it helpful to use an R Shiny dashboard to compare results efficiently; example code for this dashboard is available here. A brief sketch of these comparisons also appears just after this list. 
  7. Once the difference in the results is smaller than the tolerance level, your results are internally replicated! 
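
To make the comparison steps concrete, here is a minimal sketch in R of what steps 4, 6, and 7 might look like for two analysts. The file names, summary features, and tolerance value are hypothetical placeholders, not code from our trials or from the dashboard linked above.

    # Minimal sketch of steps 4, 6, and 7 (hypothetical file names, variables, and tolerance)
    library(dplyr)

    tolerance <- 1e-6  # maximum allowable difference chosen in step 2

    # Step 4: compare summary features of the two independently prepared datasets
    data_a <- read.csv("analysis_data_analyst_a.csv")
    data_b <- read.csv("analysis_data_analyst_b.csv")

    summarize_features <- function(d) {
      num <- d[sapply(d, is.numeric)]  # summarize numeric variables only
      data.frame(
        variable     = names(num),
        n_nonmissing = sapply(num, function(x) sum(!is.na(x))),
        mean         = sapply(num, mean, na.rm = TRUE),
        min          = sapply(num, min,  na.rm = TRUE),
        max          = sapply(num, max,  na.rm = TRUE),
        row.names    = NULL
      )
    }

    dataset_check <- full_join(summarize_features(data_a), summarize_features(data_b),
                               by = "variable", suffix = c("_a", "_b")) %>%
      mutate(mean_diff   = abs(mean_a - mean_b),
             range_match = (min_a == min_b) & (max_a == max_b))

    # Any discrepancy flagged here sends both analysts back to data preparation (step 3)
    filter(dataset_check, mean_diff > tolerance | !range_match)

    # Steps 6 and 7: compare final estimates, one row per parameter in each results file
    results_a <- read.csv("results_analyst_a.csv")  # hypothetical columns: parameter, estimate
    results_b <- read.csv("results_analyst_b.csv")

    results_check <- inner_join(results_a, results_b,
                                by = "parameter", suffix = c("_a", "_b")) %>%
      mutate(abs_diff   = abs(estimate_a - estimate_b),
             replicated = abs_diff < tolerance)

    # The analysis is internally replicated once every estimate agrees within the tolerance
    all(results_check$replicated)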

Internal replication requires additional resources compared to a study with a single analyst. If resources are limited, one option is to internally replicate only the most error-prone portion of the analysis. If only one analyst is available, they can replicate their own code by using an alternative software package or by writing the same error-prone portion of their code twice. 
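
As a rough illustration of this single-analyst variant, the short R sketch below computes the same quantity two ways, once with a regression model and once directly from group means, and checks agreement against the pre-specified tolerance. The data file and variable names are hypothetical.

    # Single-analyst self-replication sketch (hypothetical file, variables, and tolerance)
    tolerance <- 1e-6
    d <- read.csv("analysis_data.csv")  # assumed to contain `outcome` and a 0/1 `treated` indicator

    # Version 1: difference in group means estimated with a linear model
    diff_lm <- unname(coef(lm(outcome ~ treated, data = d))["treated"])

    # Version 2: the same difference computed directly from the group means
    group_means <- tapply(d$outcome, d$treated, mean, na.rm = TRUE)
    diff_direct <- unname(group_means["1"] - group_means["0"])

    # The estimate is self-replicated if the two versions agree within the tolerance
    abs(diff_lm - diff_direct) < tolerance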

“Internal replication should contribute to more rapid scientific advances by improving the quality of published evidence.”

How does internal replication reduce errors and bias? 

The internal replication process reduces errors and bias in the following ways: 

  • Reducing errors: When coding independently, you are unlikely to make the same mistakes as your co-investigators. As a result, coding errors are likely to surface as discrepancies in datasets and results that must be discussed and corrected in order to complete the internal replication process. 
  • Reducing confirmation bias: Even when using a pre-analysis plan, many small judgment calls must be made while coding, such as how to handle missing values or generate complex composite variables. These seemingly inconsequential decisions are brought to light when comparing differences in data and results, providing an opportunity to discuss these decisions and assumptions transparently. 
  • Reducing disconfirmation bias: The internal replication process requires you to comprehensively compare all independently generated results—not just those that do not conform with expectation. 

What needs to happen for internal replication to be adopted more broadly? 

There are several ways journals and funders can support researchers who wish to adopt internal replication: 

  • Grant application criteria that favor proposals that involve internal replication will incentivize researchers to consider it from the earliest stages of planning. For example, the National Institutes of Health has established rigor and reproducibility criteria for funding applications—these could include internal replication as one of many tools for enhancing reproducibility. 
  • Financial support from funders for internal replication would allow researchers to recruit additional analysts to assist with replication and to plan ahead since internal replication may affect the timeline for analysis completion. 
  • Scientific manuscript review criteria for editors and peer reviewers could include internal replication as one of many tools for enhancing reproducibility, and manuscripts that can demonstrate they have completed internal replication could receive expedited review. 
  • Kite-marks / badges could be placed on the front page of articles to recognize studies that performed internal replication. The journals Psychological Science and Biostatistics recently introduced badges for studies that adhered to open science and reproducibility practices, a change that increased the proportion of articles sharing data in an accessible fashion (Kidwell et al., 2016; Peng, 2011; Rowhani-Farid et al., 2018).

While human error and confirmation bias are inevitable, scientists can use tools that anticipate and minimize them. Internal replication is one such tool. If adopted widely, internal replication should increase the reliability of published evidence and the efficiency of the scientific process. 


Jade Benjamin-Chung, PhD MPH, is an Epidemiologist at UC Berkeley. Her research applies cutting-edge causal inference and machine learning techniques to study interventions to control, eliminate, or eradicate infectious diseases, including interventions to prevent malaria, diarrhea, soil-transmitted helminths, and influenza. She has conducted research in Haiti, Thailand, Myanmar, Bangladesh, and Kenya. She is the recipient of a K01 Career Development Award from the National Institute of Allergy and Infectious Diseases focused on novel epidemiologic methods to evaluate malaria eradication interventions in southern Africa. 

Benjamin F. Arnold, PhD is an Assistant Professor in the Francis I. Proctor Foundation at UCSF. He is an infectious disease epidemiologist and biostatistician with expertise in clinical trials and causal inference methods. His research focuses on seroepidemiology, interventions to reduce infection and improve nutrition in low-resource settings, and neglected tropical disease elimination.
