by Jeremy Eberle, MA
University of Virginia
Open data and analysis code promote computational reproducibility, or reproducing the results of an analysis when applying the same code to the same data (Nosek & Errington, 2020). Yet, in a random sample of articles published in “best practice” clinical psychology journals in 2017, only 2% reported data available for sharing (Nutu et al., 2019). Further, only a fraction of results from open datasets in psychology are reproducible (Hardwicke et al., 2021). Given that many researchers lack training in computational reproducibility (Obels et al., 2020), this article describes five tools for openly sharing data and analysis code in clinical science, with a focus on R code (R Core Team, 2021). Companion code is available at https://doi.org/jfck.
Manage Software Versions
One challenge when reproducing results from one’s own or another researcher’s code is knowing the code’s dependencies (e.g., version numbers of R and any loaded packages). Documenting the versions of all software used helps prevent scripts from breaking as dependencies change. Moreover, in R, the groundhog package can be used to load the versions of packages that were current on a given date (Simonsohn & Gruson, 2021). See “1_define_functions.R” for an initial script authors can include in their analysis pipeline that (a) compares the user’s R version against that used to write the code and (b) returns a date for the project. This script can be sourced (source function) at the top of all later scripts, warning the user if their R version differs and loading the project date for use with the groundhog.library function to load packages. The meta.groundhog function can be used to control the version of the groundhog package itself.
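A minimal sketch of such a dependency-check script follows. The expected R version and project date are placeholders (not values from the companion code), and the groundhog calls are shown commented as they would appear in a later script:

```r
# Sketch of an initial dependency-check script; the expected R version
# and project date are placeholders, not values from the companion code.
expected_r_version <- "4.1.2"
groundhog_day <- "2021-12-01"  # date groundhog uses to select package versions

actual_r_version <- paste(R.version$major, R.version$minor, sep = ".")
if (actual_r_version != expected_r_version) {
  warning("This code was written under R ", expected_r_version,
          "; you are running R ", actual_r_version, ".")
}

# In later scripts, after source("1_define_functions.R"):
# library(groundhog)
# meta.groundhog(groundhog_day)              # pin groundhog's own version
# groundhog.library("dplyr", groundhog_day)  # load dplyr as of that date
```

Sourcing this file at the top of every later script means the version check and project date live in one place rather than being copied into each script.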
Reduce Duplication in Code
Another challenge is understanding the analysis code, which may be long and complex. In addition to clearly commenting code and using a style guide (e.g., tidyverse style guide; Wickham, 2021), authors can simplify their code by reducing duplication. For example, in R, functions can be used to extract repeated patterns of code for reuse, and iterative and functional programming tools (e.g., for loops, apply functions) can be used to apply the same procedures to multiple data objects at once. See “2_reduce_duplication.R” for an example of defining a function and applying it to a list of many data objects using either a for loop or the lapply function. See Wickham and Grolemund (2017) and Wickham (2019) for further instruction.
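As an illustration of this pattern (with made-up data, not the contents of the companion script), a cleaning step can be defined once as a function and then applied to a list of data frames with either a for loop or lapply:

```r
# Define once: standardize a numeric column of a data frame.
standardize <- function(df, col) {
  df[[col]] <- (df[[col]] - mean(df[[col]])) / sd(df[[col]])
  df
}

# Toy data: several data frames that need the same cleaning step.
raw <- list(a = data.frame(score = c(1, 2, 3)),
            b = data.frame(score = c(10, 20, 30)))

# Option 1: a for loop over the list's names.
cleaned_loop <- raw
for (name in names(cleaned_loop)) {
  cleaned_loop[[name]] <- standardize(cleaned_loop[[name]], "score")
}

# Option 2: lapply, which returns the transformed list directly.
cleaned_apply <- lapply(raw, standardize, col = "score")
```

Either version keeps the standardization logic in one place, so a fix to the function propagates to every data object it is applied to.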
Track Changes and Collaborate
Git is an open-source version control system for tracking changes in code, and GitHub is a website that hosts different versions of code for backup, open access, and collaboration. After connecting Git, GitHub, and RStudio (for a tutorial, see Bryan, 2021), researchers can safely draft and test code changes on a branch separate from the main branch, compare code versions line by line in a diff (akin to tracked changes in Microsoft Word), commit changes with descriptive messages, tag commits with version numbers, push committed changes from one’s local computer to GitHub (or pull changes from GitHub to one’s local computer), open a pull request to merge changes into someone else’s code, track issues, and more. Git and GitHub enable transparency for all steps of an analysis, and GitHub’s integration with the Open Science Framework (OSF) connects GitHub code to OSF projects. See Vuorre and Curley (2018) for further instruction.
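The workflow above can be sketched at the command line. The commands below run in a throwaway repository with illustrative file and branch names; the push step is commented out because it requires a GitHub remote:

```shell
# Create a throwaway repository to demonstrate the workflow.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

echo 'x <- 1' > clean_data.R
git add clean_data.R && git commit -qm "Add cleaning script"

git checkout -q -b fix-recoding     # draft changes on a separate branch
echo 'x <- 2' > clean_data.R
git diff                            # compare versions line by line (a diff)
git add clean_data.R && git commit -qm "Fix recoding of a variable"
git tag v1.0.1                      # tag the commit with a version number
# git push origin fix-recoding      # would publish the branch to GitHub
```

RStudio's Git pane exposes the same operations (stage, commit, branch, push, pull) through a graphical interface, so the command line is optional once the tools are connected.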
Create a Research Compendium
Open data and code are useful only to the extent that they are accessible, organized, and documented. Data and output should be stored in nonproprietary file formats (e.g., CSV files) and accompanied by a codebook defining the variable names and values. (Before sharing data, ensure all identifiers are removed.) Cleaning and analysis scripts can be numbered in the order they are to be run. All files for an analysis can be put in a parent folder with subfolders for data (and further subfolders for raw, intermediate, and clean data), code, and results. Scripts can be written so that if the working directory is set to the parent folder, relative file paths read input from one folder (e.g., ./data/clean) and write output to another (e.g., ./results). The parent folder can also include a README file that overviews the project; notes the raw data source; and describes the computational environment, runtimes, functionality of each script, and file relations. For further instruction, see Marwick et al. (2018).
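A toy demonstration of this layout follows, built in a temporary directory (the folder names follow the structure described above; the file names and data are illustrative):

```r
# Build a toy compendium: parent folder with data/clean and results subfolders.
parent <- file.path(tempdir(), "example-project")
dir.create(file.path(parent, "data", "clean"), recursive = TRUE)
dir.create(file.path(parent, "results"))
old_wd <- setwd(parent)  # scripts assume the parent folder is the working directory

# Illustrative clean data in a nonproprietary format (CSV).
write.csv(data.frame(id = 1:3, score = c(2, 4, 6)),
          "./data/clean/participants.csv", row.names = FALSE)

dat <- read.csv("./data/clean/participants.csv")       # input via relative path
write.csv(data.frame(mean_score = mean(dat$score)),
          "./results/summary.csv", row.names = FALSE)  # output via relative path

setwd(old_wd)
```

Because every path is relative to the parent folder, another researcher can download the whole folder, set their working directory to it, and run the scripts without editing any paths.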
Release and Cite Data and Code
GitHub can also be used to release specific versions of code; this way, the version used for a given manuscript or subsequent analysis can be documented, accessed, and cited even if the code changes (e.g., during peer review, after publication). For example, Eberle, Baee, et al. (2022) used an adaptation of semantic versioning (https://semver.org/) to version releases of centralized data cleaning scripts for a large clinical trial (Eberle, Daniel, et al., 2022a, 2022b). In their SCHEMA.CONTENT.SCRIPT system (inspired by Koren, 2019), the first release is v1.0.0, the SCHEMA increments when code changes affect the schema of output data (e.g., adding/removing/renaming a table/column; recoding/changing a column’s meaning), the CONTENT increments when changes affect the data content but not the schema (e.g., adding/removing a row), and the SCRIPT increments when changes do not affect the data schema or content. (For details, see Eberle, Baee, et al.’s README, viewable on GitHub: https://github.com/TeachmanLab/MT-Data-CalmThinkingStudy/tree/v1.0.1.) GitHub’s integration with Zenodo mints a digital object identifier for each release for persistent tracking and citation.
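The increment rules can be illustrated with a hypothetical helper function (not code from the cited repository), assuming that, as in semantic versioning, bumping a component resets the components to its right:

```r
# Hypothetical helper implementing the SCHEMA.CONTENT.SCRIPT increment rules
# described above; assumes lower components reset to 0 when a higher
# component increments, as in semantic versioning.
bump_version <- function(version, level = c("schema", "content", "script")) {
  level <- match.arg(level)
  parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  if (level == "schema") {
    parts <- c(parts[1] + 1, 0, 0)         # output data schema changed
  } else if (level == "content") {
    parts <- c(parts[1], parts[2] + 1, 0)  # data content changed, schema intact
  } else {
    parts[3] <- parts[3] + 1               # script-only change
  }
  paste(parts, collapse = ".")
}

bump_version("1.0.0", "content")  # "1.1.0": rows changed, schema unchanged
bump_version("1.1.0", "schema")   # "2.0.0": a column was renamed
```

A scheme like this lets downstream users tell at a glance whether a new release will break their code (SCHEMA), change their results (CONTENT), or leave both untouched (SCRIPT).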
Conclusion
These five tools aim to increase the utility of openly shared data and analysis code and thereby improve the computational reproducibility of research in clinical science, a foundational principle that is not yet a focus of most clinical training programs (Berenbaum et al., 2021). For more instruction in computational reproducibility, see Wilson et al. (2017), and consider asking the growing community of researchers interested in reproducibility for feedback (e.g., propose a paper for a ReproHack or review others’ papers; see https://www.reprohack.org/). Increasing the reproducibility of clinical science holds promise for improving the field’s evidence base and thus advancing our collective efforts to reduce the burden of mental illness.
______________________________________________________________
References
Berenbaum, H., Washburn, J. J., Sbarra, D., Reardon, K. W., Schuler, T., Teachman, B. A., Hollon, S. D., Atkins, M. S., Hamilton, J. L., Hetrick, W. P., Tackett, J. L., Cody, M. W., Klepac, R. K., & Lee, S. S. (2021). Accelerating the rate of progress in reducing mental health burdens: Recommendations for training the next generation of clinical psychologists. Clinical Psychology: Science and Practice, 28(2), 107–123. https://doi.org/10.1037/cps0000007
Bryan, J. (2021). Happy Git and GitHub for the useR. https://happygitwithr.com/index.html
Eberle, J. W., Baee, S., Behan, H. C., Baglione, A. N., Boukhechba, M., Funk, D. H., Barnes, L. E., & Teachman, B. A. (2022). TeachmanLab/MT-Data-CalmThinkingStudy: Centralized data cleaning for MindTrails Calm Thinking Study (v1.0.1). Zenodo. https://doi.org/10.5281/zenodo.6192907
Eberle, J. W., Daniel, K. E., Baee, S., Behan, H. C., Silverman, A. L., Calicho-Mamani, C., Baglione, A. N., Werntz, A., French, N. J., Ji, J. L., Hohensee, N., Tong, X., Boukhechba, M., Funk, D. H., Barnes, L. E., & Teachman, B. A. (2022a). Web-based interpretation training to reduce anxiety: A sequential multiple-assignment randomized trial. Manuscript in preparation.
Eberle, J. W., Daniel, K. E., Baee, S., Behan, H. C., Silverman, A. L., Calicho-Mamani, C., Baglione, A. N., Werntz, A., French, N. J., Ji, J. L., Hohensee, N., Tong, X., Boukhechba, M., Funk, D. H., Barnes, L. E., & Teachman, B. A. (2022b). Web-based interpretation training to reduce anxiety: A sequential multiple-assignment randomized trial [Preregistration, data, analysis code, materials]. https://doi.org/jfdx
Hardwicke, T. E., Bohn, M., MacDonald, K., Hembacher, E., Nuijten, M. B., Peloquin, B. N., deMayo, B. E., Long, B., Yoon, E. J., & Frank, M. C. (2021). Analytic reproducibility in articles receiving open data badges at the journal Psychological Science: An observational study. Royal Society Open Science. https://doi.org/10.1098/rsos.201494
Koren, M. (2019). Semantic versioning for data products. Medium. https://medium.com/data-architect/semantic-versioning-for-data-products-2b060962093
Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging data analytical work reproducibly using R (and friends). The American Statistician, 72(1). https://doi.org/gdhvm8
Nosek, B. A., & Errington, T. M. (2020). What is replication? PLoS Biology, 18(3). https://doi.org/ggqhsg
Nutu, D., Gentili, C., Naudet, F., & Cristea, I. A. (2019). Open science practices in clinical psychology journals: An audit study. Journal of Abnormal Psychology, 128(6), 510–516. https://doi.org/10.1037/abn0000414
Obels, P., Lakens, D., Coles, N. A., Gottfried, J., & Green, S. A. (2020). Analysis of open data and computational reproducibility in Registered Reports in psychology. Advances in Methods and Practices in Psychological Science, 3(2), 229–237. https://doi.org/gg4vw4
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Simonsohn, U., & Gruson, H. (2021). groundhog: The simplest solution to version-control for CRAN packages [Computer software]. https://cran.r-project.org/package=groundhog
Vuorre, M., & Curley, J. P. (2018). Curating research assets: A tutorial on the Git version control system. Advances in Methods and Practices in Psychological Science, 1(2), 219–236. https://doi.org/10.1177/2515245918754826
Wickham, H. (2019). Advanced R. https://adv-r.hadley.nz/index.html
Wickham, H. (2021). The tidyverse style guide. https://style.tidyverse.org/
Wickham, H., & Grolemund, G. (2017). R for Data Science. https://r4ds.had.co.nz/index.html
Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLoS Computational Biology, 13(6). https://doi.org/gbkbwp
Disclaimer: The views and opinions expressed in this newsletter are those of the authors alone and do not necessarily reflect the official policy or position of the Psychological Clinical Science Accreditation System (PCSAS).