Jasper Slingsby
“Replication is the ultimate standard by which scientific claims are judged.” - Peng (2011)
Sadly, we have a problem…
‘Is there a reproducibility crisis?’ - A survey of >1500 scientists (Baker 2016; Penny 2016).
Reproducible research makes use of modern software tools to share data, code, etc. so that others can reproduce the same results as the original study, thus making all analyses open and transparent.
Let’s start being more specific about our miracles… Cartoon © Sidney Harris. Used with permission ScienceCartoonsPlus.com
“Five selfish reasons to work reproducibly” (Markowetz 2015)
Some less selfish reasons:
It speeds scientific progress by making it easier to build on previous findings and analyses
It allows easy comparison of new analytical approaches to older ones
It makes it easy to repeat analyses on new data, e.g. for ecological forecasting or long-term ecological research (LTER)
The tools are useful beyond research, e.g. making websites, presentations
Reproducible research skills are highly sought after!
From “A Beginner’s Guide to Conducting Reproducible Research” (Alston and Rick 2021):
‘Data Pipeline’ from xkcd.com/2054, used under a CC-BY-NC 2.5 license.
Working reproducibly requires careful planning and documentation of each step in your scientific workflow, from planning your data collection to sharing your results.
These workflows entail overlapping and intertwined components, namely:
This is a big topic and has a separate section in my notes.
Read the notes, as this is NB information for you to know.
Data loss is the norm… Good data management is key!!!
The ‘Data Decay Curve’ (Michener et al. 1997)
The Data Life Cycle, adapted from https://www.dataone.org/
Good data management begins with planning. You essentially outline the plan for every step of the cycle in as much detail as possible.
Fortunately, there are online data management planning tools that make it easy to develop a Data Management Plan (DMP).
Screenshot of UCT’s Data Management Planning Tool’s Data Management Checklist.
A DMP is a living document and should be regularly revised during the life of a project!
I would argue that it is foolish to collect data without doing quality assurance and quality control (QA/QC) as you go, irrespective of how you are collecting the data.
An example data collection app I built in AppSheet that allows you to log GPS coordinates, take photos, record various fields, etc.
There are many tools that allow you to do quality assurance and quality control as you collect the data (or progressively, shortly after each data collection event). Even MS Excel or Google Sheets with controlled fields, etc. can do the job.
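For illustration only, here is a minimal sketch of what such checks might look like in R, assuming a hypothetical CSV of field records with latitude, longitude and species columns (the file name, column names and coordinate bounds are placeholders):
# Hypothetical example: basic QA/QC checks on newly entered field data
records <- read.csv("field_records.csv")

# Flag coordinates that fall outside a plausible study region
bad_coords <- which(records$latitude < -35 | records$latitude > -22 |
                      records$longitude < 16 | records$longitude > 33)

# Flag missing species names and duplicated rows
missing_species <- which(is.na(records$species) | records$species == "")
dupes <- which(duplicated(records))

# Report anything that needs checking before the next data collection event
problems <- sort(unique(c(bad_coords, missing_species, dupes)))
if (length(problems) > 0) warning("Check rows: ", paste(problems, collapse = ", "))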
“The fun bit”, but again, there are many things to bear in mind and keep track of so that your analysis is repeatable. This is largely covered by the sections on Coding and code management and Computing environment and software below.
Artwork @allison_horst
Project files and folders can get unwieldy fast and really bog you down!
The main considerations are:
Most projects have similar requirements
Here’s how I usually manage my folders:
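By way of illustration (the folder names below are examples only and may differ from the structure shown), a project skeleton like this can be set up in a couple of lines of R:
# Create a typical set of project folders (illustrative names, not a prescription)
folders <- c("data/raw", "data/processed", "code", "output/figures", "output/tables", "doc")
for (f in folders) dir.create(f, recursive = TRUE, showWarnings = FALSE)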
“Point-and-click” software like Excel, Statistica, SPSS, etc. may seem easier, but you’ll regret it in the long run, e.g. when you have to rerun an analysis or remember what you did.
Coding is communication. Messy code is bad communication. Bad communication hampers collaboration and makes it easier to make mistakes…
Streamline, collaborate, reuse, contribute, and fail safely…
It’seasytowritemessyindecipherablecode!!! - Write code for people, not computers!!!
Check out the Tidyverse style guide for R-specific guidance, but here are some basics:
# Header indicating purpose, author, date, version, etc.
# Define settings and load required libraries
# Read in data
# Wrangle/reformat/clean/summarize data as required
# Run analyses (often multiple steps)
# Wrangle/reformat/summarize analysis outputs for visualization
# Visualize outputs as figures or tables
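Fleshed out, a script following that outline might look something like the sketch below (the file names, variables and model are placeholders, not part of any real analysis):
# Purpose: example script layout | Author: Your Name | Date: YYYY-MM-DD | Version: 0.1

# Define settings and load required libraries
library(ggplot2)

# Read in data (placeholder file name)
dat <- read.csv("data/raw/survey.csv")

# Wrangle/reformat/clean/summarize data as required
dat <- dat[!is.na(dat$height) & !is.na(dat$rainfall), ]

# Run analyses (often multiple steps)
fit <- lm(height ~ rainfall, data = dat)

# Wrangle/reformat/summarize analysis outputs for visualization
dat$predicted <- predict(fit)

# Visualize outputs as figures or tables
p <- ggplot(dat, aes(x = rainfall, y = height)) +
  geom_point() +
  geom_line(aes(y = predicted))
ggsave("output/figures/height_vs_rainfall.png", plot = p)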
Version control tools can be challenging, but they also hugely simplify your workflow!
The advantages of version control:
Your work is stored online in repositories (“repos”) or gists (code snippets).
You work on a local copy made by cloning the repo to your local PC. You can “push to” or “pull from” the online repo to keep versions in sync.
Changes are committed with a commit message. Each commit is a recoverable version that can be compared or reverted to.
Collaborators can contribute by forking repos and working on their own branch.
Through pull requests, repo owners can accept and integrate changes seamlessly by reviewing and merging the forked branch back to the main branch.
Commits and pull requests provide a written record of changes and track the user, date, time, etc. - all of which are useful for tracking mistakes and “blaming” when things go wrong.
You can assign, log and track issues and feature requests.
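To give a rough, concrete feel for the basic cycle, here is a hedged sketch using the gert R package (one of several interfaces to Git; the command-line git client works just as well, and the repository URL and file name below are hypothetical):
# Clone, edit, commit and sync - a minimal Git cycle from R via the gert package
library(gert)

git_clone("https://github.com/your-username/my-analysis.git")  # local copy of the repo
setwd("my-analysis")

# ... edit files, e.g. code/analysis.R ...

git_add("code/analysis.R")                     # stage the changed file
git_commit("Fix outlier filtering in QA/QC")   # record a version with a message
git_push()                                     # sync your changes to the online repo
git_pull()                                     # or pull collaborators' changes down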
Artwork by @allison_horst CC-BY-4.0
Interestingly, all that is tracked are the commits, i.e. the named versions (the nodes in the image). All that the online Git repo records is the figure below: the black is the OWNER’s main branch and the blue is the COLLABORATOR’s fork.
Sharing your code and data is not enough to maintain reproducibility…
Software and hardware change with upgrades, versions or user community preferences!
The simple solution is to carefully document the hardware and versions of software used so that others can recreate that computing environment if needed.
In R you can simply use the sessionInfo() function, giving details like so:
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Africa/Johannesburg
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.5.1
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.3 knitr_1.48 rlang_1.1.4
[5] xfun_0.47 generics_0.1.3 jsonlite_1.8.9 labeling_0.4.3
[9] glue_1.8.0 colorspace_2.1-1 htmltools_0.5.8.1 scales_1.3.0
[13] fansi_1.0.6 rmarkdown_2.28 grid_4.4.1 evaluate_0.24.0
[17] munsell_0.5.1 tibble_3.2.1 fastmap_1.2.0 yaml_2.3.10
[21] lifecycle_1.0.4 compiler_4.4.1 dplyr_1.1.4 RColorBrewer_1.1-3
[25] pkgconfig_2.0.3 rstudioapi_0.16.0 farver_2.1.2 digest_0.6.37
[29] R6_2.5.1 tidyselect_1.2.1 utf8_1.2.4 pillar_1.9.0
[33] magrittr_2.0.3 withr_3.0.2 tools_4.4.1 gtable_0.3.5
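One simple way to keep that record with the project is to write it to a file each time the analysis is run, for example (the output path is a placeholder):
# Save a snapshot of the computing environment alongside the analysis outputs
writeLines(capture.output(sessionInfo()), "output/sessionInfo.txt")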
A better solution is to use containers like Docker or Singularity.
These are lightweight, self-contained computing environments, similar to virtual machines, that you can package with your software and workflow.
You set your container up to have everything you need to run your workflow (and nothing extra), so anyone can download (or clone) your container, code and data and run your analyses perfectly every time.
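As a rough sketch (not a recipe), a minimal Dockerfile for an R workflow could look something like this, assuming the rocker/r-ver base image and placeholder package and script names:
# Illustrative Dockerfile: a container that runs a single R analysis script
FROM rocker/r-ver:4.4.1

# Install only the R packages the workflow needs (and nothing extra)
RUN Rscript -e "install.packages('ggplot2')"

# Copy the project (code and data) into the image and set the working directory
COPY . /home/project
WORKDIR /home/project

# Run the analysis when the container starts
CMD ["Rscript", "code/analysis.R"]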
This is covered in more detail in the data management section, but suffice to say there’s no point working reproducibly if you’re not going to share all the components necessary to complete your workflow…
Another key component is that, ideally, all your data, code, publications, etc. are shared Open Access.