Reproducible Research
Jasper Slingsby
The Reproducibility Crisis
“Replication is the ultimate standard by which scientific claims are judged.” - Peng (2011)
- Replication is one of the fundamental tenets of science
- Findings from studies that cannot be independently replicated should be treated with caution!
- Either they are not generalisable (cf. prediction) or worse, there was an error in the study!
The Reproducibility Crisis
Sadly, we have a problem…
![]()
‘Is there a reproducibility crisis?’ - A survey of >1500 scientists (Baker 2016; Penny 2016).
Reproducible Research
Makes use of modern software tools to share data, code, etc. so that others can reproduce the same result as the original study, making all analyses open and transparent.
- This is central to scientific progress!!!
- BONUS: working reproducibly facilitates automated workflows, which is useful for applications like iterative near-term ecological forecasting!
Replication vs Reproducibility
- Reproducibility falls short of full replication because it focuses on reproducing the same result from the same data set, rather than analyzing independently collected data.
- This difference may seem trivial, but you’d be surprised at how few studies are even reproducible, let alone replicable.
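A minimal Python sketch of what "the same result from the same data set" can mean in practice. The data values, the function names (`data_fingerprint`, `bootstrap_mean_ci`) and the choice of a bootstrap are all illustrative assumptions, not part of the original material. The idea is simply that a fixed random seed plus a fingerprint of the input data lets someone else check that they got exactly the same answer from exactly the same data.

```python
import hashlib
import random
import statistics


def data_fingerprint(values):
    """Hash the data so others can confirm they hold the same data set."""
    joined = ",".join(repr(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def bootstrap_mean_ci(values, n_boot=1000, seed=42):
    """Approximate 95% bootstrap interval for the mean, using a fixed seed."""
    rng = random.Random(seed)  # fixed seed: the resampling is repeatable
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(sample))
    means.sort()
    lower = means[int(0.025 * n_boot)]
    upper = means[int(0.975 * n_boot) - 1]
    return lower, upper


# Hypothetical measurements, made up for illustration only.
data = [2.1, 3.4, 1.9, 4.2, 3.3, 2.8, 3.9, 2.5]

first = bootstrap_mean_ci(data)
second = bootstrap_mean_ci(data)
print("data fingerprint:", data_fingerprint(data)[:12])
print("same result on rerun:", first == second)
```

Replication would instead mean collecting new, independent data and checking whether the conclusion still holds, which no amount of rerunning this script can show.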
Replication and the Reproducibility Spectrum
![]()
- Full replication is a huge challenge, and sometimes impossible, e.g. for rare phenomena, long-term records, very expensive projects like space missions, etc.
- Where the “gold standard” of full replication cannot be achieved, we have to settle for a lower rung somewhere on The Reproducibility Spectrum (Peng 2011)
Why work reproducibly?
![]()
Let’s start being more specific about our miracles… Cartoon © Sidney Harris. Used with permission ScienceCartoonsPlus.com
Why work reproducibly?
“Five selfish reasons to work reproducibly” (Markowetz 2015)
- It's transparent and open - helping avoid mistakes or track down errors
- It makes it easier to write papers - faster tracking of changes and manuscript updates
- It helps the review process - reviewers can actually see (and do!) what you did
- It enables continuity of research - simplifying project handover (esp. past to future you!)
- It builds reputation - showing integrity and gaining credit where your work is reused
Why work reproducibly?
Some less selfish reasons:
- It speeds scientific progress by making it easier to build on previous findings and analyses
- It allows easy comparison of new analytical approaches to older ones
- It makes it easy to repeat analyses on new data, e.g. for ecological forecasting or LTER
- The tools are useful beyond research, e.g. for making websites and presentations
Reproducible research skills are highly sought after!
- Skills are important should you decide to leave science…
- Within science, more and more environmental organizations and NGOs are hiring data scientists or scientists with strong data and quantitative skills
Barriers to working reproducibly
From “A Beginner’s Guide to Conducting Reproducible Research” (Alston and Rick 2021):
1. Complexity
- There’s a learning curve in getting to know and use the tools effectively
- One is always tempted by the “easy option” of doing it the way you already know or using “user-friendly” proprietary software
2. Technological change
- Hardware and software change over time, making it difficult to rerun old analyses
- This should be less of a problem as more tools like contained computing environments become available
Barriers to working reproducibly
3. Human error
- Simple mistakes or poor documentation can easily make a study irreproducible.
- Most reproducible research tools are actually aimed at solving this problem!
4. Intellectual property rights
- Rational self-interest can make people hesitant to share data and code, for many reasons:
- Fear of not getting credit; concern that the materials shared will be used incorrectly or unethically; etc.
- Hopefully most of these issues will be solved by better awareness of licensing issues, attribution, etc, as the culture of reproducible research grows
Reproducible Scientific Workflows
![]()
‘Data Pipeline’ from xkcd.com/2054, used under a CC-BY-NC 2.5 license.
Working reproducibly requires careful planning and documentation of each step in your scientific workflow, from planning your data collection to sharing your results.
Reproducible Scientific Workflows
Entail overlapping/intertwined components, namely:
- Data management
- File and folder management
- Coding and code management (data manipulation and analyses)
- Computing environment and software
- Sharing of the data, metadata, code, publications and any other relevant materials
1. Data management
This is a big topic and has a separate section in my notes.
Read the notes, as this is important information for you to know.
1. Data management
Data loss is the norm… Good data management is key!!!
![]()
The ‘Data Decay Curve’ (Michener et al. 1997)
1. Data management
![]()
The Data Life Cycle, adapted from https://www.dataone.org/
Plan
Good data management begins with planning. You essentially outline the plan for every step of the cycle in as much detail as possible.
Fortunately, there are online data management planning tools that make it easy to develop a Data Management Plan (DMP).
![]()
Screenshot of UCT’s Data Management Planning Tool’s Data Management Checklist.
A DMP is a living document and should be regularly revised during the life of a project!
Collect & Assure
I would argue that it is foolish to collect data without doing quality assurance and quality control (QA/QC) as you go, irrespective of how you are collecting the data.
![]()
An example data collection app I built in AppSheet that allows you to log GPS coordinates, take photos, record various fields, etc.
There are many tools that allow you to do quality assurance and quality control as you collect the data (or progressively, shortly after each data collection event). Even MS Excel or Google Sheets with controlled fields will do.
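As a rough illustration of QA/QC "as you go", here is a minimal Python sketch that checks each record against a controlled species list and some plausible ranges as soon as it is captured. Everything in it is a hypothetical assumption for illustration: the field names (`species`, `latitude`, `longitude`, `height_m`, `observed_on`), the species list, and the `check_record` function. It is not how AppSheet, Excel or Google Sheets implement their checks.

```python
from datetime import date

# Hypothetical controlled vocabulary; a real project would load its own list.
ALLOWED_SPECIES = {"Protea repens", "Leucadendron salignum"}


def check_record(record, today):
    """Return a list of QA/QC problems found in one field record (a dict)."""
    problems = []
    if record.get("species") not in ALLOWED_SPECIES:
        problems.append("species not in controlled list")
    lat = record.get("latitude")
    lon = record.get("longitude")
    if not isinstance(lat, (int, float)) or not -90 <= lat <= 90:
        problems.append("latitude missing or out of range")
    if not isinstance(lon, (int, float)) or not -180 <= lon <= 180:
        problems.append("longitude missing or out of range")
    height = record.get("height_m")
    if not isinstance(height, (int, float)) or height <= 0:
        problems.append("height_m missing or not positive")
    observed = record.get("observed_on")
    if not isinstance(observed, date) or observed > today:
        problems.append("observed_on missing or in the future")
    return problems


# Made-up example records: one clean, one with several typical capture errors.
today = date(2024, 3, 1)
good = {"species": "Protea repens", "latitude": -34.1, "longitude": 18.4,
        "height_m": 1.8, "observed_on": date(2024, 2, 28)}
bad = {"species": "Protea repns", "latitude": 134.0, "longitude": 18.4,
       "height_m": -1.2, "observed_on": date(2024, 3, 2)}

for name, record in [("good", good), ("bad", bad)]:
    print(name, check_record(record, today))
```

Running checks like these at capture time means errors can be fixed while you are still in the field, rather than discovered months later during analysis.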
Describe, Preserve, Discover
Global databases:
- GenBank - for molecular data
- TRY - for plant traits
- Dryad - for general biological and environmental data
South African databases:
- SANBI (biodiversity), SAEON (environmental and biodiversity)
“Generalist” repositories:
Integrate & Analyse
“The fun bit”, but again, there are many things to bear in mind and keep track of so that your analysis is repeatable. This is largely covered by the sections on “Coding and code management” and “Computing environment and software” below.
![]()