Jasper Slingsby
“Replication is the ultimate standard by which scientific claims are judged.” - Peng (2011)
Sadly, we have a problem…
‘Is there a reproducibility crisis?’ - A survey of >1500 scientists (Baker 2016; Penny 2016).
Reproducible research makes use of modern software tools to share data, code, etc. so that others can reproduce the same results as the original study, thus making all analyses open and transparent.
Let’s start being more specific about our miracles… Cartoon © Sidney Harris. Used with permission ScienceCartoonsPlus.com
“Five selfish reasons to work reproducibly” (Markowetz 2015)
Some less selfish reasons:
It speeds scientific progress by making it easier to build on previous findings and analyses
It allows easy comparison of new analytical approaches to older ones
It makes it easy to repeat analyses on new data, e.g. for ecological forecasting or long-term ecological research (LTER)
The tools are useful beyond research, e.g. making websites, presentations
Reproducible research skills are highly sought after!
From “A Beginner’s Guide to Conducting Reproducible Research” (Alston and Rick 2021):
‘Data Pipeline’ from xkcd.com/2054, used under a CC-BY-NC 2.5 license.
Working reproducibly requires careful planning and documentation of each step in your scientific workflow, from planning your data collection to sharing your results.
These workflows entail overlapping and intertwined components, namely:
This is a big topic and has a separate section in my notes.
Read the notes, as this is NB information for you to know.
Data loss is the norm… Good data management is key!!!
The ‘Data Decay Curve’ (Michener et al. 1997)
The Data Life Cycle, adapted from https://www.dataone.org/
Good data management begins with planning. You essentially outline the plan for every step of the cycle in as much detail as possible.
Fortunately, there are online data management planning tools that make it easy to develop a Data Management Plan (DMP).
Screenshot of UCT’s Data Management Planning Tool’s Data Management Checklist.
A DMP is a living document and should be regularly revised during the life of a project!
I would argue that it is foolish to collect data without doing quality assurance and quality control (QA/QC) as you go, irrespective of how you are collecting the data.
An example data collection app I built in AppSheet that allows you to log GPS coordinates, take photos, record various fields, etc.
There are many tools that allow you to do quality assurance and quality control as you collect the data (or progressively, shortly after each data collection event). Even MS Excel or Google Sheets with controlled fields, etc. can do the job.
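For illustration only, here is a minimal sketch of what such checks might look like in R, assuming a hypothetical CSV of field records with latitude, longitude and species columns (the file name, column names and coordinate bounds are placeholders):
# Hypothetical example: basic QA/QC checks on newly entered field data
records <- read.csv("field_records.csv")

# Flag coordinates that fall outside a plausible study region
bad_coords <- which(records$latitude < -35 | records$latitude > -22 |
                      records$longitude < 16 | records$longitude > 33)

# Flag missing species names and duplicated rows
missing_species <- which(is.na(records$species) | records$species == "")
dupes <- which(duplicated(records))

# Report anything that needs checking before the next data collection event
problems <- sort(unique(c(bad_coords, missing_species, dupes)))
if (length(problems) > 0) warning("Check rows: ", paste(problems, collapse = ", "))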
“The fun bit”, but again, there are many things to bear in mind and keep track of so that your analysis is repeatable. This is largely covered by the sections on Coding and code management and Computing environment and software below.
Artwork @allison_horst
Project files and folders can get unwieldy fast and really bog you down!
The main considerations are:
Most projects have similar requirements
Here’s how I usually manage my folders:
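By way of illustration (the folder names below are examples only and may differ from the structure shown), a project skeleton like this can be set up in a couple of lines of R:
# Create a typical set of project folders (illustrative names, not a prescription)
folders <- c("data/raw", "data/processed", "code", "output/figures", "output/tables", "doc")
for (f in folders) dir.create(f, recursive = TRUE, showWarnings = FALSE)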
“Point-and-click” software like Excel, Statistica, SPSS, etc. may seem easier, but you’ll regret it in the long run, e.g. when you have to rerun an analysis or remember what you did.
Coding is communication. Messy code is bad communication. Bad communication hampers collaboration and makes it easier to make mistakes…
Streamline, collaborate, reuse, contribute, and fail safely…
It’seasytowritemessyindecipherablecode!!! - Write code for people, not computers!!!
Check out the Tidyverse style guide for R-specific guidance, but here are some basics:
# Header indicating purpose, author, date, version, etc.
# Define settings and load required libraries
# Read in data
# Wrangle/reformat/clean/summarize data as required
# Run analyses (often multiple steps)
# Wrangle/reformat/summarize analysis outputs for visualization
# Visualize outputs as figures or tables
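Fleshed out, a script following that outline might look something like the sketch below (the file names, variables and model are placeholders, not part of any real analysis):
# Purpose: example script layout | Author: Your Name | Date: YYYY-MM-DD | Version: 0.1

# Define settings and load required libraries
library(ggplot2)

# Read in data (placeholder file name)
dat <- read.csv("data/raw/survey.csv")

# Wrangle/reformat/clean/summarize data as required
dat <- dat[!is.na(dat$height) & !is.na(dat$rainfall), ]

# Run analyses (often multiple steps)
fit <- lm(height ~ rainfall, data = dat)

# Wrangle/reformat/summarize analysis outputs for visualization
dat$predicted <- predict(fit)

# Visualize outputs as figures or tables
p <- ggplot(dat, aes(x = rainfall, y = height)) +
  geom_point() +
  geom_line(aes(y = predicted))
ggsave("output/figures/height_vs_rainfall.png", plot = p)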
Version control tools can be challenging, but they also hugely simplify your workflow!
The advantages of version control:
Your work is stored online in repositories (“repos”) or gists (code snippets).
You work on a local copy made by cloning the repo to your local PC. You can “push to” or “pull from” the online repo to keep versions in sync.
Changes are committed with a commit message. Each commit is a recoverable version that can be compared or reverted to.
Collaborators can contribute by forking repos and working on their own branch.
Through pull requests, repo owners can accept and integrate changes seamlessly by reviewing and merging the forked branch back to the main branch.
Commits and pull requests provide a written record of changes and track the user, date, time, etc. - all of which are useful for tracking mistakes and “blaming” when things go wrong.
You can assign, log and track issues and feature requests.
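To give a rough, concrete feel for the basic cycle, here is a hedged sketch using the gert R package (one of several interfaces to Git; the command-line git client works just as well, and the repository URL and file name below are hypothetical):
# Clone, edit, commit and sync - a minimal Git cycle from R via the gert package
library(gert)

git_clone("https://github.com/your-username/my-analysis.git")  # local copy of the repo
setwd("my-analysis")

# ... edit files, e.g. code/analysis.R ...

git_add("code/analysis.R")                     # stage the changed file
git_commit("Fix outlier filtering in QA/QC")   # record a version with a message
git_push()                                     # sync your changes to the online repo
git_pull()                                     # or pull collaborators' changes down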
Artwork by @allison_horst CC-BY-4.0
Interestingly, all that is tracked are the commits, i.e. the named versions (the nodes in the image). All that the online Git repo records is the figure below: the black is the OWNER’s main branch and the blue is the COLLABORATOR’s fork.
Sharing your code and data is not enough to maintain reproducibility…
Software and hardware change with upgrades, versions or user community preferences!
The simple solution is to carefully document the hardware and versions of software used so that others can recreate that computing environment if needed.
In R you can simply use the sessionInfo() function, giving details like so:
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Africa/Johannesburg
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.5.1
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.3 knitr_1.48 rlang_1.1.4
[5] xfun_0.47 generics_0.1.3 jsonlite_1.8.9 labeling_0.4.3
[9] glue_1.8.0 colorspace_2.1-1 htmltools_0.5.8.1 scales_1.3.0
[13] fansi_1.0.6 rmarkdown_2.28 grid_4.4.1 evaluate_0.24.0
[17] munsell_0.5.1 tibble_3.2.1 fastmap_1.2.0 yaml_2.3.10
[21] lifecycle_1.0.4 compiler_4.4.1 dplyr_1.1.4 RColorBrewer_1.1-3
[25] pkgconfig_2.0.3 rstudioapi_0.16.0 farver_2.1.2 digest_0.6.37
[29] R6_2.5.1 tidyselect_1.2.1 utf8_1.2.4 pillar_1.9.0
[33] magrittr_2.0.3 withr_3.0.2 tools_4.4.1 gtable_0.3.5
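One simple way to keep that record with the project is to write it to a file each time the analysis is run, for example (the output path is a placeholder):
# Save a snapshot of the computing environment alongside the analysis outputs
writeLines(capture.output(sessionInfo()), "output/sessionInfo.txt")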
A better solution is to use containers like Docker or Singularity.
These are lightweight, self-contained computing environments, similar to virtual machines, that you can package with your software and workflow.
You set your container up to have everything you need to run your workflow (and nothing extra), so anyone can download (or clone) your container, code and data and run your analyses perfectly every time.
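As a rough sketch (not a recipe), a minimal Dockerfile for an R workflow could look something like this, assuming the rocker/r-ver base image and placeholder package and script names:
# Illustrative Dockerfile: a container that runs a single R analysis script
FROM rocker/r-ver:4.4.1

# Install only the R packages the workflow needs (and nothing extra)
RUN Rscript -e "install.packages('ggplot2')"

# Copy the project (code and data) into the image and set the working directory
COPY . /home/project
WORKDIR /home/project

# Run the analysis when the container starts
CMD ["Rscript", "code/analysis.R"]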
This is covered in more detail in the data management section, but suffice to say there’s no point working reproducibly if you’re not going to share all the components necessary to complete your workflow…
Another key component is that, ideally, all your data, code, publications, etc. are shared Open Access.