“Replication is the ultimate standard by which scientific claims are judged.” - Peng (2011)
Replication is one of the fundamental tenets of science.
Findings from studies that cannot be independently replicated should be treated with caution!
Either they are not generalisable (cf. prediction) or, worse, there was an error in the study!
The Reproducibility Crisis
Sadly, we have a problem…
‘Is there a reproducibility crisis?’ - A survey of >1500 scientists (Baker 2016; Penny 2016).
Reproducible Research
Makes use of modern software tools to share data, code, etc., allowing others to reproduce the same result as the original study, thus making all analyses open and transparent.
This is central to scientific progress!!!
BONUS: working reproducibly facilitates automated workflows needed for iterative ecological forecasting!
Replication vs Reproducibility
Reproducibility falls short of full replication because it focuses on reproducing the same result from the same data set, rather than analyzing independently collected data.
This difference may seem trivial, but you’d be surprised at how few studies are even reproducible, let alone replicable.
Replication and the Reproducibility Spectrum
Full replication is a huge challenge, and sometimes impossible, e.g.
rare phenomena, long term records, very expensive projects like space missions, etc
Where the “gold standard” of full replication cannot be achieved, we have to settle for a lower rung somewhere on the Reproducibility Spectrum (Peng 2011)
Working reproducibly requires careful planning and documentation of each step in your scientific workflow from planning your data collection to sharing your results.
Automation - reusing code is one click, and you’re unlikely to introduce errors
A script provides a record of your analysis
Uninterrupted workflows - scientific coding languages like Python or R allow you to run almost any kind of analysis in one scripted workflow
GIS, phylogenetics, multivariate or Bayesian statistics, etc
saves you manually exporting and importing data between software packages
Most coding languages are open source (e.g. R, Python, JavaScript, etc)
Free! No one has to pay to reuse any code you share
Transparent - You (and others) can check the background code and functions you’re using, not just the software company
A culture of sharing code (online forums, with publications, etc)
Some coding rules
It’s easy to write messy, indecipherable code!!! - Write code for people, not computers!!!
Check out the Tidyverse style guide for R-specific guidance, but here are some basics:
use consistent, meaningful and distinct names for variables and functions
use consistent code and formatting style - indents, spaces, line-breaks, etc
modularize code into manageable steps/chunks
or separate scripts that can be called in order from a master script or Makefile
use commenting to explain what you’re doing at each step or in each function
“notebooks” like RMarkdown, Quarto, Jupyter or Sweave allow embedded code, simplifying documentation, master scripts/Makefiles, etc., and can be used to write manuscripts, presentations or websites (e.g. all my teaching materials)
write functions rather than repeating the same code
check for mistakes at every step!!! Do the outputs make sense?
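Several of these rules come together in one small sketch (in R, since that is the language used below; the function and variable names here are invented purely for illustration):

```r
# Rather than copy-pasting the same centring/scaling lines for every
# variable, write one clearly named, commented function and reuse it
standardise <- function(x) {
  # Centre and scale a numeric vector to mean 0, sd 1
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

rainfall    <- c(12, 30, 7, 22)    # toy data for illustration
temperature <- c(18, 21, 15, 24)

rainfall_std    <- standardise(rainfall)
temperature_std <- standardise(temperature)
```

If the method ever needs to change, you edit one function rather than hunting down every repeated copy, which is exactly where copy-paste errors creep in.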
Some coding rules continued…
start with a “recipe”
outline the steps/modules before you start coding to keep you on track
e.g. a common recipe in R (using commented headers):
# Header indicating purpose, author, date, version, etc.
# Define settings and load required libraries
# Read in data
# Wrangle/reformat/clean/summarize data as required
# Run analyses (often multiple steps)
# Wrangle/reformat/summarize analysis outputs for visualization
# Visualize outputs as figures or tables
avoid proprietary formats! i.e. use open source scripting languages and file formats
use version control!!!
Version control
Version control tools can be challenging, but also hugely simplify your workflow!
The advantages of version control:
They generally help project management, especially collaborations
They allow easy code sharing with collaborators or the public at large - through repositories (“repos”) or gists (code snippets)
The system is online, but you can also work offline by cloning the repo to your local PC. You can “push to” or “pull from” the online repo to keep versions in sync.
Changes are tracked and reversible through commits
Any changes in a repo must be committed with a commit message. Each commit is a recoverable version that can be compared or reverted to
This is the essence of version control and magically frees you from duplicated files!
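The clone/commit/push cycle above can be demonstrated in a throwaway repo (a hedged sketch using the git command line; in practice you would clone an existing online repo rather than creating one locally, and the file names here are invented):

```shell
# Create a demo repo (stands in for cloning one from GitHub)
mkdir demo-repo && cd demo-repo
git init
git config user.name "Your Name"       # who gets credited in the history
git config user.email "you@example.org"

echo "x <- rnorm(100)" > analysis.R    # a first script
git add analysis.R
git commit -m "Add initial analysis script"

echo "hist(x)" >> analysis.R           # make a change...
git commit -am "Add histogram of x"    # ...and record it as a new version

git log --oneline                      # every commit is recoverable
# For an online repo you would then run: git push (and git pull to sync)
```

Each `git commit` is a snapshot you can compare against or revert to, which is what makes duplicated `analysis_final_v2_REALLY_final.R` files unnecessary.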
Version control continued…
Users can easily adapt or build on each others’ code by forking repos and working on their own branch.
This allows you to repeat/replicate analyses or even build websites (like this one!)
Collaborators can propose changes via pull requests
Repo owners can accept and integrate changes seamlessly by reviewing and merging the forked branch back into the main branch
Comments associated with commits or pull requests provide a written record of changes and track the user, date, time, etc - all of which are useful for tracking mistakes and assigning blame when things go wrong
You can assign, log and track issues and feature requests
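The branch-and-merge cycle behind pull requests looks like this on the command line (a hedged sketch, run inside an existing repo whose default branch is called main; the branch and file names are invented):

```shell
git checkout -b new-figure         # create and switch to a new branch
echo "plot(x, y)" > figure.R       # work safely on the branch...
git add figure.R
git commit -m "Draft a new figure"

git checkout main                  # switch back to the main branch
git merge new-figure               # integrate the branch's commits
                                   # (a merged pull request does this online)
```

A pull request is essentially this merge step wrapped in an online review: collaborators can comment on the proposed commits before the repo owner merges them.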
This should all make more sense after the practical, but here are some pretty pictures to drive some of this home…
Sharing your code and data is not enough to maintain reproducibility…
Software and hardware change with upgrades, versions or user community preferences!
You’ll all know Microsoft Excel, but have you heard of Quattro Pro or Lotus, the preferred spreadsheet software of yesteryear?
The simple solution is to carefully document the hardware and versions of software used so that others can recreate that computing environment if needed.
In R, you can simply run the sessionInfo() function, giving details like so:
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.0
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.2.2 fastmap_1.1.0 cli_3.6.0 tools_4.2.2
[5] htmltools_0.5.4 rstudioapi_0.14 yaml_2.3.7 rmarkdown_2.20
[9] knitr_1.42 xfun_0.36 digest_0.6.31 jsonlite_1.8.4
[13] rlang_1.0.6 evaluate_0.20
Containers are self-contained, lightweight computing environments, similar to virtual machines, that you can package with your software/workflow.
You set your container up to have everything you need to run your workflow (and nothing extra), so anyone can download (or clone) your container, code and data and run your analyses perfectly first time.
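A container is typically specified with a short build recipe. As a hedged sketch in Docker syntax (the rocker/r-ver base image and package list are illustrative assumptions, and master_script.R is a hypothetical entry point for the workflow):

```dockerfile
# Pin an exact R version via a fixed base image
FROM rocker/r-ver:4.2.2

# Install the packages the workflow needs (and nothing extra)
RUN R -e "install.packages(c('dplyr', 'ggplot2'))"

# Copy code and data into the container
COPY . /home/analysis
WORKDIR /home/analysis

# Run the whole workflow when the container starts
CMD ["Rscript", "master_script.R"]
```

Anyone who builds and runs this image gets the same R version, the same packages and the same entry point, regardless of what is installed on their own machine.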
5. Sharing data, code, publication, etc.
This is covered in more detail in the data management lecture, but suffice to say there’s no point working reproducibly if you’re not going to share all the components necessary to complete your workflow…
Another key component here is that ideally all your data, code, publication etc are shared Open Access - i.e. they are not stuck behind some paywall…
A 3-step, 10-point checklist to guide researchers toward greater reproducibility (Alston and Rick 2021).
References
Alston, Jesse M, and Jessica A Rick. 2021. “A beginner’s guide to conducting reproducible research.” Bulletin of the Ecological Society of America 102 (2). https://doi.org/10.1002/bes2.1801.
Baker, Monya. 2016. “1,500 scientists lift the lid on reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.
Peng, Roger D. 2011. “Reproducible research in computational science.” Science 334 (6060): 1226–27. https://doi.org/10.1126/science.1213847.