Jasper Slingsby
Especially for an operational forecast system…
“Replication is the ultimate standard by which scientific claims are judged.” - Peng (2011)
Sadly, we have a problem…
‘Is there a reproducibility crisis?’ - A survey of >1500 scientists (Baker 2016; Penny 2016).
Makes use of modern software tools to share data, code, etc. to allow others to reproduce the same result as the original study, thus making all analyses open and transparent.
Let’s start being more specific about our miracles… Cartoon © Sidney Harris. Used with permission ScienceCartoonsPlus.com
“Five selfish reasons to work reproducibly” (Markowetz 2015)
Some less selfish reasons (and relevant for ecoforecasting):
It speeds scientific progress by facilitating building on previous findings and analyses
It allows easy comparison of new analytical approaches to older ones
It makes it easy to repeat analyses on new data, e.g. for ecological forecasting or long-term ecological research (LTER)
The tools are useful beyond research, e.g. making websites, presentations
Reproducible research skills are highly sought after!
From “A Beginner’s Guide to Conducting Reproducible Research” (Alston and Rick 2021):
1. Complexity
2. Technological change
3. Human error
4. Intellectual property rights
‘Data Pipeline’ from xkcd.com/2054, used under a CC-BY-NC 2.5 license.
Working reproducibly requires careful planning and documentation of each step in your scientific workflow from planning your data collection to sharing your results.
These workflows entail overlapping/intertwined components, namely:
This is a big topic and has a separate section in my notes.
Read the notes as this is NB information for you to know, and the content is still examinable - although I will not expect you to know it in as much detail.
Data loss is the norm… Good data management is key!!!
The ‘Data Decay Curve’ (Michener et al. 1997)
The Data Life Cycle, adapted from https://www.dataone.org/
Good data management begins with planning. You essentially outline the plan for every step of the cycle in as much detail as possible.
Fortunately, there are online data management planning tools that make it easy to develop a Data Management Plan (DMP).
Screenshot of UCT’s Data Management Planning Tool’s Data Management Checklist.
A DMP is a living document and should be regularly revised during the life of a project!
I maintain that it is foolish to collect data without doing quality assurance and quality control (QA/QC) as you go, irrespective of how you are collecting the data.
An example data collection app I built in AppSheet that allows you to log GPS coordinates, take photos, record various fields, etc.
There are many tools that allow you to do quality assurance and quality control as you collect the data (or progressively shortly after data collection events).
“The fun bit”, but again, there are many things to bear in mind and keep track of so that your analysis is repeatable. This is largely covered by the sections on “Coding and code management” and “Computing environment and software” below.
Artwork @allison_horst
Specific forecasting requirements:
Project files and folders can get unwieldy fast and really bog you down!
The main considerations are:
Most projects have similar requirements
Here’s how I usually manage my folders:
“Point-and-click” software like Excel, Statistica, etc. may seem easier, but you’ll regret it in the long run, e.g. when you have to rerun an analysis or remember exactly what you did.
Coding is communication. Messy code is bad communication. Bad communication hampers collaboration and makes it easier to make mistakes…
Streamline, collaborate, reuse, contribute, and fail safely…
It’seasytowritemessyindecipherablecode!!! - Write code for people, not computers!!!
Check out the Tidyverse style guide for R-specific guidance, but here are some basics:
# Header indicating purpose, author, date, version etc.
# Define settings and load required libraries
# Read in data
# Wrangle/reformat/clean/summarize data as required
# Run analyses (often multiple steps)
# Wrangle/reformat/summarize analysis outputs for visualization
# Visualize outputs as figures or tables
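To make this concrete, here is a minimal sketch of a script laid out in that order. The file path, column names and model are hypothetical, not part of any real dataset:

# Purpose: Summarize and plot a vegetation survey (illustrative example)
# Author:  A. Student
# Date:    2025-08-01
# Version: 0.1

# Define settings and load required libraries
library(ggplot2)

# Read in data (hypothetical file with columns: site, rainfall, cover)
veg <- read.csv("data/veg_survey.csv")

# Wrangle/reformat/clean/summarize data as required
veg <- na.omit(veg)
site_means <- aggregate(cover ~ site, data = veg, FUN = mean)

# Run analyses (often multiple steps)
fit <- lm(cover ~ rainfall, data = veg)
summary(fit)

# Wrangle/reformat/summarize analysis outputs for visualization
veg$predicted <- predict(fit)

# Visualize outputs as figures or tables
ggplot(veg, aes(x = rainfall, y = cover)) +
  geom_point() +
  geom_line(aes(y = predicted))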
Version control tools can be challenging, but they also hugely simplify your workflow!
The advantages of version control:
Code is stored and shared online in repositories (“repos”) or gists (code snippets)
You work on a local copy by cloning the repo to your local PC. You can “push to” or “pull from” the online repo to keep versions in sync
Changes are committed with a commit message, creating a recoverable version that can be compared or reverted
Collaborators can contribute by forking repos and working on their own branch
Through pull requests, owners can accept and integrate changes seamlessly by reviewing and merging the forked branch back to the main branch
Commits and pull requests provide a written record of changes and track the user, date, time, etc. - all of which are useful for tracking mistakes and “blaming” when things go wrong
You can assign, log and track issues and feature requests
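If you prefer to drive Git from within R rather than the command line, the gert package wraps the core operations above. A minimal sketch, in which the repository URL, branch name and file names are placeholders (and Git must already be configured with your user name and email):

library(gert)

# Clone an online repo to your local PC (placeholder URL)
git_clone("https://github.com/your-org/your-repo.git", path = "your-repo")
setwd("your-repo")

# Work on your own branch
git_branch_create("my-analysis", checkout = TRUE)

# Stage changes and commit them with a commit message
git_add("analysis.R")
git_commit("Add data cleaning step")

# Keep versions in sync with the online repo
git_pull()
git_push()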
Interestingly, all that is tracked are the commits, which name the versions (the nodes in the image). So all that the online Git repo records is the figure below, where the black line is the OWNER’s main branch and the blue is the COLLABORATOR’s fork.
Artwork by @allison_horst CC-BY-4.0
Sharing your code and data is not enough to maintain reproducibility…
Software and hardware change between users, with upgrades, versions or user community preferences!
You can document the hardware and versions of software used so that others can recreate that computing environment if needed.
In R you can use the sessionInfo() function, which gives the details below:
R version 4.4.3 (2025-02-28)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Africa/Johannesburg
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.5.2
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.5 knitr_1.50 rlang_1.1.6
[5] xfun_0.52 generics_0.1.4 jsonlite_2.0.0 labeling_0.4.3
[9] glue_1.8.0 htmltools_0.5.8.1 scales_1.4.0 rmarkdown_2.29
[13] grid_4.4.3 evaluate_1.0.4 tibble_3.3.0 fastmap_1.2.0
[17] yaml_2.3.10 lifecycle_1.0.4 compiler_4.4.3 dplyr_1.1.4
[21] RColorBrewer_1.1-3 pkgconfig_2.0.3 rstudioapi_0.17.1 farver_2.1.2
[25] digest_0.6.37 R6_2.6.1 tidyselect_1.2.1 dichromat_2.0-0.1
[29] pillar_1.11.0 magrittr_2.0.3 withr_3.0.2 tools_4.4.3
[33] gtable_0.3.6
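A simple habit, assuming a plain R workflow, is to write this information to a text file alongside your outputs at the end of each run (the file name here is just an example):

# At the end of a script: save the session details alongside the outputs
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")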
If your entire workflow is within R, you can use the renv package to manage your R environment.
renv is essentially a package manager.
It creates a snapshot of your R environment, including all packages and their versions, so that anyone can recreate the same environment by running renv::restore().
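A typical renv workflow looks something like this, run from the project’s root directory:

# Initialise renv for the project (creates a project-specific library)
renv::init()

# ...install or update packages as you work...

# Record the exact package versions in the renv.lock file
renv::snapshot()

# On another machine (or later), recreate that environment from the lockfile
renv::restore()

The renv.lock file is what you share (e.g. commit it to your repo) so that others can restore the same package versions.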
A disadvantage is that it only manages R packages - it doesn’t manage the version of R itself, the operating system, or other system-level software your workflow depends on.
Use containers, such as those provided by Docker or Singularity.
Containers provide “images” of contained, lightweight computing environments that you can package with your software/workflow to set up virtual machines with all the necessary software, settings, etc.
You set your container up to have everything you need to run your workflow (and nothing extra), so anyone can download (or clone) your container, code and data and run your analyses perfectly every time.
Containers are usually based on Linux, because other operating systems are not free.
The Rocker project provides a set of Docker images for R and RStudio, which are widely used in the R community.
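As a rough illustration, a Dockerfile for an R workflow built on a Rocker image might look like the sketch below; the image tag, package list and script name are assumptions, not a prescription:

FROM rocker/r-ver:4.4.3

# Install the R packages the workflow needs
RUN R -e "install.packages(c('targets', 'ggplot2'), repos = 'https://cloud.r-project.org')"

# Copy the project (code, data, metadata) into the image
COPY . /home/project
WORKDIR /home/project

# Run the workflow when the container starts
CMD ["Rscript", "run_workflow.R"]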
This is covered by data management, but suffice it to say there’s no point working reproducibly if you’re not going to share all the components necessary to complete your workflow…
Another key component is that, ideally, all your data, code, publications, etc. are shared Open Access.
The key to iterating your workflow, especially for forecasting.
Many options!
targets
The project aims to develop a near-real-time satellite change detection system for the Fynbos Biome using an ecological forecasting approach (www.emma.eco).
The workflow is designed to be run on a weekly basis, with new data ingested and processed automatically.
There are several steps, each of which is run automatically:
Outputs a Quarto website, automatically built from a GitHub repository.
Processing and analysis done in R. Intermediate and final outputs stored as GitHub releases or in GitHub Large File Storage.
R workflow managed by the targets package
GitHub Actions used to automate and run the workflow
Docker container sets up the computing environment
All code, data, metadata, etc are shared on GitHub
The targets workflow
targets workflow diagram from https://wlandau.github.io/targets-tutorial/#8
targets is an R package that allows you to define a workflow as a series of steps, each of which can be run automatically.
The package identifies which steps are out of date and runs them and their dependencies, but ignores unaffected steps, saving computation.
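A minimal sketch of what a _targets.R file can look like (the file path, helper functions and model are hypothetical):

library(targets)

# Helper functions (in larger projects these usually live in separate R scripts)
clean_data <- function(raw) na.omit(raw)
fit_model <- function(dat) lm(cover ~ rainfall, data = dat)

# The pipeline: each target is rerun only when its inputs change
list(
  tar_target(raw_file, "data/veg_survey.csv", format = "file"),
  tar_target(raw_data, read.csv(raw_file)),
  tar_target(clean, clean_data(raw_data)),
  tar_target(model, fit_model(clean))
)

Running targets::tar_make() from the project directory then builds each target in order, skipping any that are already up to date.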
In EMMA, the workflow is defined as a series of R scripts, which are run automatically by GitHub Actions on a weekly basis, triggered by a GitHub runner. targets keeps track of which steps have been run and controls which need to be rerun depending on new data inputs, etc.
Automated checks can be added with R unit-testing packages such as testthat and RUnit.
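For example, here is a hedged sketch of a few testthat checks one might run on newly ingested data each week; the clean data frame and its columns are hypothetical:

library(testthat)

# Checks run on the cleaned data before the rest of the workflow proceeds
test_that("cleaned data has no missing coordinates", {
  expect_false(any(is.na(clean$longitude)))
  expect_false(any(is.na(clean$latitude)))
})

test_that("vegetation cover values are proportions", {
  expect_true(all(clean$cover >= 0 & clean$cover <= 1))
})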