2 Reproducible research

This section is also available as a slideshow. Right click or hold Ctrl or Command and click this link to view in full screen.

2.1 The Reproducibility Crisis

“Replication is the ultimate standard by which scientific claims are judged.” (Peng 2011)

Replication is one of the fundamental tenets of science, and if the results of a study or experiment cannot be replicated by an independent set of investigators, then whatever scientific claims were made should be treated with caution! At best, it suggests that evidence for the claim is weak or mixed, or specific to particular ecosystems or other circumstances and cannot be generalized. At worst, there was error (or even dishonesty) in the original study and the claims were plainly false.

In other words, published research should be robust enough, and the methods described in enough detail, that anyone else should be able to repeat the study (using the publication only) and find similar results. Sadly, this is rarely the case!!!


Figure 2.1: ‘Is there a reproducibility* crisis?’ Results from a survey of >1500 top scientists (Baker 2016; Penny 2016). *Note that they did not discern between reproducibility and replicability, and that the terms are often used interchangeably.


We have a problem…

Since we’re failing the gentleman’s agreement¹ that we’ll describe our methods in enough detail that anyone else should be able to repeat the study (using the publication only) and find similar results, modern scientists are trying to formalize the process in the form of Reproducible Research.

Reproducible research makes use of modern software tools to share data, code and other resources required to allow others to reproduce the same result as the original study, thus making all analyses open and transparent.

Working reproducibly is not just a requirement for using quantitative approaches in iterative decision-making - it is central to scientific progress!!!


2.2 Replication and the Reproducibility Spectrum

While full replication is a huge challenge (and sometimes impossible) to achieve, it is something all scientists should be working towards.

Understandably, some studies may not be entirely replicable purely due to the nature of the data or phenomenon (e.g. rare phenomena, long term records, loss of species or ecosystems, or very expensive once-off science projects like space missions). In these cases the “gold standard” of full replication (from new data collection to results) cannot be achieved, and we have to settle for a lower rung on the reproducibility spectrum (Figure 2.2).



Figure 2.2: The Reproducibility Spectrum (Peng 2011).


Reproducibility falls short of full replication because it focuses on reproducing the same result from the same data set, rather than analyzing independently collected data. While this may seem trivial, you’d be surprised at how few studies are even reproducible, let alone replicable.


2.3 Reproducible Scientific Workflows


Figure 2.3: ‘Data Pipeline’ from xkcd.com/2054, used under a CC-BY-NC 2.5 license.


Working reproducibly requires careful planning and documentation of each step in your scientific workflow, from planning your data collection to sharing your results.

This entails a number of overlapping/intertwined components, namely:

  • Data management - which we’ll spend more time on in Chapter 3
  • File and folder management
  • Coding and code management - i.e. the data manipulation and analyses performed
  • Computing environment and software
  • Sharing of the data, metadata, code, publications and any other relevant materials

For the rest of this section we’ll work through these components and some of the tools that help you achieve this.


2.3.1 File and folder management

Project files and folders can get unwieldy fast, and can really bog you down and inhibit productivity when you don’t know where your files are or what the latest version is.


Figure 2.4: ‘Documents’ from xkcd.com/1459, used under a CC-BY-NC 2.5 license.


The two main considerations for addressing this issue are

  1. defining a simple, common, intuitive folder structure, and
  2. using informative file names.

Folders

Most ecological projects have similar requirements. Here’s a screenshot of how I usually (try to) manage my folders (a minimal R sketch for setting up this structure follows the list below).



  • “Code” we’ll deal with in the next section, but it obviously contains the R code etc used to perform analyses.
  • Within “Data” I often have separate folders of “Raw” and “Processed” data.
    • If the data files are big and used across multiple projects (e.g. GIS files), then they’ll often be in a separate folder elsewhere on my computer, but this is well-documented in my code.
  • “Output” contains figures and tables, often in separate folders.
  • I also often have a “Manuscript” folder if I’m working in LaTeX/Sweave, RMarkdown or Quarto, although this is often in the “Code” folder (since you can embed code in RMarkdown, Quarto and Sweave documents).
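
You can even script the creation of this structure so that every new project starts out the same way. Here’s a minimal sketch in R, assuming the folder names I use above:

# Create a standard set of project folders (silently skipping any that exist)
folders <- c("Code", "Data/Raw", "Data/Processed",
             "Output/Figures", "Output/Tables", "Manuscript")
for (f in folders) dir.create(f, recursive = TRUE, showWarnings = FALSE)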


File and folder naming

Your naming conventions should be:

  • machine readable
    • i.e. avoid spaces and funny punctuation
    • support searching and splitting of names (e.g. “data_raw_CO2.csv”, “data_clean_CO2.csv” and “data_raw_species.csv” can all be searched by keyword and can be split by “_” into three useful fields: type (data vs other), class (raw vs clean) and variable (CO2 vs species) - see the sketch after this list)
  • human readable
    • the contents should be self evident from the file name
  • support sorting
    • i.e. use numeric or character prefixes to separate files into different components or steps (e.g. “data_raw_localities.csv”, “data_clean_localities.csv”, etc)
    • some of this can be handled with folder structure, but you don’t want too many folders either
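
To see the payoff of machine-readable names, here’s a minimal sketch in R that searches for and splits the example file names above (assuming they sit in your working directory):

# Find all CSV files whose names start with "data"
files <- list.files(pattern = "^data_.*\\.csv$")

# Split each name on "_" into its type, class and variable fields
fields <- strsplit(tools::file_path_sans_ext(files), "_")
# e.g. "data_raw_CO2.csv" -> "data" "raw" "CO2"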

Find out more about file naming here.


2.3.2 Coding and code management

Why write code?

Working in point-and-click GUI-based software like Excel, Statistica, SPSS, etc may seem easier, but you’ll regret it in the long run…

The beauty of writing code lies in:

  • Automation
    • You will inevitably have to adjust and repeat your analysis as you get feedback from supervisors, collaborators and reviewers. Rerunning code is one click, and you’re unlikely to introduce errors. Rerunning analyses in GUI-based software is lots of clicks and it’s easy to make mistakes, alter default settings, etc etc.
    • Next time you need to do the same analysis on a different dataset you can just copy, paste and tweak your code.
  • Your code/script provides a record of your analysis
  • Linked to the above, mature scientific coding languages like Python or R allow you to run almost any kind of analysis in one scripted workflow, even if it has diverse components like GIS, phylogenetics, multivariate or Bayesian statistics, etc.
    • Most proprietary software is limited to one or a few specialized areas (e.g. ArcGIS for GIS), which leaves you manually exporting and importing data between multiple software packages. This is very cumbersome, in addition to being a file-management nightmare…
  • Most scripting environments are open source (e.g. R, Python, JavaScript, etc)
    • Anyone wanting to use your code doesn’t have to pay for a software license
    • It’s great for transparency - lots of people can (and have) checked the underlying code and functions you’re using, whereas for most proprietary analytical software only the owner’s employees have access to the source code
    • There’s usually a culture of sharing code (online forums, with publications, etc)

Here’s a motivation and some tutorials to help you learn R.


Some coding rules

It’s easy to write messy code. This can make it virtually indecipherable to others (and even yourself), slowing you and your collaborators down. It also makes it easy to make mistakes and not notice them. The overarching rule is to write code for people, not computers. Check out the Tidyverse style guide for R-specific guidance, but here are some basic rules:

  • use consistent, meaningful and distinct names for variables and functions
  • use consistent code and formatting style
  • use commenting to document and explain what you’re doing at each step or in each function - purpose, inputs and outputs
  • “notebooks” like RMarkdown, Quarto or Jupyter Notebooks are very handy for fulfilling roles like documentation, master/makefiles etc and can be developed into reports or manuscripts
  • write functions rather than repeating the same code
  • modularise code into manageable steps/chunks
    • or even separate them into separate scripts that can all be called in order from a master script or Makefile
  • check for mistakes at every step!!! Beyond errors or warnings, do the outputs make sense?
  • start with a “recipe” that outlines the steps/modules (usually as commented headers etc). This is very valuable for keeping you organized and on track, e.g. a common recipe in R (sketched as a full script after this list):
    • #Header indicating purpose, author, date, version etc
    • #Define settings
    • #Load required libraries
    • #Read in data
    • #Wrangle/reformat/clean/summarize data as required
    • #Run analyses (often multiple steps)
    • #Wrangle/reformat/summarize analysis outputs for visualization
    • #Visualize outputs as figures or tables
  • avoid proprietary formats
    • i.e. use an open source scripting language and open source file formats only
  • use version control!!!
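
To make the “recipe” concrete, here’s a minimal sketch of what such a script might look like in R (the data file and its treatment and flux columns are hypothetical):

# Purpose: compare CO2 flux between treatments
# Author: A. Student | Date: 2024-01-01 | Version: 0.1

# Define settings
data_file <- "Data/Processed/data_clean_CO2.csv"

# Load required libraries
library(tidyverse)

# Read in data
co2 <- read_csv(data_file)

# Wrangle/clean data as required
co2 <- co2 %>% filter(!is.na(flux))

# Run analyses
fit <- lm(flux ~ treatment, data = co2)
summary(fit)

# Visualize outputs as figures or tables
ggplot(co2, aes(x = treatment, y = flux)) + geom_boxplot()
ggsave("Output/Figures/flux_by_treatment.png")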


Version control

Using version control tools like Git, SVN, etc can be challenging at first, but they can also hugely simplify your code development (and adaptation) process. While they were designed by software developers for software development, they are hugely useful for quantitative biology.

I can’t speak authoritatively on version control systems (I’ve only ever used Git and GitHub), but here are the advantages as I see them. This list is specific to Git, but I imagine they all have similar functions and functionality:

Words in italics are technical terms used within GitHub. You can look them up here. You’ll also cover it in the brief tutorial you’ll do when setting up your computer for the practical.

  • They generally help project management, especially collaborations
  • They allow you to easily share code with collaborators or the public at large - through repositories or gists (code snippets)
  • Users can easily adapt or build on each other’s code by forking repositories and working on their own branch.
    • This is truly powerful!!! It allows you not only to repeat/replicate analyses, but even to build websites (like this one!), etc
  • While the whole system is online, you can also work offline by cloning the repository to your local machine. Once you have a local version you can push to or pull from the online repository to keep everything updated (see the sketch after this list)
  • Changes are tracked and reversible through commits. If you change the contents of a repository you must commit them and write a commit message before pulling or pushing to the online repository. Each commit is essentially a recoverable version that can be compared or reverted to
    • This is the essence of version control and magically frees you from folders full of lists of files named “mycode_final.R”, “mycode_finalfinal.R”, “myfinalcode_finalfinal.R” etc as per Figure 2.4
  • They allow collaborators or the public at large to propose changes via pull requests that allow you to merge their forked branch back to the main (or master) branch
  • They allow you to accept and integrate changes seamlessly when you accept and merge pull requests
  • They allow you to keep a written record of changes through comments whenever a commit or pull request is made - these also track the user, date, time, etc and are useful for “blaming” when things go wrong
  • There’s a system for assigning, logging and tracking issues and feature requests
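
You’ll use Git via the command line or RStudio in the practical, but just to demystify the clone-commit-push cycle described above, here’s a sketch of the same workflow driven from R via the gert package (the repository URL and file name are hypothetical):

# Clone an online repository to a local folder
library(gert)
git_clone("https://github.com/username/myproject.git", path = "myproject")

# ...edit some files, then stage and commit the changes with a message...
git_add("code/analysis.R", repo = "myproject")
git_commit("Fix error in model formula", repo = "myproject")

# Pull any collaborators' changes, then push your commit to the online repository
git_pull(repo = "myproject")
git_push(repo = "myproject")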

I’m sure this is all a bit much right now, but should make more sense after the practical…


2.3.3 Computing environment and software

We’ve already covered why you should use open source software whenever possible, but it bears repeating. Using proprietary software means that others have to purchase software, licenses, etc to build on your work and essentially makes it not reproducible by putting it behind a paywall. This is self-defeating…

Another issue is that software and hardware change with upgrades, new versions or shifts in the preferences of user communities (e.g. you’ll all know Microsoft Excel, but have you heard of Quattro Pro or Lotus 1-2-3, the preferred spreadsheet software of yesteryear?).

Just sharing your code, data and workflow does not make your work reproducible if we don’t know what language the code is written in or if functions change or are deprecated in newer versions, breaking your code.

The simplest way to avert this problem is to carefully document the hardware and versions of software used in your analyses so that others can recreate that computing environment if needed. This is very easy in R, because you can simply run the sessionInfo() function, like so:

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.0
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Africa/Johannesburg
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] readxl_1.4.3    lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
##  [6] purrr_1.0.2     readr_2.1.5     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.4  
## [11] tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.8         utf8_1.2.4         generics_0.1.3     jpeg_0.1-10       
##  [5] stringi_1.8.3      hms_1.1.3          digest_0.6.34      magrittr_2.0.3    
##  [9] evaluate_0.23      grid_4.3.2         timechange_0.3.0   RColorBrewer_1.1-3
## [13] bookdown_0.37      fastmap_1.1.1      cellranger_1.1.0   plyr_1.8.9        
## [17] jsonlite_1.8.8     GGally_2.2.0       fansi_1.0.6        scales_1.3.0      
## [21] jquerylib_0.1.4    cli_3.6.2          rlang_1.1.3        munsell_0.5.0     
## [25] withr_3.0.0        cachem_1.0.8       yaml_2.3.8         tools_4.3.2       
## [29] tzdb_0.4.0         colorspace_2.1-0   ggstats_0.5.1      png_0.1-8         
## [33] vctrs_0.6.5        R6_2.5.1           lifecycle_1.0.4    pkgconfig_2.0.3   
## [37] pillar_1.9.0       bslib_0.6.1        gtable_0.3.4       glue_1.7.0        
## [41] Rcpp_1.0.12        highr_0.10         xfun_0.41          tidyselect_1.2.0  
## [45] rstudioapi_0.15.0  knitr_1.45         farver_2.1.1       htmltools_0.5.7   
## [49] labeling_0.4.3     rmarkdown_2.25     compiler_4.3.2
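
A good habit is to save this information with your outputs every time you run an analysis, e.g. by adding a line like this (the output path is just a suggestion) to the end of your scripts:

# Write the current computing environment to a text file alongside your outputs
writeLines(capture.output(sessionInfo()), "output/sessionInfo.txt")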


Containers

A “better” way to do this is to use containers like Docker or Singularity. These are contained, lightweight computing environments, similar to virtual machines, that you can package with your software/workflow. You set your container up to have everything you need to run your code etc (and nothing extra), so anyone can download (or clone) your container, code and data and run your analyses perfectly the first time.


2.3.4 Sharing of the data, code, publication etc

This is touched on in more detail when we discuss data management in Chapter 3, but suffice to say there’s no point working reproducibly if you’re not going to share all the components necessary to complete your workflow…

Another key component here is that ideally all your data, code, publication etc are shared Open Access - i.e. they are not stuck behind some paywall.



Figure 2.5: A 3-step, 10-point checklist to guide researchers toward greater reproducibility in their research (Alston and Rick 2021).


2.4 Why work reproducibly?


Figure 2.6: Let’s start being more specific about our miracles… Cartoon © Sidney Harris. Used with permission ScienceCartoonsPlus.com


In addition to basic scientific rigour, working reproducibly is hugely valuable, because:

(Adapted from “Five selfish reasons to work reproducibly” (Markowetz 2015))

  1. It’s transparent and open, helping us avoid mistakes and/or track down errors in analyses
  • This is what highlighted the importance of working reproducibly for me. In 2017 I published the first evidence of observed climate change impacts on biodiversity in the Fynbos Biome (Slingsby et al. 2017). The analyses were quite complicated, and when working on the revisions I found an error in my R code. Fortunately, it didn’t change the results qualitatively, but it made me realize how easy it is to make a mistake and potentially put the wrong message out there! This encouraged me to make all data and R code from the paper available, so that anyone is free to check my data and analyses and let me (and/or the world) know if they find any errors.
  2. It makes it easier to write papers
  • e.g. Dynamic documents like RMarkdown or Jupyter Notebooks update automatically when you change your analyses, so you don’t have to copy/paste or save/insert all tables and figures - or worry about whether you included the latest versions (see the small sketch after this list).
  3. It helps the review process
  • Often the issues reviewers pick up on are matters of clarity/confusion. Sharing your data and analyses allows them to see exactly what you did, not just what you said you did, allowing them to identify the problem and make constructive suggestions.
  • It’s also handy to be able to respond to a reviewer’s comment with something like:
    • “That’s a great suggestion, but not really in line with the objectives of the study. We have chosen not to include the suggested analysis, but do provide all data and code so that interested readers can explore this for themselves.” (Feel free to copy and paste - CC0 1.0)
  4. It enables continuity of the research
  • When people leave a project (e.g. students/postdocs), or you forget what you did X days/weeks/months/years ago, it can be a serious setback for a project and make it difficult for you or a new student to pick up where things left off. If the data and workflow are well curated and documented this problem is avoided. Trust me, this is a very common problem!!! I have many papers that I (or my students) never published and may never go back to, because I know it’ll take me a few days or weeks to understand the datasets and analyses again…
  • This is obviously incredibly important for long-term projects!
  • A little bit of extra effort early on can save a lot of time further down the road!!!
  5. It helps to build your reputation
  • Working reproducibly makes it clear you’re an honest, open, careful and transparent researcher, and should errors be found in your work you’re unlikely to be accused of dishonesty (e.g. see my paper example under point 1 - although no one has told me of any errors yet…).
  • When others reuse your data, code, etc you’re likely to get credit for it - either just informally, or formally through citations or acknowledgements (depending on the licensing conditions you specify - see “Preserve” in the Data Life Cycle).
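
For a taste of point 2, here’s a minimal sketch of a dynamic RMarkdown document (the data file and its columns are hypothetical). Knitting it reruns the analysis, so the inline numbers always match the latest data:

---
title: "CO2 analysis"
output: html_document
---

```{r, include = FALSE}
library(tidyverse)
co2 <- read_csv("data_clean_CO2.csv")   # hypothetical data file
fit <- lm(flux ~ treatment, data = co2) # hypothetical columns
```

We analysed `r nrow(co2)` CO2 measurements. The estimated treatment
effect was `r round(coef(fit)[2], 2)` units.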

And some less selfish reasons (and relevant for ecoforecasting):

  6. It speeds progress in science by allowing you (or others) to rapidly build on previous findings and analyses
  • Somewhat linked to point 4, but here the focus is on building on published work. For example, if I read a paper and have an idea (or develop a new hypothesis) that may explain some of their results or add to the story, I can start from where they left off rather than collecting new data, recoding their whole analysis, etc before I can even start.
  7. It allows easy comparison of new analytical approaches (methods, models, etc) to older ones
  • Linked to 6, but more specific to model or methods development, where the need to “benchmark” your new method against an older one is important. If the benchmark model exists you can make an honest comparison, but if you have to set up the older one yourself, some may accuse you of gaming the system by choosing settings etc that advantage your new method.
  8. It makes it easy to repeat the same analyses when new data are collected or added
  • This is key for iterative forecasting, but also useful if you want to apply the same method/model to a new data set (e.g. applying Merow et al. (2014)’s analysis to all 26 Proteaceae studied by Treurnicht et al. (2016)).
  9. Most skills and software used in Reproducible Research are very useful beyond Reproducible Research alone!
  • e.g. GitHub - in addition to versioning code, it is great for code management and collaborative projects, and can be used for all kinds of things like building websites (e.g. these course notes).

And one more selfish reason (but don’t tell anyone I said this):

  10. Reproducible research skills are highly sought after in careers like data science etc…
  • Skills are important should you decide to leave biology…
  • Even within biology, more and more environmental organizations and NGOs are hiring data scientists or scientists with strong data and quantitative skills. Some examples I know of:


2.5 Barriers to working reproducibly

(Adapted from “A Beginner’s Guide to Conducting Reproducible Research” (Alston and Rick 2021))

1. Complexity

  • There can be a bit of a learning curve in getting to know and use the tools for reproducible research effectively.
  • One is always tempted by the “easy option” of doing it the way you already know or using “user-friendly” proprietary software.

2. Technological change

  • Hardware and software used in analyses change over time - either changing with updates or going obsolete altogether - making it very difficult to rerun old analyses.
  • This should be less of a problem going forward because:
    • it is something people are now aware of (so we’re working on solutions)
    • we’re increasingly using open source software, for which older versions are usually still made available and there is little risk of it disappearing when the software company stops supporting the software or goes bankrupt
    • documenting hardware and software versions with analyses is an easy baseline
    • increasingly, people are using contained computing environments, as discussed in the Containers section above

3. Human error

  • Simple mistakes or failure to fully document protocols or analyses can easily make a study irreproducible.
  • Most reproducible research tools are aimed at solving this problem.

4. Intellectual property rights

  • Rational self-interest can lead to hesitation to share data and code via many pathways:
    • fear of not getting credit; concern that the materials shared will be used incorrectly or unethically; etc
  • Hopefully most of these issues will be solved by better awareness of licensing issues, attribution, etc, as the culture of reproducible research grows


References

Alston, Jesse M, and Jessica A Rick. 2021. “A beginner’s guide to conducting reproducible research.” Bulletin of the Ecological Society of America 102 (2). https://doi.org/10.1002/bes2.1801.
Baker, Monya. 2016. “1,500 scientists lift the lid on reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.
Markowetz, Florian. 2015. “Five selfish reasons to work reproducibly.” Genome Biology 16 (December): 274. https://doi.org/10.1186/s13059-015-0850-7.
Merow, Cory, Andrew M Latimer, Adam M Wilson, Sean M McMahon, Anthony G Rebelo, and John A Silander Jr. 2014. “On using integral projection models to generate demographically driven predictions of species’ distributions: development and validation using sparse data.” Ecography 37 (12): 1167–83. https://doi.org/10.1111/ecog.00839.
Peng, Roger D. 2011. “Reproducible research in computational science.” Science 334 (6060): 1226–27. https://doi.org/10.1126/science.1213847.
Penny, Dan. 2016. “Nature Reproducibility survey,” May. https://doi.org/10.6084/m9.figshare.3394951.v1.
Slingsby, Jasper A, Cory Merow, Matthew Aiello-Lammens, Nicky Allsopp, Stuart Hall, Hayley Kilroy Mollmann, Ross Turner, Adam M Wilson, and John A Silander Jr. 2017. “Intensifying postfire weather and biological invasion drive species loss in a Mediterranean-type biodiversity hotspot.” Proceedings of the National Academy of Sciences of the United States of America 114 (18): 4697–4702. https://doi.org/10.1073/pnas.1619014114.
Treurnicht, Martina, Jörn Pagel, Karen J Esler, Annelise Schutte-Vlok, Henning Nottebrock, Tineke Kraaij, Anthony G Rebelo, and Frank M Schurr. 2016. “Environmental drivers of demographic variation across the global geographical range of 26 plant species.” Edited by Roberto Salguero-Gómez. The Journal of Ecology 104 (2): 331–42. https://doi.org/10.1111/1365-2745.12508.

  1. I know many may find the use of this term offensive. In fact, I have used it here because that offense highlights my point. “Gentlemen’s agreements” have been used for nefarious purposes since the dawn of time and should have no place in science as we strive to make the discipline more transparent, open and inclusive.↩︎