1 Overview

This is a minimal introduction to reproducible research, including data management and handling data in R, compiled for the Biological Sciences BSc(Honours) class at the University of Cape Town.

1.1 General

This really is a minimalist introduction. We only have a week! I’ll focus on providing a broad overview of the general framework and motivation for reproducible research (including good data management), teaching a few practical skills along the way.

Mostly this is not fun and exciting, but it is important stuff for any biologist to know. I’ll try my best to make it interesting! Hopefully by the end of the module you’ll see the value in it all - both for you as an individual and for science and society in general.

“Let us emphasize again this obvious conclusion: a scholar’s positive contribution is measured by the sum of the original data that [they] contributes. Hypotheses come and go but data remain. Theories desert us, while data defend us. They are our true resources, our real estate, and our best pedigree. In the eternal shifting of things, only they will save us from the ravages of time and from the forgetfulness or injustice of [people]. To risk everything on the success of one idea is to forget that every fifteen or twenty years theories are replaced or revised. So many apparently conclusive theories in physics, chemistry, geology, and biology have collapsed in the last few decades!” - Santiago Ramón y Cajal, 1906 Nobel Laureate, from Advice for a Young Investigator 1898 (Ramón y Cajal 1999)

The core outcomes/concepts I hope you’ll come away with:

  • Familiarity with the concepts of Open, Reproducible Science, and an understanding of why it is needed
  • Familiarity with The Data Life Cycle
  • Some data management and handling skills

1.2 Lectures/Discussions/Tutorials

These will be held live in person in BIO LT1 from 10AM to 12PM from the 4th to the 7th April unless otherwise announced on Vula.

I’ll be adding to (and mostly teaching from) these online course notes as we go along.

The schedule of lectures (and readings) is as follows:

Friday we head off on the field trip…

1.3 Deliverables (Due Thursday the 9th March)

  1. A draft Data Management Plan (DMP) for your Honours Project using UCT’s Online DMP Tool. Please use the “University of Cape Town (UCT) - Full DMP” template.

  2. A GitHub repository containing suitably named sub-folders, data files (if small) and the R scripts (that you’ll develop based on Wednesday’s tutorial), all in line with best practice as per the content of this module.

  • More about the R script during the Tidy data tutorial, but it must be easily executable by one of your classmates, and output your data in tidy format and a summary figure of some kind.
  • Since you may not want your data to be public, it is best to create a private repository and invite me as a collaborator. If your data are large (>5MB), then it’s best to create and upload only a smaller subset of the data. Note that your R script must still work with the reduced dataset, since part of your mark will be based on whether your script runs and whether the output is reproducible.
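To give a rough idea of the shape such a script might take, here is a minimal sketch. The data, column names and output file names are all made up for illustration (your own project will differ), and it assumes the tidyr and ggplot2 packages (both part of the Tidyverse) are installed. It is not a template you must follow.

```r
# Sketch only: the data and file names below are hypothetical
library(tidyr)    # part of the Tidyverse
library(ggplot2)  # part of the Tidyverse

# Made-up "wide" data: plant heights (cm), with one column per site
raw <- data.frame(
  plot_id = c("A", "B", "C"),
  site_1  = c(12.1, 15.3, 9.8),
  site_2  = c(14.0, 11.2, 13.5)
)

# Reshape to tidy format: one row per observation, one column per variable
tidy <- pivot_longer(
  raw,
  cols = c(site_1, site_2),
  names_to = "site",
  values_to = "height_cm"
)

# Use relative paths so the script runs on a classmate's machine too
dir.create("output", showWarnings = FALSE)
write.csv(tidy, file.path("output", "heights_tidy.csv"), row.names = FALSE)

# A simple summary figure, saved alongside the tidy data
fig <- ggplot(tidy, aes(x = site, y = height_cm)) +
  geom_boxplot() +
  labs(x = "Site", y = "Height (cm)")
ggsave(file.path("output", "heights_summary.png"), fig, width = 5, height = 4)
```

Relative paths (rather than something like a path to your own Desktop) are what make the script executable by someone else who has cloned your repository.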

1.4 Software installation and setup

For the data wrangling exercise and the second deliverable, we’ll be using the R statistical programming language and the Git version control system. We’ll also be using RStudio, an integrated development environment (IDE) for R, and GitHub, an online platform for hosting and sharing Git repositories.

If you already have these installed and set up, please make sure you have the latest versions, and check that your installations are working! Please also make sure you have installed (and/or updated) the Tidyverse set of R packages. It can be installed using the code install.packages("tidyverse"). To update it, re-run that command, or use tidyverse::tidyverse_update() to check whether the component packages are up to date.
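One quick way to check what you currently have, sketched here using only base R functions, is to run the following in the R console. It only reports what is installed and does not change anything; the object name has_tidyverse is just for illustration.

```r
# Report the version of R that is currently running
print(R.version.string)

# Check whether the tidyverse meta-package is installed (nothing is loaded or changed)
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  message("tidyverse version: ", as.character(packageVersion("tidyverse")))
} else {
  message("tidyverse is not installed; run install.packages(\"tidyverse\")")
}
```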

The installation and setup can be a bit long-winded, but once done you should be good to go until you change or reformat your computer. The steps below are my summary and (hopefully) more intuitive adaptation of the instructions provided for setting up GitHub and version control with R. If my steps don’t work, it’s probably best to read up there.


First we’ll start with the necessary software.

  1. Download and install the latest version of R
  2. Download and install the latest free version of RStudio Desktop
  3. Download and install the latest version of Git - accept all the defaults

Then get started with GitHub:

  1. Create a GitHub account
  2. Run through the 10 minute GitHub tutorial that is offered when you activate your GitHub Account (It’ll really help you get the idea behind what Git does!)

Now you have RStudio, R and Git installed, and you have a working GitHub account that lets you do stuff online, but what remains is to get GitHub working locally and to configure RStudio to use GitHub.

  1. Install GitHub CLI (Command Line Interface). For Windows you can download the installer here
  2. Open RStudio.
  • Select the Terminal tab (top left, next to Console)
  • Enter gh auth login, then follow the prompts:
    • Select GitHub.com
    • When prompted for your preferred protocol for Git operations, select HTTPS
    • When asked if you would like to authenticate to Git with your GitHub credentials, enter Y
    • When asked how you would like to authenticate select Login with web browser
    • Copy the 8-character one-time code and hit Enter
    • GitHub.com will open in your internet browser - paste the code and hit Enter
    • If any of these steps don’t work, just start again with gh auth login in Terminal
  3. In RStudio
  • Go to Global Options (from the Tools menu)
  • Click Git/SVN
  • Make sure Enable version control interface for RStudio projects is on
  • If necessary, enter the path for your Git or SVN executable where provided (this shouldn’t be needed, but may be)
  • Click Apply
  • Restart RStudio
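If you want to confirm that things are wired up after the steps above, something like the following sketch can be run in the R console. It assumes Git and GitHub CLI are on your system PATH; the object names git_path and gh_path are just for illustration. The path reported for Git is also what you would enter in the Git/SVN options if RStudio did not find it automatically.

```r
# Where is the Git executable? An empty string means it was not found on the PATH
git_path <- Sys.which("git")
print(git_path)

if (nzchar(git_path)) {
  # Print the installed Git version
  print(system2("git", "--version", stdout = TRUE))
}

# Same check for GitHub CLI
gh_path <- Sys.which("gh")
print(gh_path)

if (nzchar(gh_path)) {
  # Report whether GitHub CLI is currently logged in to an account
  system2("gh", c("auth", "status"))
}
```

If either path comes back empty, the corresponding program is either not installed or not visible to R, and it is worth re-running that installation step.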

If that hasn’t worked, have a look at the installation section of Happy Git with R to troubleshoot…

Lastly, you need to install the Tidyverse set of R packages. This can be done using the code install.packages("tidyverse").

References

Baker, Monya. 2016. “1,500 scientists lift the lid on reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.
Markowetz, Florian. 2015. “Five selfish reasons to work reproducibly.” Genome Biology 16 (December): 274. https://doi.org/10.1186/s13059-015-0850-7.
Michener, William K, and Matthew B Jones. 2012. “Ecoinformatics: supporting ecology as a data-intensive science.” Trends in Ecology & Evolution 27 (2): 85–93. https://doi.org/10.1016/j.tree.2011.11.016.
Peng, Roger D. 2011. “Reproducible research in computational science.” Science 334 (6060): 1226–27. https://doi.org/10.1126/science.1213847.
Ramón y Cajal, Santiago. 1999. Advice for a young investigator. The MIT Press. https://doi.org/10.7551/mitpress/1133.001.0001.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software, Articles 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.