5 Further Reading
Here are some extra resources researched and shared by the class.
To add a section, please:
- Go to the repository for this web-book on GitHub.
- Click on the “Fork” button towards the upper right to make your own copy of the repository in your own GitHub profile.
- In your copy of the repository on GitHub (online), open “06-futherreading.Rmd”.
- Copy (don’t just edit!) the example text* between the lines below, paste it below the lines, and replace the placeholder content with your own link, name, text, etc. You don’t have to add your name if you don’t want to, as long as I know your Git username.
- Save and add a commit message along the lines of “Added new resource under further reading”.
- Next you want to perform a “pull request”, which will send a request to me to pull your new code into the main (my) version of the repository.
- From your GitHub repo (not mine!), click on “Pull requests” (top left) and hit the green “New pull request” button.
- Follow the instructions, creating a title and message, and confirming that you want to create the pull request (you may be asked to confirm a couple of times).
That should be it!
*Note: For the example below, I copied, pasted and minimally edited text directly from the resource. This is ok, because they shared the resource under an MIT licence and I’m acknowledging them, but for this exercise I would like you to use your own words as far as possible.
5.1 library(osfr) for interacting with the Open Science Framework (OSF) from R
library(osfr) is an R package that allows you to interact with OSF directly from R. OSF is a free and open-source project management repository designed to support researchers across their entire project lifecycle. The service includes unlimited cloud storage and file version history, providing a centralized location for all your research materials that can be kept private, shared with select collaborators, or made publicly available with citable DOIs. You can use library(osfr) to explore publicly accessible projects and download the associated files — all you need to get started is the project’s URL or GUID (global unique identifier). library(osfr) can also be used for project management: creating projects, adding sub-components or directories, and uploading files.
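A minimal sketch of both workflows, assuming the package is installed (the GUID, token, project title and file names below are placeholders, not real OSF objects):

```r
library(osfr)

# Explore a public project and download its files
# ("abc12" is a placeholder GUID; a full project URL works too)
project <- osf_retrieve_node("abc12")
files <- osf_ls_files(project)
osf_download(files, path = "data/")

# Project management: authenticate, create a project, upload a file
# (writing to OSF requires a personal access token)
osf_auth("<YOUR-PERSONAL-ACCESS-TOKEN>")
my_project <- osf_create_project(title = "My reproducible project")
osf_upload(my_project, path = "analysis.R")
```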
Contributor: Jasper
5.2 Lisa DeBruine and Dale Barr (2021) “Data Skills for Reproducible Research”
Producing reproducible research requires learning certain skills and tools that allow you to manage your data efficiently and record the details of your analysis. DeBruine and Barr first show how to create a project in RStudio and the benefits this brings to your reproducible workflow, for example a clear filing system that makes reading in data easy. Secondly, they give an introductory tutorial on R Markdown, including how to embed exploratory visual analyses (such as figures and tables) and preliminary analysis results in your R Markdown script. Finally, they cover citing statistical packages, which is essential in making your data analysis more transparent and easier to replicate.
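On the last point, a quick illustration of how R generates package citations (a minimal example; the choice of package is arbitrary):

```r
# Cite R itself and an installed package
citation()
citation("knitr")

# BibTeX entries for pasting into a reference manager
print(citation("knitr"), style = "bibtex")
```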
Zenodo DOI: 10.5281/zenodo.6527195, or see https://psyteachr.github.io/reprores-v2/repro.html for more.
Contributor: Ash S
5.3 Reproducibility trial: 246 biologists get different results from same data sets
In this article, Anil Oza reports on a paper (under peer review at the time of writing) investigating the reproducibility of ecological data analysis. The paper showed that 246 different scientists obtained widely different results from the same research question and data set. This raises the question of what can be trusted as a true and proper analysis with a correct result. Much of it boils down to there being no agreed structure for how to analyse the data, with different researchers each having their own approach. Part of the proposed solution is using power analyses to find the correct sample sizes, together with other standardised ways of reviewing data that are not necessarily part of our field, such as the robustness tests used in economics. This article and paper further cement the importance of reproducibility and its practices.
Link to article: https://www.nature.com/articles/d41586-023-03177-1
Contributor: Judi
5.4 Conference Talk on Building Reproducible and Replicable Projects
This YouTube video shows a talk by Dan Chen at a conference in New York in 2019. He gives a short but concise introduction to reproducibility vs. replicability. More importantly, he also talks about structuring files in ways that aid reproducibility (Key Moment: File Inputs), which will also help you keep track of your data. He also gives some useful advice on how to draw out your build scripts in a concise way (Key Moment: Draw your build scripts). His talk highlights that reproducible work is not only good scientific practice but also creates an environment that is easier for you to manage in the long term.
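As a small illustration of the “File Inputs” idea (my own sketch, not code from the talk), declaring every input and output path at the top of a script makes them easy to track; the paths and file names here are placeholders:

```r
library(here)  # builds paths relative to the project root

# Declare all file inputs and outputs in one place
input_counts <- here("data", "raw", "counts.csv")
output_plot  <- here("figures", "counts_plot.png")

counts <- read.csv(input_counts)

png(output_plot)
plot(counts)
dev.off()
```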
Contributor: Sarah Frances Visser
5.5 British Ecological Society (2014) “A Guide to Data Management in Ecology and Evolution”
The Guide to Data Management in Ecology and Evolution is a comprehensive resource for early-career researchers, focusing on the importance of data management in producing high-quality, easily accessible, and reusable research data. It explains what data and data management are and why they matter, and provides advice and examples of good data management practices. It covers the various stages of the data lifecycle, including planning, creating, processing, documenting, preserving, sharing, and reusing data. The guide emphasizes the growing importance of collaborative and interdisciplinary research, highlighting the British Ecological Society’s (BES) efforts in promoting data archiving and accessibility, and addresses challenges and best practices in data collection, digitization, processing, documentation, preservation, and sharing. It also covers ethical aspects, such as intellectual property rights, licences, and permissions related to the reuse of data, and gives practical advice on file naming, assuring high-quality data, processing data, version control, documenting data, creating metadata, using non-generic file formats, and organizing data effectively.
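As a small illustration of the file-naming advice (my own example, not taken from the guide), machine-readable names with ISO-8601 dates sort correctly and let you recover metadata in R:

```r
# Sortable, script-friendly file names: ISO-8601 dates, no spaces
files <- c("2024-02-01_site-A_counts.csv",
           "2024-02-15_site-B_counts.csv")

# The naming convention lets you parse metadata straight from the names
dates <- as.Date(substr(files, 1, 10))
sites <- sub("^.{11}(site-[A-Z]).*$", "\\1", files)
data.frame(file = files, date = dates, site = sites)
```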
See: https://www.britishecologicalsociety.org/wp-content/uploads/Publ_Data-Management-Booklet.pdf
Contributor: Melissa Malan
5.6 The British Ecological Society (2017) “A Guide to Reproducible Code in Ecology and Evolution”
The guide, created by the British Ecological Society, is a fully comprehensive, practical guide to achieving reproducibility in ecological and evolutionary research. It covers a wide range of topics, from the initial steps needed to create a project workflow, to organising projects, programming, version control and report writing. The guide uses clear, easy-to-follow instructions, “do’s and don’ts”, and examples of code, file and document naming, citations, etc. These recommendations and tools are tailored to the specific challenges and contexts of data analysis and report writing in ecological and evolutionary research. Where necessary, clear and concise definitions are provided for the reader; these help to unpack the foreign and often confusing terms that come with learning new software like Git. The guide is particularly helpful in explaining how different packages can streamline the entire report-writing process, for example finely controlling how figures are generated and placed within your report using the package knitr.
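For instance, a hedged sketch of the kind of knitr chunk options the guide discusses for figure control in an R Markdown document (the chunk label and option values are arbitrary):

````
```{r scatter, fig.width = 6, fig.height = 4, fig.align = "center", fig.cap = "A caption placed by knitr"}
# An ordinary R plot; knitr sizes, aligns and captions it via the chunk options
plot(cars)
```
````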
See https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf for more.
Contributor: Robyn
5.7 Six factors affecting reproducibility in life science research and how to handle them
https://www.nature.com/articles/d42473-019-00004-y
This advertisement feature gives a good overview of the topic of reproducibility. It starts by outlining what reproducibility is and why it has been problematic in science, given that the majority of published papers are currently not reproducible. The article then explains several factors that cause the lack of reproducibility. These include the inaccessibility of methodological details, the use of misidentified or contaminated samples in microbiology, the difficulty many researchers have with managing complex datasets, bad experimental designs and research practices, cognitive bias, and the pressure to publish novel and extraordinary findings. The article then goes on to recommend ways to produce reproducible research and outlines the work of several initiatives that address and support research reproducibility.
Contributor: Svenja
5.8 Braga et al. (2023) “Not just for programmers: How GitHub can accelerate collaborative and reproducible research in ecology and evolution. Methods in Ecology and Evolution”
https://doi.org/10.1111/2041-210X.14108
This article provides practical guidance for researchers in the field of ecology and evolutionary biology on how to use GitHub. In the article, GitHub is described as “the most used web-based platform which has features for more collaborative, transparent, and reproducible research”. The article outlines features ranging from low to high technical difficulty, including code and coding management, writing a manuscript, conducting peer review, and using automated and continuous integration to streamline analyses, which allows for early detection and correction of errors and can improve confidence in scientific development. In addition, the article provides definitions of basic Git concepts (such as commit, push, pull, fork, repository, etc.) for users who are less familiar with software terms. It also notes practical constraints, for example that GitHub limits committed files to 100 megabytes, so larger files may require external storage alternatives. It further describes how long-term management of laboratory materials, such as lab notebooks, can be done within GitHub, and how GitHub allows centralized communication for all project-related conversation and planning within a team, rather than emailing, making it easier to keep track of progress throughout the lifespan of a project.
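The basic Git operations the article defines (commit, push, pull) can also be driven from within R; here is a minimal sketch using the gert package, which is my choice of tool rather than one named in the article (the URL and file names are placeholders):

```r
library(gert)

# Clone a repository (placeholder URL)
git_clone("https://github.com/user/example-repo.git", path = "example-repo")

# Stage, commit, and push a change: the core GitHub workflow
git_add("analysis.R", repo = "example-repo")
git_commit("Add analysis script", repo = "example-repo")
git_push(repo = "example-repo")
```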
See the Open Science Framework repository (accessible at https://osf.io/bypfm/; Braga et al., 2023) and the GitHub repository (accessible at https://github.com/SORTEE-Github-Hackathon/manuscript) for this study.
Contributor: Ongezile Mpehle
5.9 Grames, E.M. and Elphick, C.S., 2020. Use of study design principles would increase the reproducibility of reviews in conservation biology. Biological Conservation
https://doi.org/10.1016/j.biocon.2019.108385
This scientific article highlights the dangers of biological reviews that are not conducted in a reproducible way. Biological reviews are very useful for gaining an overview of a topic and the trends seen worldwide. However, these reviews need to be taken with a pinch of salt: one should conduct a thorough assessment of the methods before taking the review into account. Grames and Elphick (2020) suggest five steps for conducting a review to the same reproducible standard expected of experiments and field studies. The first step is “searching for studies”: most reviews they examined do not explain how the studies used in the review were found. This highlights the potential for selection bias based on familiarity, access, and the time when the search was done; repeating a review without the full methods of how the search was conducted would be impossible. The second step is “selecting studies for inclusion”: reviews must define their selection criteria. Without clearly defined criteria, selection bias can occur based on familiarity, or on choosing studies that agree with the result the reviewer wishes to achieve. The third step is “critically appraising the quality of included studies”, as different studies are carried out to different standards and thus provide varying strength of evidence. The fourth step is “coding studies and extracting data”: variables need to be defined, just as in regular studies. The fifth step is “synthesizing results”: many reviews do not explain how the results are synthesized, making them irreproducible. The paper ends by emphasizing the need for biological reviews to have a standardised format to ensure reproducibility.
I chose this paper as I have read and used many reviews in projects and research tasks. It makes me question their methods and how one must be careful when using a review paper. They are still valuable but need to be well scrutinised.
Contributor: Hannah Glanvill
5.10 Kellie, D. and Westgate, M. (2023) Improving code reproducibility: Small steps with big impacts, Research Communities by Springer Nature.
https://communities.springernature.com/posts/improving-code-reproducibility-small-steps-with-big-impacts (Accessed: 12 February 2024).
This is a short blog about reproducible research. It notes that despite the push to make data openly available, most studies are still not reproducible. Although tools have been developed to improve reproducibility, the authors argue that, given an academic’s busy schedule, it is not realistic to expect everyone to put in the time and effort to learn them, and they find it frustrating that learning these tools is so often the solution papers offer. As creators of such tools for the Atlas of Living Australia, the authors have learnt much about code reproducibility and share three achievable steps to increase it: have a shareable work environment (i.e. use code and repositories); ensure your code and notes are understandable; and lastly, save your code (including wrangling steps) as a document, preferably with Quarto. The last step helps with sharing and efficiency, and leaves a record if the code breaks.
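A minimal sketch of what that last step might look like as a Quarto document (the file contents are illustrative, not taken from the blog):

````
---
title: "Data wrangling steps"
format: html
---

```{r}
library(dplyr)

# Keeping the wrangling steps in the rendered document means others can re-run them
starwars |>
  filter(!is.na(height)) |>
  summarise(mean_height = mean(height))
```
````

Rendering the file (for example with `quarto render`) produces a shareable report that keeps the code, output and notes together.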
This blog raises the relevant issue of not knowing where to start with improving reproducibility. There are many suggestions for before, during and after data collection, many relying on complex tools. This can quickly become overwhelming, causing one to avoid reproducibility altogether. To avoid this, three straightforward steps within the coding process are presented, offering a practical starting point for enhancing reproducibility. Additionally, the blog being published in Research Communities, a forum for people interested in research, enhances its importance: raising awareness among research-orientated people will help to create a culture of reproducibility.
Contributor: Lauren Schoeman
5.11 Powers et al. (2019) “Open science, reproducibility and transparency in ecology”
https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/eap.1822
Powers et al. (2019) outline “open science” as the best standard both for individual researchers and for broader science. They define open science as the sharing of code, data and metadata together with continuous public communication, which roughly coincides with the definition of reproducible research in other papers (e.g. Peng 2011), although Powers et al. (2019) stress the use of social media and online forums. This allows “peer review on the fly”, they argue, which makes the final study more rigorous than it would otherwise be. They note the difficulties of reproducing ecological studies and the hesitancy to publicise the locations of threatened species, but argue that the benefits outweigh the costs. The general benefits are: faster collaboration and sharing of ideas, avoiding time-wasting exercises, and building public trust in governmental research agencies. Personal benefits include higher citation rates, and there is a strong tendency for the foremost scientists in any field to be the most prolific data sharers.
Overall, Powers et al. (2019) is similar to other papers on the subject, although they are the first I have seen to emphasize public feedback during the research. This has potential to be very useful but premature exposure to the public runs the risk of circumventing the peer review process. Care should be taken to ensure continuous feedback is from appropriate sources.
References:
- Peng, R.D., 2011. Reproducible research in computational science. Science, 334(6060), pp.1226-1227.
- Powers, S.M. and Hampton, S.E., 2019. Open science, reproducibility, and transparency in ecology. Ecological applications, 29(1), p.e01822.
Contributor: Ben W
5.12 Zuduo Zheng (2021) “Reasons, challenges, and some tools for doing reproducible transportation research”
https://doi.org/10.1016/j.commtr.2021.100004
This paper outlines the context for reproducible research and many of the challenges we have already discussed. However, it also outlines some key benefits to both the researcher and the readers. Zheng states that making research reproducible can elevate your reputation as an academic and often comes with more citations. Once you are in the habit of working reproducibly, it also helps you work efficiently while reducing the probability of making critical mistakes. The benefits to readers are that they can trust the research and easily reproduce the analysis, which creates a more productive workflow.
Contributor: Mia S
5.13 TED Talk about Reproducibility Crisis
For those who grasp concepts better with videos - this TED Talk goes over the Baker (2016) paper and the reproducibility crisis.
https://www.youtube.com/watch?v=FpCrY7x5nEE&t=190s&ab_channel=TED-Ed
Contributor: Emm
5.14 S. M. Powers and S. E. Hampton (2019) “Open science, reproducibility, and transparency in ecology”. Ecological Applications 29(1): e01822. DOI: 10.1002/eap.1822
The paper discusses the challenge ecologists face in replicating observational studies due to the ever-changing natural world and evolving perceptions. While perfect replication may be unattainable, achieving computational reproducibility—generating equivalent analytical outcomes using the same data and code—promotes transparency, crucial for maintaining public trust. Open data facilitates transparency and reproducibility, though rapid scientific progress is often the primary incentive for open science. Making data and code openly accessible allows scientists to build upon previous research and address new questions. The adoption of open research practices is increasing, often driven by ethical, moral, or practical considerations. It’s anticipated that more journals will encourage or mandate code archiving in their data policies. The paper emphasises the diminishing excuses for not sharing data and code in publicly accessible archives. As more ecologists engage in open science, tools supporting this transition, such as data management, analysis, and visualisation methods, are proliferating and becoming central to ecological culture.
Contributor: Sarah
5.15 Ellison (2010) “Repeatability and transparency in Ecological research”
A fundamental principle in science is that results must be reproducible before they are accepted as fact. However, the context-dependent nature of ecology makes perfect replication virtually impossible. One way to make ecological studies easier to reproduce is through the assembly of derived data sets and their subsequent analysis, known as ecological synthesis. The key to ecological synthesis is transparency and full disclosure of all aspects of the study. However, despite mandates from funding agencies that data be made publicly available, raw data are not always easily accessed. Fortunately, data repositories are growing, and with descriptive metadata (a method for describing ecological data) data are made easier to interpret and re-use. Once obtained, a data set will require a certain degree of transformation, after which it is ready for analysis. It is essential that researchers clearly document the steps taken to derive the data and record all statistical tools used on the data to ensure reproducibility. Such a record should be a requirement for publication of any meta-analysis. Unfortunately, there are no standards for process metadata, and open-source software tools, while attractive, are constantly evolving. Moving forward, it appears that more ecologists are recognising that sharing data benefits the entire scientific enterprise and results in successful collaboration. This has resulted in the rapid development of data-sharing tools.
https://www.jstor.org/stable/27860827
Contributor: Cailin
5.16 Meng, X. 2020. Reproducibility, Replicability and Reliability
This article highlights how reproducibility outweighs replicability in terms of the reliability of reported results. The author also draws attention to the lack of professional communities focusing on reproducibility verification. The author goes on to argue that reproducibility must adhere to strict rules, while replicability has more flexibility due to inherent uncertainties and possible confounding factors. Meng (2020) also explains why reproducibility is more desirable than replicability, in the sense that replicability does not focus on improving upon previous studies, while reproducibility allows for permutations and future improvements.
Contributor: Shane
5.17 White et al. (2015) “Reproducible research in the study of biological coloration”
https://doi.org/10.1016/j.anbehav.2015.05.007
According to the paper, a study is considered truly reproducible when methods are completely reported, data are archived and publicly available, and every modification of raw data is documented and preserved. In the field of biological coloration, the direct measurement of reflectance through spectrometry has been adopted as the standard; however, digital photography to quantify color is increasingly gaining popularity. The paper’s guidelines for reporting color analyses cover the methodological details of spectral analyses and visual modelling, which should be included in metadata and are essential for reproducibility. To assess reporting and reproducibility in the field of biological coloration, the authors searched through journals using phrases such as ‘color’ or ‘spectra’. They also recorded whether the data and/or code were openly available, and this information, along with their analysis script, was stored on GitHub. Their survey found inconsistency and incompleteness in commonly reported methodological details; some papers failed to report details such as measurement distance and pixel numbers. They also found that 38% of papers referred to previous work for details, which were often incomplete as well, and therefore suggested reporting all details within the current manuscript. Of the studies analysed, 1.7% openly provided raw data, 31.7% provided processed data, and 66.7% did not provide data. Overall, there is evidence that this trend is shifting as funders and publishers mandate data release.
Contributor: MuanoR
5.18 Voelkl et al. (2020) Reproducibility of animal research in light of biological variation
DOI: 10.1038/s41583-020-0313-3
This paper defines reproducibility as “The ability to produce similar results by an independent replicate experiment using the same methodology in the same or a different laboratory.” Although this definition differs from the one we have been using, the paper’s contents deal largely with the reproducibility crisis (particularly in animal research) and offer a potential cause and solution. Scientists often conflate biological variation with noise. They therefore try to homogenise their sample group in order to increase power and thereby use fewer animals, an approach typically deemed more ethical. However, homogenisation makes it difficult for future studies to reproduce results, as variation is integral to biology. More tests would therefore have to be performed, requiring more animals in the long run. The authors therefore recommend deliberately including relevant biological variation in the sample group. It is vital that scientists record every decision and process clearly so that studies may be successfully reproduced in the future.
Contributor: Ashleigh Gouws
5.19 Reproducible Research and Public Data-Sharing in Ecology: A Q&A with LAGOS Creators
A recent study published in GigaScience introduces LAGOS (LAke multi-scaled GeOSpatial and temporal database), an open database compiling hundreds of ecological datasets. LAGOS supports macrosystems ecology research, which analyses ecological interactions at regional and continental scales. Professor Pat Soranno of Michigan State University explains the motivations behind LAGOS, emphasizing the need to study large ecological patterns, such as freshwater ecosystems’ role in global carbon cycles. A key challenge in ecology is “dark data”: valuable datasets trapped on individual researchers’ computers, where they are inaccessible. LAGOS addresses this problem by integrating diverse datasets into a reproducible, publicly available resource. By making data more accessible, LAGOS enables scientists to ask large-scale, impactful ecological questions and contributes to a growing movement of data sharing in environmental research.
Contributor: Robyn
5.20 “Wildlife Biology, Big Data, and Reproducible Research” by Keith P. Lewis, Eric Vander Wal and David A. Fifield (2018)
Historically, the field of wildlife biology has been regarded as data poor, and certain study topics, like animal movement, were not even regarded as part of mainstream ecology because of the lack of data. However, technological advancements (e.g., GPS tags and camera traps) have allowed scientists to produce larger, more accurate data sets, which allow wildlife biologists to conduct analyses over large spatial and temporal scales. This has shifted wildlife biology into the realm of “big data”. However, these changes also mean that the way in which wildlife biologists need to manage, analyse and store data must change.
The paper by Lewis et al. (2018) provides a conceptual framework for reproducible research in wildlife biology. The authors also present case studies that further emphasise the importance and advantages of using reproducible research practices in the field.
The authors conclude that the advances in the complexity of technology and studies require a simultaneous change in the way wildlife biologists conduct research and teach others to conduct research.
Reference: Lewis, K.P., Vander Wal, E., & Fifield, D.A. (2018). Wildlife Biology, Big Data, and Reproducible Research. Wildlife Society Bulletin, 42(1), 172–179. https://doi.org/10.1002/wsb.847
Contributor: Muhammad Uzair Davids
5.21 Reproducible research can still be wrong: Adopting a prevention approach (Leek & Peng, 2015)
This article looks beyond the problem of reproducible research and appeals for greater education in actual data analysis and study design. Essentially, it highlights that although reproducible research is very important for diagnosing faults in the methodology, data or analysis, it doesn’t fix the faulty thinking that created the problem. From the authors’ perspective, reproducibility is a “treatment” for “poor data analyses” (Leek & Peng, 2015), but it is not a preventative measure. With increasingly complicated data and analyses, and higher rates of paper submissions to journals, even with transparency and full reproducibility, mistakes may not be intercepted in time to prevent negative effects. The authors propose scaling up data science education through massive open online courses (MOOCs) and crowd-sourced short courses in data analysis, focused on teaching the methods that are most reproducible and replicable for the novice data analyst. Part of the challenge in implementing this is the need to identify the most efficient current statistical methods, analysis types and software for the job. The authors call this approach “evidence-based data analysis”.
Leek, J.T. & Peng, R.D. (2015). Reproducible research can still be wrong: Adopting a prevention approach. Proc. Natl. Acad. Sci. U.S.A., 112(6), 1645–1646. https://doi.org/10.1073/pnas.1421412111
Contributor: Nic van Tol
5.22 Ocean FAIR Data Services (Tanhua et al., 2019)
Well-established data management systems are crucial for ocean observing systems, ensuring that essential data are collected, preserved, and accessible for current and future users. Effective management involves collaboration across various areas, such as observations, data and metadata assembly, quality assurance/control, and publication. To promote long-term usability, ocean data should follow the FAIR principles (Findable, Accessible, Interoperable, and Reusable), which help transition unstructured data into organized, standardized datasets. As ocean observing systems evolve, new autonomous platforms collect more essential ocean variables, necessitating modern data infrastructures to automate the data life cycle. However, several challenges exist in implementing FAIR data management, including the diversity of oceanographic data; the wide range of different data management structures; high data volumes; new sensor formats; formats that are not universally applicable; gaps between data producers and end-users; slow development of common protocols; and poorly defined best practices. The paper discusses solutions to these challenges, enabling scientists and developers to download data and apply traditional analysis techniques, as well as utilize modern tools to transform, manipulate, visualize, and innovate with data. The solutions promote reusability by making sure data can be understood by those who did not produce them. It is essential to explain the importance and implementation of the FAIR data principles to early-career marine scientists; this understanding will help ensure that these principles and effective data management practices become second nature and an integral part of ocean observations.
Link: https://doi.org/10.3389/fmars.2019.00440
Contributor: Kayla Heuer
5.23 The Tao of open science for ecology: a path towards effective data management and reproducibility
Hampton et al. (2015) logically and enthusiastically advocate a shift in mindset and practice towards ‘open science’ in ecology. They propose a series of targeted approaches based on data stewardship, transparency in the data life cycle, using open-source tools, and clearly documenting the scientific process of discovery. In the context of data management and reproducibility, transparency allows scientists to standardize code, data and metadata storage and archiving throughout the data life cycle. Clear documentation throughout the scientific process enables reproducibility and replication, and creates space and material for discussion, constructive feedback, and collaboration. Ultimately, the Tao of open science advocates practices, tools and methods of scientific inquiry that depart from the “exclusive silos of traditional journal archives” and instead create open-access, easily machine-readable, logically structured and standardized repositories of data, publications and code. This shift will benefit current and future scientists pushing the frontiers of discovery, as it encourages, and even demands, careful and consistent critical thinking, organization and documentation across every scientific endeavour.
Link: https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/ES14-00402.1
Contributor: Arwenn