Code Management Guidelines

R and GitHub Starter Kit for New Team Members

Author

Florencia D’Andrea

Published

December 23, 2024

GitHub Organization: www.github.com/StringhiniLab

Goal

The goal of this manual is to provide the minimum necessary guidelines for new members of Dr. Silvia Stringhini’s lab to follow agreed-upon practices in code management.

Introduction

The use of programming languages has become an essential part of data analysis for most researchers today. In this context, a basic skill set in computer science is key to ensuring reliable and reproducible results (Wilson et al. 2017; Hicks 2023; Abdill et al. 2024). Although a variety of educational materials, tutorials, and recommended practices specifically designed to train researchers are available (The Carpentries; Our Coding Club; The Turing Way Community; CodeRefinery Project; Sherman Center Workshops 2024), there is a trade-off: adopting and practicing these techniques often requires significant effort, taking time away from researchers’ primary fields of study (Allen and Mehler 2019; Goldsmith et al. 2021; Hicks 2023).

One consequence of this training gap is that researchers may be unsure whether their code is written correctly, which reduces their willingness to share their analyses (Gomes et al. 2022). As a result, fewer publications make their code available, which undermines the reproducibility and transparency of scientific research (Gomes et al. 2022; Sharma et al. 2024). The issue is exacerbated by the lack of incentives in the scientific system: many authors do not share their code, despite the benefits of making it open source (Allen and Mehler 2019; Melvin et al. 2022; Bertram et al. 2023; Tazare et al. 2024; Xu et al. 2025).

Encouraging researchers to actively adopt best practices and seek training in computational tools that facilitate or enhance their work is desirable and should be promoted. However, leaving code management decisions entirely in their hands could have negative consequences for a research group.

Ten reasons to implement code management practices early in a research group

Would the problem be solved if future new members of the lab arrived with better training in data science? No. We believe the research group should still define its priorities when it comes to managing code.

There are several benefits to defining clear minimum guidelines and basic computational skills from the moment new members join the lab:

  1. Set a solid foundation to avoid messy projects.
    Define the file formats to be used and establish a basic file structure to ensure reproducibility from the project’s inception. Additionally, outline how the data will be managed and integrated into the analysis.

  2. Define a consistent set of practices from all the different schools of thought.
    Educational materials and training tutorials present various management practices, and researchers from different backgrounds may adopt different approaches. Providing clear guidelines therefore ensures consistent management practices across projects.

  3. Focus on domain-specific skills first.
    Identifying domain-specific computational skills can save time for new researchers. This knowledge is sometimes shared in publications tailored to each discipline but is too specific to be addressed by general training courses and tutorials for scientists.

  4. Early peer review.
    In this manual, we suggest creating private repositories that are visible only to team members. Sharing analyses within these private repositories allows for valuable feedback. This practice could help researchers gain confidence in making their code publicly accessible once published, and it lets team members benefit from unpublished analyses conducted in the lab.

  5. Standardize documentation practices.
    For example, there could be a README template that all researchers use, making it easy to understand what can be found in a repository. This saves time, facilitates access to materials for all team members, increases project reproducibility, and makes it easier to identify repositories with older analyses.

  6. Optimize time management.
    Researchers often take a workshop on a computational tool only at an advanced stage of a project, after decisions about code organization, documentation, and file structure have already been made. Making those decisions well from the beginning saves valuable time.

  7. Maintain the group’s research history.
    Centralizing data analyses on a repository hosting platform, such as a GitHub Organization, creates a historical archive of the group’s data analyses. This ensures continuity, because the group no longer depends on researchers leaving behind their code and data when they move on.

  8. Facilitate the exchange of ideas about data and code management among team members.
    Creating guidelines helps build a body of knowledge that can be improved over time with contributions from students and researchers, allowing for discussions on which practices should be added, prioritized, or removed.

  9. Make informed decisions about what to learn next.
    A researcher might hear that they need to learn Git but have no idea what this tool is for. A brief introduction to Git and clear guidance on where to begin make it easier to assess whether learning additional skills will be useful. Supporting new members in adopting basic computational techniques from the beginning lowers the barrier for researchers to explore other tools early.

  10. Adopt open science practices.
    If the group embraces open science, adopting these practices early will ensure that a high percentage of the code generated remains open source.

These ten reasons can serve as a starting point for opening a discussion on how to approach these topics within the research group. Leaders do not need to be experts in software development. Guiding principal investigators to select the essential tools and practices maximizes the benefits of making key decisions for the team without requiring large investments in learning.

At the same time, the existence of a research group manual allows younger researchers to share, propose, and contribute improvements on how the code is managed based on their expertise in the research area and the training they will receive. Eventually, the manual should include the criteria for publishing code and how to recognize the need to create a software package that can be used in the lab to facilitate the group’s work.

Finally, beyond these ten reasons, there is an additional benefit: demonstrating how software will be maintained throughout the project lifecycle strengthens the case for long-term sustainability. This transparency can encourage funding agencies to invest in similar future projects.

How to cite this manual?

D’Andrea, F., and Silvia Stringhini. Code Management Guidelines: R and GitHub Starter Kit for New Team Members (v1.0.0). Zenodo, 2025. https://stringhinilab.github.io/GitHubProceduresLab/. https://doi.org/10.5281/zenodo.14775421

Acknowledgments

Thanks to Kelvin Lee for the time and thoughtful feedback. The insights and suggestions provided have improved the quality of this manual.

References

Abdill, Richard, Emma Talarico, Laura Grieneisen, et al. 2024. “A How-to Guide for Code Sharing in Biology.” PLoS Biology 22 (9): e3002815. https://doi.org/10.1371/journal.pbio.3002815.
Allen, Christopher, and David MA Mehler. 2019. “Open Science Challenges, Benefits and Tips in Early Career and Beyond.” PLoS Biology 17 (5): e3000246. https://doi.org/10.1371/journal.pbio.3000246.
Bertram, Michael G, Josefin Sundin, Dominique G Roche, Alfredo Sánchez-Tójar, Eli SJ Thoré, and Tomas Brodin. 2023. “Open Science.” Current Biology 33 (15): R792–97. https://doi.org/10.1016/j.cub.2023.05.036.
CodeRefinery Project. “CodeRefinery Lessons.” https://coderefinery.org/lessons/.
Goldsmith, Jeff, Yifei Sun, Linda Fried, Jeannette Wing, Gary W Miller, and Kiros Berhane. 2021. “The Emergence and Future of Public Health Data Science.” Public Health Reviews 42: 1604023. https://doi.org/10.3389/phrs.2021.1604023.
Gomes, Dylan GE, Patrice Pottier, Robert Crystal-Ornelas, Emma J Hudgins, Vivienne Foroughirad, Luna L Sánchez-Reyes, Rachel Turba, et al. 2022. “Why Don’t We Share Data and Code? Perceived Barriers and Benefits to Public Archiving Practices.” Proceedings of the Royal Society B 289 (1987): 20221113. https://doi.org/10.1098/rspb.2022.1113.
Hicks, Daniel J. 2023. “Open Science, the Replication Crisis, and Environmental Public Health.” Accountability in Research 30 (1): 34–62. https://doi.org/10.1080/08989621.2021.1962713.
Melvin, Ryan L, Steven J Barker, Joe Kiani, and Dan E Berkowitz. 2022. “Pro-Con Debate: Should Code Sharing Be Mandatory for Publication?” Anesthesia & Analgesia 135 (2): 241–45. https://doi.org/10.1213/ANE.0000000000005848.
Our Coding Club. “Setting up a GitHub Repository for Your Lab - Version Control and Code Management with GitHub.” https://ourcodingclub.github.io/tutorials/git-for-labs/.
Sharma, Nitesh Kumar, Ram Ayyala, Dhrithi Deshpande, Yesha Patel, Viorel Munteanu, Dumitru Ciorba, Viorel Bostan, et al. 2024. “Analytical Code Sharing Practices in Biomedical Research.” PeerJ Computer Science 10: e2066. https://doi.org/10.1101/2023.07.31.551384.
Sherman Center Workshops. 2024. “Best Practices for Managing Your Code and Scripts You Use to Generate Your Research.” https://learn.scds.ca/dr23-24/code-best-practices.html.
Tazare, John, Shirley V Wang, Rosa Gini, Daniel Prieto-Alhambra, Peter Arlett, Daniel R Morales Leaver, Caroline Morton, et al. 2024. “Sharing Is Caring? International Society for Pharmacoepidemiology Review and Recommendations for Sharing Programming Code.” Pharmacoepidemiology and Drug Safety 33 (9): e5856. https://doi.org/10.1002/pds.5856.
The Carpentries. “The Carpentries Teaches Foundational Coding and Data Science Skills to Researchers Worldwide.” https://carpentries.org/.
The Turing Way Community. “The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research.” Zenodo. https://doi.org/10.5281/zenodo.7625728.
Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal. 2017. “Good Enough Practices in Scientific Computing.” PLOS Computational Biology 13 (6): 1–20. https://doi.org/10.1371/journal.pcbi.1005510.
Xu, Edward, Anna Catharina V. Armond, David Moher, and Kelly Cobey. 2025. “Key Challenges in Epidemiology: Embracing Open Science.” Journal of Clinical Epidemiology 178: 111618. https://doi.org/10.1016/j.jclinepi.2024.111618.