Code Management Guidelines
R and GitHub Starter Kit for New Team Members
GitHub Organization: www.github.com/StringhiniLab
Goal
The goal of this manual is to provide the minimum necessary guidelines for new members of Dr. Silvia Stringhini’s lab to follow agreed-upon practices in code management.
Introduction
The use of programming languages has become an essential part of data analysis for most researchers today. In this context, a basic skill set in computer science is key to ensuring reliable and reproducible results (Wilson et al. 2017; Hicks 2023; Abdill, Talarico, and Grieneisen 2024). Although a variety of educational materials, tutorials, and recommended practices specifically designed to train researchers are available (The Carpentries; Our Coding Club; The Turing Way Community 2023; CodeRefinery Project), there is a trade-off: adopting and practicing these techniques often requires significant effort, taking time away from researchers’ primary fields of study (Allen and Mehler 2019; Goldsmith et al. 2021; Hicks 2023).
One consequence of insufficient training is that researchers may be unsure how to write code correctly, which reduces their willingness to share their analyses (Gomes et al. 2022). This in turn decreases the number of publications with available code, undermining the reproducibility and transparency of scientific research (Gomes et al. 2022; Sharma et al. 2024). The issue is exacerbated by the lack of incentives in the scientific system: many authors still do not share their code, despite the benefits of making it open source (Allen and Mehler 2019; Melvin et al. 2022; Bertram et al. 2023; Tazare et al. 2024; Xu et al. 2025).
Encouraging researchers to actively adopt best practices and to seek training in computational tools that facilitate or enhance their work is desirable. However, leaving code management decisions entirely in their hands could have negative consequences for a research group.
Ten reasons to define code management practices from day one
Would the problem be solved if future new members of the lab arrived with better training in data science? No. We believe the research group should still define its priorities when it comes to managing code.
There are several benefits to defining clear minimum guidelines and basic computational skills from the moment new members join the lab:
Avoid messy projects from the start.
Centralizing data analyses in a GitHub Organization and setting standards for pushing code promotes better repository structure, version control, and better-documented code, ensuring reproducibility from the project’s inception.
Implement minimum documentation and project management best practices.
Defining group-level criteria for code and data management facilitates collaboration, saving time and avoiding errors.
Focus on domain-specific skills first.
Identifying domain-specific computational skills can save time for new researchers. This knowledge is sometimes shared in publications tailored to each discipline, but it is too specific to be covered by general training courses and tutorials for scientists; the only exception we know of is Data Carpentry (Data Carpentry 2024).
Early peer review.
Sharing analyses with team members in private repositories allows for valuable feedback. Although access is initially restricted, this practice builds confidence in making code publicly accessible upon publication.
Define a set of practices that should not be overlooked.
Not all researchers who take a course in Git and GitHub will make their code available unless guidelines state whether they are expected to do so and how. Without such guidelines, each researcher will adopt these practices to a varying degree.
More efficient use of time.
Researchers may take a workshop on a computational tool only at an advanced stage of a project, after decisions about code organization, documentation, and file structure have already been made. Making those decisions well from the beginning saves valuable time.
Maintain the group’s research history.
This approach helps create and standardize a historical archive of the group’s data analyses, ensuring continuity and avoiding dependence on researchers leaving behind their code and data when they move on.
Facilitate exchange of ideas about data and code management among team members.
Creating guidelines helps build a body of knowledge that can be improved over time with contributions from students/researchers, allowing for discussions on which practices should be added, prioritized and/or removed.
Make informed decisions about what to learn next.
A researcher may hear that they should learn to use GitHub. Explaining from the beginning what GitHub is and the minimum knowledge required makes it easier for them to assess whether they should focus on learning additional skills. Supporting new members of the research group in adopting basic computational techniques from the start lowers the barrier to exploring other tools early.
Adoption of open science practices.
If the group aims to begin making its research code available, these guidelines and training will help promote releasing the code as open source.
D’Andrea, F., & Stringhini, S. Code Management Guidelines: R and GitHub Starter Kit for New Team Members. https://github.com/StringhiniLab/GitHubProceduresLab. Available at: https://stringhinilab.github.io/GitHubProceduresLab/ DOI: https://doi.org/10.5281/zenodo.14510774