What are the lab’s basic guidelines for data and code usage?

Who are we, and what do we need?

Dr. Silvia Stringhini is an epidemiologist with an extensive career. She has served as the Head of the Unit of Population Epidemiology at the Geneva University Hospitals. Her main research areas include social inequalities in chronic diseases and aging, the role of health behaviors in the genesis of social health inequalities, the biological consequences of social inequalities, and the role of environmental factors in social health disparities.

Recently, she moved her lab to the School of Population and Public Health at the University of British Columbia in Canada, where she is establishing a new team. She is in the process of welcoming new students and staff, making this an ideal time to outline how her new group will manage data and code in its publications.

We discussed with Dr. Stringhini the minimal requirements regarding code management that someone joining the lab should learn. A summary of the agreements reached and how they influenced the creation of this manual can be found in Table 1. Many of these requirements focus on creating a GitHub Organization for the lab.

Definitions: Git, GitHub and GitHub Organization

Git is a tool that helps you track changes to your files, like a digital history of your work.

GitHub is an online platform that uses Git to store files and manage changes, collaborate with others, and manage your code in a centralized location. It’s like having a shared folder with built-in tools to see what’s been changed and by whom. Each project can be stored in its own repository.

A GitHub organization is a group on GitHub that allows teams to collaborate and manage repositories together.

Table 1: Book content overview. This table presents the benefits and selected topics included in the book to guide the team on each prioritized action.
Action Benefit What does this book cover?
Centralize the data analysis of the group in a GitHub organization Preserve copies of the group’s data analyses - Steps to create a GitHub account and be added to the lab’s organization
- How to create a GitHub repository
Avoid sending confidential data to GitHub Protection of sensitive data - Use of .gitignore
- Project structure recommendations including how to organize the data folder
Select R as the primary programming language and RStudio as the IDE Standardize the software used in the lab - Installation instructions
- Recommendations on learning resources and good practices
Share a copy of each lab member’s analyses in a private GitHub repository - Create an initial version of the project that is organized and minimally documented so another lab member can understand it
- Foster the habit of performing code backups
- Receive feedback from colleagues early in the project development
- Have access to analyses from other lab members that have not been published
- How to create a private GitHub repository and what information to include
- Basic information to include in the README
- Creating a GitHub team and defining procedures to manage access to private repositories
- Develop a basic workflow for everyday use of Git and GitHub.
Store the code associated with scientific publications publicly in a GitHub repository The benefits of leaving code open source Bertram et al. (2023). This section will be developed at the time of the first paper’s publication

Additionally, considering the lab’s long-term evolution, onboarding and offboarding procedures were defined.

Justification

We recognize that creating the code for a scientific publication takes time and involves numerous attempts before deciding what figures and results effectively will be published. Keeping this in mind, it was decided that each student would generate a private GitHub repository by project to maintain a backup of daily data analyses conducted in the lab. This private repository could then serve as the foundation for a public version for the final GitHub repository with the scientific article’s code. Publicly sharing the code and data management process can be more challenging for early-career researchers (Gomes et al. 2022; Tazare et al. 2024).

Maintaining this initial private repository has other benefits in addition to functioning as a backup: it allows sharing the code with other lab members (as part of a GitHub lab team), makes available analyses that may not be included in the final paper but could be relevant for another publication, helps keeping a clearer project structure from the beginning and improves the overall documentation of the project.

Characteristics specific to the research area were discussed, including handling sensitive data (Mathur and Fox 2023; The Turing Way Community 2023). As a result, practices like using .gitignore and and creating a data/ folder with raw/ and processed/ sub folders were suggested to prevent private data from being pushed to GitHub and to maintain an orderly system for storing such information within the project. Also, we created a README template designed to be accessible to non-programmers to be sure that the relevant information, as the database version in use and computational environment, is captured.

What is a .gitignore file

A .gitignore file is a special file used in Git repositories to specify which files and directories should be ignored, meaning they won’t be tracked or included in version control. This is useful for excluding temporary files, sensitive data, or files that shouldn’t be shared with others.

One of the more challenging aspects to adopt is using Git, as it has multiple utilities and a considerable learning curve. Considering this, it was decided that, in this initial stage, Git and GitHub’s primary use would be to create an online and centralized backup of the projects, share repositories among team members, and manage version control instead of focusing in collaborative tools.

Since R is the most widely used programming language in the discipline, the team decided to leverage the Git integration provided by RStudio’s Git tab for integrating students’ local work into the GitHub repositories.

What is the difference between R and RStudio?

R is a programming language and software environment used for statistical computing and data analysis.

RStudio is an integrated development environment (IDE) for R, providing a user-friendly interface with tools for writing, editing, and running R code, as well as visualizing data and managing projects.

References

Bertram, Michael G, Josefin Sundin, Dominique G Roche, Alfredo Sánchez-Tójar, Eli SJ Thoré, and Tomas Brodin. 2023. “Open Science.” Current Biology 33 (15): R792–97.
Gomes, Dylan GE, Patrice Pottier, Robert Crystal-Ornelas, Emma J Hudgins, Vivienne Foroughirad, Luna L Sánchez-Reyes, Rachel Turba, et al. 2022. “Why Don’t We Share Data and Code? Perceived Barriers and Benefits to Public Archiving Practices.” Proceedings of the Royal Society B 289 (1987): 20221113.
Mathur, Maya B, and Matthew P Fox. 2023. “Toward Open and Reproducible Epidemiology.” American Journal of Epidemiology 192 (4): 658–64. https://doi.org/10.1093/aje/kwad007.
Tazare, John, Shirley V Wang, Rosa Gini, Daniel Prieto-Alhambra, Peter Arlett, Daniel R Morales Leaver, Caroline Morton, et al. 2024. “Sharing Is Caring? International Society for Pharmacoepidemiology Review and Recommendations for Sharing Programming Code.” Pharmacoepidemiology and Drug Safety 33 (9): e5856.
The Turing Way Community. 2023. “The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research.” Zenodo. https://doi.org/10.5281/zenodo.7625728.