Lesson 1 Introduction to Git and GitHub

1.1 Learning goals

In this lesson, you will learn how to keep track of your file changes using Git. After this lesson:

  • you know the different steps of the Git workflow (commit, push and pull);
  • you know how to set up a new Git repository;
  • you can keep track of your file changes using Git, RStudio and GitHub;
  • you can instruct git to ignore certain files for version control (gitignore);
  • you can create a personal development environment (branch);
  • and you can add your personal changes to the work of the group (pull request).

1.2 Prerequisites

Before you continue with this lesson, make sure you have done the following:

1.3 Before you start!

A reminder of the exercise and assignment directives.

Within the lesson you will encounter a number of exercises.

Please answer all the assignments and exercises in an RMarkdown document. You learned how to make one in DAUR1. Save it somewhere in your course project.

Most lessons have portfolio assignments! Save those for now as separate .Rmd files in a portfolio-project.

So, you will have at least 3 R projects (file –> new project…) going on this semester: one for all the regular exercises in the lessons. One for your portfolio. A third one for projecticum.

We suggest to call them:

  1. dsfb2_workflows_exercises
  2. dsfb2_workflows_portfolio
  3. dsfb2_research_project

Start by setting these projects up in your RStudio environment (server or local)

Once you are done doing this, you are ready to start this course!

Note: There will be a portfolio exercise at the end of each lesson. For each portfolio exercise, make a new .Rmd file within your _portfolio project.

1.4 Why Git and GitHub?

To explain why Git and GitHub are so important, let us look at an example. Suppose you are analyzing some data for your internship using R. You have written an R script to perform the data analysis and the script is working. However, one day you decide to make some changes to the script in order to make the computations a bit faster. You work the whole day on the code and at the end of the day you try if the code is working as expected. Unfortunately, it is not working as expected. But the code was working before. It would be great if you could go back to the previous version of the code. Or if you could easily see the changes you made, to find out why your script stopped working.

The example shows that it is important to keep track of your changes. Of course you could have made your changes in a copy of the original script. But in this way, you would end up with a lot of files on your computer. In addition, it is not easy to see what you changed. This is why we use version control software: software that keeps track of file changes and allows you to inspect the changes and to go back to previous versions of the files.

Git is the most popular version control software used today. It is designed to track changes of text files (files you can open in a simple text editor, such as R scripts and markdown files). Git can be easily used to track the changes of files that are on your computer.

However, you often want to work together with others (for example, fellow students or colleagues) on the same scripts. GitHub is a web-based platform that was created to promote collaboration. GitHub is based on the Git software. The platform allows you to create an online copy of your files and the Git version history. This copy can then be shared with others, who can then also contribute to the code themselves. You can use the online copy on GitHub also for yourself, for example as a backup of your code in case your computer crashes or you accidentally remove some folders on your computer.

1.5 github video

Some people are way better at explaining stuff than at visual animations. Watch the first 5 1/2 minute of this video

1.6 The Git workflow

You are hopefully convinced that Git and GitHub are the way to go. But how do you start? How can you use Git and GitHub to keep track of the file changes in your scripts? In this section, you will learn the different steps of the Git workflow. We will focus on the concepts. In part two of this lesson, we will give you a detailed protocol for how to do things by yourself.

1.6.1 Setting up a Git repository

Before you can use Git and GitHub to track your file changes, you need to set up your own Git repository. A Git repository is a folder in which the version control takes place. That means that all the changes of files that are located in that folder (and its subfolders) are tracked. It is best practice to initiate the Git repository on GitHub and then download it to your computer, as will be explained in part two of this lesson. This results in a folder on your computer. All files that you place in this folder will be under version control by Git.

Just as you can have multiple folders on your computer, you can also have multiple Git repositories on your computer (and on GitHub). It is recommended to create a separate Git repository for each project you are doing. For example, if you create scripts and markdown files for your internship, you create a Git repository that only contains the work of this internship. If you performed several different research projects in your internship, you can create separate Git repositories for each research project.

1.6.2 Saving a new version of your files (commit)

Once you have placed files in the Git repository folder, you can start to track the file changes. It is important to note that Git does not automatically track the file changes. If you do not tell Git to save the version history, it will not do anything! Hence, you need to actively tell to Git to save the file changes. The action of saving a new version of your files in the Git repository’s file history is called a commit. It is recommended to regularly commit your changes, preferably several times a day.

1.6.3 Updating the GitHub repository (push)

When you commit your changes, the Git repository on your computer will be updated: a new version of your files will be added to the version history of the Git repository on your computer. However, the online copy of your Git repository on GitHub does not get updated, unless you tell Git to do so. The process of updating your Git repository on GitHub based on local changes (new versions on your computer) is called pushing (you ‘push’ the changes from your computer to GitHub). You can push your changes each time you commit your changes, but it is also possible to push all the commits at once at the end of the day. However, it is recommended to push your changes at least once per day, to keep the copy of your Git repository on GitHub up to date.

1.6.4 Updating your local Git repository based on changes in GitHub (pull)

If you work together with other persons, it may be the case that the Git repository on GitHub has been changed by your collaborators/colleagues. These commits are then present in the Git repository on GitHub, but not in your local Git repository. In other words, the Git repository on your computer is lagging behind the Git repository on GitHub. In this case, you want to update your Git repository on your computer based on the changes on GitHub. This update process is the reverse of pushing, hence it is called pulling (you ‘pull’ the changes from GitHub to your computer).

Pulling is only important if other persons contributed to the repository. In part two of this lesson, we will focus on the simple situation were you work alone on your files and you are the only person contributing to the Git repository on GitHub. In part three, we will explore the best practices for working together on the same files.

1.7 Git and RStudio

In this section, we will give you a detailed protocol for working with Git using RStudio. This protocol assumes that you are the only person that contributes to the Git repository. If you work together with other persons, you will need additional tools than the ones presented here. You will learn more about these tools in the next lessons.

1.7.1 Creating a new Git repository

Before you can track your file changes using Git, you first need to set up your Git repository. It is best practice to create your Git repository on GitHub and then download a copy of the Git repository to your computer. We explain these two steps in detail below.

Step 1: Creating a new Git repository on GitHub

To set up your Git repository, you go to the GitHub login page and log in with your credentials. You are then taken to the home page. On the left side, you can find a panel with your repositories (if you have any). Here, you can click on ‘New’ to create a new repository (see figure below, top panel).

Once you click on ‘New’, you will be taken to a menu were you can specify the details of your new Git repository. The menu consists of the following parts (see also figure below, bottom panel):

  1. Repository name field: in the repository name field you can specify the name for your new repository. GitHub will automatically check if another repository with the same name already exists.
  2. Public or Private: when you create a new repository, you have to indicate if you want to create a public or a private repository. A public repository can be found and downloaded by anyone on the internet (although only specified persons can contribute to the Git repository). A private repository is only accessible to persons specified by you (the owner). Although you promote reproducible research by creating a public repository, this not always possible. Especially if you are employed (or doing an internship!) at a company or other agency, your employer often does not allow you to make your files public. Always check with your employer (or supervisor) to see what is allowed. Choose a private repository if you are not sure!
  3. Initialization files: when you create a repository, you can create some ‘initialization files’. These files include a README file, a gitignore file and a license:
  1. README file: a README file is a file that contains information about the repository. This file should be read by everyone that is planning to use the files in the repository. It usually contains information about the content of the Git repository. You can create the README file when creating the Git repository, but you can also do it later.
  2. gitignore file: we will explain the purpose of the gitignore file in detail below. In short, the gitignore file contains all the file names of files that should not be tracked by Git. That means that you cannot commit changes to files that are in the gitignore file. On GitHub, you can easily create a gitignore file that is specific for the programming language you are using. It is therefore recommended to create a gitignore file (for example for R) when creating the Git repository.
  3. License: when you create a public repository, it is important to add a license to your repository. The license states what others can do with the files in the Git repository. For example, with an MIT license, you allow others to use your file for any purpose they like (even for commercial applications). Similarly, if you want to use the files of others, it is important to check their license to see what you can and cannot do. Please check the resources for more information about licenses.
Creating a new Git repository on GitHub. See the main text for the details.

Figure 1.1: Creating a new Git repository on GitHub. See the main text for the details.

Let’s try the git workflow on github first.

Follow the tutorial here starting from “commit your first change”.

Step 2: Downloading the Git repository to your computer (cloning)

Once you have created a Git repository on GitHub, you can download the repository to your computer. Downloading a Git repository is called cloning. The steps for cloning a Git repository are as follows:

  1. Copy the GitHub link to the new Git repository (see figure below, top panel).
  2. In RStudio, go to File > New Project…
  3. In the Project menu, select ‘Version Control’, then ‘Git’ and subsequently enter the details of the Git repository (see figure below, bottom panel). Also select the folder in which you want the repository to be located.
  4. Click ‘Create Project’.

RStudio will now create a new R Project and clone the Git repository from GitHub.

As you learned in the DAUR1 course, it is best practice to use R Projects if you are working with RStudio. Using the above steps, you can easily couple the R Project to the Git repository. All files in the R Project folder will now be tracked for version control. These files can be R scripts, (R)markdown files, but also (among others) bash scripts.

Downloading the Git repository to your computer using RStudio. See the main text for the details.

Figure 1.2: Downloading the Git repository to your computer using RStudio. See the main text for the details.

1.7.2 How to work with Git in RStudio

OK, you are ready to go! You have created your first Git repository and you can now start to add files and track their changes. In this part, we will show you how to commit your changes and how to push your changes to the Git repository on GitHub.

1.7.2.1 Git personal access token

Github uses personal access tokens. You will have to generate one on github, and tell Rstudio / git on your computer what is is.

Copy the following to the console:

install.packages("gitcreds")
library(gitcreds)
gitcreds_set()

Rstudio asks for a github token. Let it wait for a bit. Go to your browser. Generate your token by following these steps.

Copy your token (you only get to see it once!) and paste it in Rstudio (who was still waiting for your reply).

You will probably have to give Rstudio your github name and password later as well. We’ll get to that.

Getting started with Git in RStudio

We will start by creating a new R script and saving it in the directory of the Git repository. In the right upper corner of RStudio, you can find a submenu called ‘Git’. If you open this menu, you can see the changes you have made to the repository (see figure below). These changes may be new files that you just added to the repository, but they also can be changes to the content of pre-existing files.

Git menu in RStudio.

Figure 1.3: Git menu in RStudio.

In this Git menu, you can perform all your Git actions, including committing your changes, pushing your changes to GitHub or pulling GitHub changes to your computer. We will now discuss these steps in detail.

Saving a new version of your file (commit)

You just created a new R script. In addition, by cloning your Git repository from GitHub, you created a new R Project file. These changes need to be stored in the Git repository. In other words, you need to commit the changes to the Git repository. You can do this by clicking on the ‘Commit’ button in the Git menu. A new window will then open. In this window, you can select the changes you want to commit to the Git repository (see figure below, left upper corner). Note that by selecting a file, you can see the changes in the bottom panel of the window (in which lines in green are new/added lines and lines in red are old/removed lines). After selecting the files you want to commit, you can write a commit message in the upper right panel. This commit message is necessary to later on easily distinguish the different commits in GitHub. It is therefore important to write commit messages that clearly indicate the changes you made in the commit. Finally, you can commit your changes by pressing the ‘Commit’ button. You can now close the commit window (also the other window that pops up after pressing the ‘Commit’ button).

Committing changes in RStudio.

Figure 1.4: Committing changes in RStudio.

Updating the GitHub repository (push)

After committing your changes, you have to update the GitHub repository. You can do this by pushing the recent commit(s) to GitHub using the ‘Push’ button (with the green, upward pointing arrow) in the Git menu.

The first time you push commits to GitHub from a repository in RStudio, you will be prompted for your user name and password. Likely, Rstudio will save your credentials, so that you will not have to enter them again. If it doesn’t, type the following in the terminal:

git config –global user.email “

git config –global user.name “Your Name”

After pushing your changes, you can see the new/changed files in GitHub (you may need to refresh your page to see the changes). GitHub also offers a nice interface to see the exact file changes for each commit. In the main menu of your repository, you can click on the ‘commits’ link and subsequently select your most recent commit. You will then see all file changes that were committed during that commit (see figure below).

Check out your commits in GitHub.

Figure 1.5: Check out your commits in GitHub.

Ignore files for version control (gitignore)

Some file changes do not have to be tracked by Git. These include file changes to files that are generated automatically by R (such as .RData and .Rhistory files) or other software. You can indicate to Git to ignore these files by creating a gitignore file in the Git repository. This is just a simple text file that is saved as ‘.gitignore’ in the Git repository. In this file, you store the names of the files you want to ignore. You can also ignore entire directories or files with a certain file extension (see this page for the syntax).

As indicated above, you can easily create gitignore files on GitHub when you create your Git repository. On GitHub, you can select your programming language and then create a gitignore file that ignores system files of that programming language. You can always add other files and directories to the gitignore file yourself later.

1.7.3 Using Git outside RStudio

In the text above, we explained how to use Git in combination with RStudio. However, you may also want to use Git with other programming languages, such as Python or Bash, outside RStudio. Of course this is possible. In fact, Git was designed to be used independently, as a simple command line tool. Once you install Git on your computer, you also install the Git Bash command line tool. You can use this tool in every directory of your computer to initiate Git repositories or to maintain them. Below, we will show you how you can do this and we learn you the most important commands.

Suppose you want to clone the Git repository outside RStudio in a directory on your computer. You can do this by browsing on your computer to the destination folder. In this folder, you can right click with your mouse and select ‘Git Bash Here’ (see figure below). A command line will open, which is basically a Linux command line.

Open Git Bash tool in a folder on your computer.

Figure 1.6: Open Git Bash tool in a folder on your computer.

You can now clone the Git repository from GitHub using the following command (of course replacing the link with the link of your own repository):

git clone https://github.com/basvangestel/test.git

A new folder will now be created. In this folder, the files of your Git repository are located. Now you can go to this folder and add/change files. With the following command you can see which files have been changed:

git status

You can commit the file changes with the following command:

git add --all; git commit -m 'your-message'

This command may look a bit difficult. The first part (git add --all) corresponds to selecting (all) the files in the RStudio commit window. Note that with this command you commit all file changes, unless they are ignored as indicated by the gitignore file. The second part (git commit -m 'your-message') corresponds to writing your commit message and pressing the commit button in the RStudio commit window.

Finally, you can push and pull changes using the following commands:

git push origin main
git pull origin main
Exercise 1

making a repo

  1. Create a new Git repository on GitHub. Make sure that the Git repository contains a gitignore file for R.
  2. Clone the Git repository to your computer using RStudio (hence creating an R Project on the fly).
  3. Create an RMarkdown file in the Git repository with the following content:
---
title: "My own RMarkdown file"
output: html_document
---

This is my first RMarkdown file tracked by Git!!
  1. Commit the changes to the Git repository and push the changes to GitHub. Check the commit on GitHub to inspect the file changes.
  2. Add some extra lines to the RMarkdown file and change the title. Commit the changes and push the changes to GitHub. Check the commit on GitHub to inspect the file changes.
Exercise 1

gitignore

Use the Git repository of the previous question for this question.

  1. Create a directory called ‘data’. Add a few Excel files in this directory and also some csv files (you can also just create some empty files using Excel). Check the Git menu in RStudio (but do not commit anything!).
  2. Update the gitignore file in such a way that all the files in the data directory are ignored. Check the Git menu in RStudio (refresh if necessary). What do you see?
  3. Update the gitignore file in such a way that all the Excel files in the data directory are ignored. Check the Git menu in RStudio (refresh if necessary). What do you see?
Click for the answer
  1. Add to the gitignore file a line with data/**. The files in the data directory are removed from the Git menu.
  2. Change in the gitignore file the line data/** for data/*.xls. Now only the Excel files are ignored, but the csv files appear in the window again.
Exercise 1

Clone & commit

  1. Clone the Git repository that you created on GitHub (the same one again) to a folder on your computer using the command line.
  2. Change the RMarkdown file. Check the changes using git status. What do you see?
  3. Commit your changes and push the changes to GitHub using the command line.
Exercise 1

Mess around

Try changing some stuff, committing and pushing again.

Take some time to get used to this workflow.

Click around in github and get to know the site. We will discuss branches next lesson and issues and projects the lesson after.

What could you do if github doesn’t accept your push command from Rstudio anymore?

Click for the answer

If things get so messy that you are not able to upload a changed file, just download a new clone of your repo (file –> new project –> version control –> etc). Then copy your changed file into this new folder, commit and push.

You could also check if your mess up is listed here or if you prefer, the same site without swears here.

Portfolio assignment 1.1

Start your portfolio with a CV

  1. Create a Git repository on GitHub for your portfolio.

Subsequently, clone the GitHub repository to your computer. Copy paste your portfolio-Rmd-files to this folder. You will see the new files showing up in the “git” window in the upper right of the screen in Rstudio. Commit/push the changes to your repo.

  1. Now start building your CV in a new Rmarkdown file and keep using github for version control. Remember to commit often, and push your changes! Build something that you would want to send to a possible future bioinformatics internship position, or when job hunting. So it needs to be complete and professional looking.

You can look around online for examples of cv’s (/resume). Your cv will be part of your portfolio.

Portfolio assignment 1.2

looking ahead

In class, we discussed this assignment yesterday. In the following weeks, start looking for what you will want to spend your extra time on this semester.

Try, for yourself, to answer the following questions:

  • Where do I want to be in ~2 years time?
  • How am I doing now with respect to this goal?
  • What would be the next skill to learn?

Make a planning on how to start learning this new skill.

Note the following points:

  • It has to be life sciences or useful for life sciences (no knitting, riding horses etc)
  • It has to be data science / bioinformatics
  • You have to show your plan in your portfolio, together with the result of the first few steps you took in trying to reach this goal.
  • You probably won’t learn this skill in a few days. Just get started.

This assignment will show us whether you succeeded in reaching an overview of the field and possibilities in data science for biology.


CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Unless it was borrowed (there will be a link), in which case, please use their license.