Lesson_02 - The genome
Learning objectives
The student is able:
- To obtain (genomic) data from http://www.ensembl.org/index.html
- To obtain (genomic) data from https://bacteria.ensembl.org/index.html
- To analyse (genome) data using the data analysis workflow in JASP
- To organise and transform data using bash
The human genome
To understand human (disease) phenotypes it is essential to understand the human genome. The long lasting molecular quest is to connect a genotype to a phenotype. For example which genes are involved in cell cycle regulation, muscle contraction, immune reaction, development and so on. In a similar way, we want to connect a genotype to a disease phenotype such as cancer or how genotypes of bacteria confer antimicrobial resistance. Mutations in the genome affecting gene function can possibly lead to cellular dysfunction and subsequently adverse health effects. Therefore it was of critical importance to sequence the human genome.
The first version of the human genome was published in 2001 and in 2022 a complete version was published. It was a bit of an disappointment that the human genome contained about 20.000 genes (compared to for example rice which has about 40.000 genes). Within the human genome, distinct functional elements can be defined (Figure 11)
Protein coding genes constitutes for about 2% of the human genome. The remaining nucleotides are so called “noncoding DNA”. In recent years it has become clear that parts of these regions are also transcribed in RNA but not translated into proteins (non-coding RNA). Furthermore the human genome contains a lot of “regulatory elements” which control gene transcription (promoters, enhancers, silencers, eRNA) and the 3D structure of the human genome (folding of the genome, which information is / becomes available)
Splicing
The process of splicing (Figure 12) dictates which exons are present in a transcript. Different transcripts from a gene translate into proteins with distinct functions. Thus, although we have about twenty thousand genes they can possibly produce a multitude of transcripts and thus proteins.
Ensembl database
The ensembl database contains information about hundreds of vertebrate genomes including the human genome. With the ensembl BioMart tool we can obtain all genes + associated features of a vertebrate genome.
We will acquire a list of genes of the human genome and start to explore (describe) genome and gene features by using the first 4 steps of the data analysis workflow:
Obtaining data:
- Go to the ensembl database (Figure 13) and select BioMart
- Within BioMart (Figure 14)
- CHOOSE DATABASE -> Ensembl Genes 107
- CHOOSE DATASET -> Human genes (GRCh38.p13)
- Click on Filters
- Open drop down menu of REGION
- Select 1-22 and X and Y (X and Y are at the end of the list) as shown in Figure 15
- Click on Attributes (left panel)
- Features
- Open drop down menu of GENE
- Select the options as shown in Figure 16
- Click on the left upper panel on Results
- Select as shown in Figure 17 (Don’t forget to check “Unique results only”)
- Click on GO
- The data file will be downloaded to your computer
- Rename the data file mart_export.txt to lesson_02_opdracht1-2_ensembl_GRCh38.txt
- upload the ensemble file to your home directory-> lesson_2
JASP
Excel is commonly used for data analysis. Excel is useful for early steps in data cleaning and organisation. Excel is also very useful as a calculator. However (big) data analysis in Excel is often very laborious and more importantly data analysis in Excel is not reproducible:
- Reorganizing data in Excel increases the chance of making mistakes
- No warnings are issued when you make a mistake, for example when copied formulas are applied to the wrong columns
- No debugging options in Excel
- Excel tends to auto format text so original descriptions get lost (for example numbers could be changed to a date!!)
- Excel is limited to about one million rows. Not suitable for big biological data sets
- Speed of computation for complex data transformations is very low and could cause computer crashes
- The steps you take in Excel are not documented. Therefore it is not clear how you reached a certain conclusion
- Lack of automation: if you have to repeat the analysis in a slightly different way on the same data or in a similar way on a different data set you have to start all over again because the steps were not documented
- Difficulty in collaboration. Simultaneous working on data analysis, commenting and sharing of data files is not possible in Excel. Different versions of Excel on windows / Mac can already give problems in data transfer
For reproducible data analysis we have to use alternative software to document all the data analysis steps which have been used to generate graphs, statistical tests and conclusions. It is important that these software alternatives are free of charge so everyone can download and repeat your analysis using the same software. Here we will make use of JASP. We will also use bash and R. JASP can be installed on your computer from the HU software center or otherwise from the JASP download page.
Data import
JASP is an open source statistical program based on the R language. Using JASP we can easily perform descriptive statistics, parametric and non-parametric tests as well as correlation and regression analysis. JASP accepts different types of data files:
- JASP files
- txt files
- csv files (comma separated values)
- tsv files (tab separated values)
- ods files (open document spreadsheet)
Ods files are from libreoffice. Libreoffice is a free alternative to microsoft office to promote open science. Within Excel you can save data files into the ods file format.
To demonstrate JASP we will continue with the ensembl human gene data file lesson_02_opdracht1-2_ensembl_GRCh38.txt.
- Open JASP
- Click on the 3 three blue lines (upper left) -> Open -> Computer -> Browse
- Browse to the folder containing the ensembl data file
Data inspection
After importing the data file it should look like Figure 18
Data filtering
Each column name starts with a pictogram indicating the type of data (nominal, ordinal or quantitative). To check which categories are present in a column with nominal data click for example on the column name “Gene type”. On top of the data a list will appear with the unique values of that column. We observe (after simple counting) that there are 38 different gene types (Figure 19).
On the left hand side it is possible to select (filter) categories of interest by un-checking categories that should not be part of the analysis (Figure 19, purple rectangle). For example we can compare the properties of protein_coding, lncRNA and miRNA gene types (in stead of all 38 different gene types). Click either on the black cross (Figure 19, black rectangle) or again on the column name (Figure 19, red rectangle) and the window will disappear.
Descriptive statistics
We can now easily calculate all kinds of descriptive statistics for this data file. Before we start the analysis we have to think about interesting research questions and our experimental setup. Our research model is the human genome. We have observed that there are 38 different gene types in the human genome. We can consider these categories as our experimental groups or experimental variable. For each group we want to know their properties (=outcome variable), for example the average number of transcripts:
“Gene type” (= experimental variable) -> human genome (model) -> “Transcript count” (= outcome variable)
- Click within the top menu on the “Descriptives” button
- Select “Gene type” (= experimental variable) and drag to the “Split” window (Figure 20)
- Select “Transcript count” (= outcome variable) and drag to the “Variables” window
- Click on “Transpose descriptive table” (for easier view of the table)
To add the median to the descriptive table:
- Click on the “Statistics” (left panel lower part)
- Select the median option (Figure 21)
In the “Results” panel we see a descriptive summary of “Transcript count” for each type of gene. The line “valid” counts how many cases are in each category of “Gene type” (Figure 22)
lesson_02_assignment
Clean up the results section to only show the following “Gene type” categories:
- protein_coding
- lncRNA
- miRNA
- How many protein_coding, lncRNA and miRNA genes are present in the human genome?
- What is the median transcript count of protein_coding, lncRNA and miRNA genes?
- What is your conclusion of (B)
- In JASP, save the file as lesson_02_opdracht1-2_ensembl_GRCh38.jasp
Creating new variables
Within JASP we can calculate a new variable based on existing variables. Here, we calculate the gene length which is the difference between the variables “Gene end (bp)” and “Gene start (bp)”.
- To add an extra column (= variable) click on the + sign next to the last column (Figure 23)
- Fill in the name of the column. In this case “Gene length” (Figure 24)
- Click on “Create Column”
- Calculate the formula by dragging the variables “Gene end (bp)” and “Gene start (bp)” in the formula box. Don’t forget to put the minus sign in between (Figure 25)
- Click on “Compute Column”
- Click on the black cross (lower right of the formula box) to close the formula window
NOTE : Within a data file you can perform many different descriptive analysis. For each Descriptive analysis you can open a new “Descriptive Statistics” panel (Figure 26). To rename each tab click on the black circle icon with pen. Write a new informative name and press enter.
lesson_02_assignment
- Click on “Descriptives” to open a new “Descriptives” tab
- Rename the tabs of assignment 1 and 2 as shown in figure 26
- Calculate the average (mean) “Gene length” of protein_coding, lncRNA and miRNA genes?
- What is your conclusion?
- Save the file in JASP and upload the file to your home folder -> lesson_2
There are more gene characteristics we can study. In Biomart it is possible to select a range of options
- Go to the ensembl biomart database as before and add the following attributes to the human genome:
- Transcript length (including UTRs and CDS)
- Gene % GC content
- Check if you have still selected chr1-22 and X and Y
- Check if you have selected 10 attributes in total (8 of the previous analysis + 2 extra)
- Download the data
- Rename the data file mart_export -> lesson_02_opdracht3-4_ensembl_GRCh38.txt
- Open the data in JASP
- Filter the data file for protein_coding, lncRNA and miRNA type genes
lesson_02_assignment
A researcher wants to calculate the average (mean) transcript length for each of the three Gene types (we already have calculated the average gene length, see assignment 2)
“Gene type” (= experimental variable) -> human genome (model) -> “Transcript length (including UTRs and CDS)” (= outcome variable)
- Open a new descriptive tab
- Name the tab “Descriptive Statistics-Lesson_02-opdracht3”
- What is the average (mean) “Transcript length (including UTRs and CDS)” of each gene type?
- Compare the values of (A) to the average “Gene length” (see assignment 2)
- What is your conclusion?
- In JASP, save the file as lesson_02_opdracht3-4_ensembl_GRCh38.jasp
Another important feature of the genome is the % GC content. For example species have different % GC content in their genomes. Also within a genome there are marked differences of GC content between the different genome elements such as genes, non_coding DNA, teleomeres and centromeres.
lesson_02_assignment
A genome researcher wants to know the average (mean) “Gene % GC content” for each chromosome:
“Chromosome/scaffold name” (= experimental variable) -> human genome (model) -> “Gene % GC content” (= outcome variable)
- Open a new descriptive tab
- Name the tab “Descriptive Statistics-Lesson_02-opdracht4a”
- Which chromosome contains the highest mean “Gene % GC content”?
A genome researcher wants to know whether “Gene % GC content” is normally distributed for protein_coding, lncRNA and miRNA genes:
“Gene type” (= experimental variable) -> human genome (model) -> “Gene % GC content” (= outcome variable)
- Open a new descriptive tab
- Name the tab “Descriptive Statistics-Lesson_02-opdracht4b”
- Are the “Gene % GC content” for each type of protein_coding, lncRNA and miRNA gene normally distributed?
HINT: Open the “Plots” menu within Descriptive Statistics and select the correct figure type
- Save file in JASP and upload the file to your home folder -> lesson_2
The bacterial genome
The bacteria.ensembl database contains information about thousands of bacteria and archaeal genomes. A bacterial genome is very differently organised compared to eukaryotic genomes.
Features of a bacterial genome:
- single circular DNA molecule (in cytosol)
- includes plasmids
- haploid
A researcher want to know how the genome of bacteria and vertebrate are structured and if there are marked differences. First, let’s calculate the gene density of the Helicobacter pylori genome and the human genome. The Helicobacter pylori is becoming increasingly resistant to antibiotics and causes a variety of (serious) health problems. The gene density is defined by the number of protein coding genes per million bp. Thus, we need to obtain two characteristics of the genomes.
- Total length of the genome
- The number of protein coding genes
Go to the bacterial ensembl database (Figure 27)
- In the “Search for a genome” box type Helicobacter pylori
- Click on any Helicobacter pylori with TaxID 210 (It doesn’t matter which one, the genome length and number of protein coding genes are more or less in the same range)
- Click on “More information and statistics” (Figure 28)
- Retrieve the number of protein coding genes and genome length (Golden Path Length)
- Obtain genome length and number of protein coding genes of the human genome (as describes for the bacterial genomes but use the vertebrate ensembl database.
lesson_02_assignment
Calculate the gene density in the Helicobacter pylori genome and the human genome?
What is your conclusion?
We observe that there is a difference in gene density between prokaryotic and eukaryotic genomes. As a researcher we want to know why? Can we explain this difference by the way these genomes are organised? Let’s further inspect the Helicobacter pylori genome using the bacterial ensembl database (see previous assignment):
- Click on GFF3 link (Figure 28)
- Click on the Helicobacter pylori link
You have downloaded data file Helicobacter_pylori_gca_000802455.ASM80245v1.49.gff3.gz. The General Feature Format (GFF) is a tab-delimited text file that holds information on genes. Notice that the files end on .gz meaning that the file is zipped.
- Transfer the file to directory lesson_2 within your home directory
Data analysis using bash
Bash contains numerous commands that can be used for data inspection, cleaning, organization and analysis. The advantages using bash are:
- With (a few) bash commands we can inspect, clean, organize and analyse data files
- Bash commands can be saved in a script for reuse and reproduction
- Data analysis workflow can be automated for multiple input files
Bash: tab-completion
The name of our data file is Helicobacter_pylori_gca_000802455.ASM80245v1.49.gff3.gz. This is a very long and complicated name. It will take a lot of time to type this file name in the terminal and probably we will make a mistake. But bash has an elegant solution for this problem (bash has a solution for everything!)
Bash has a useful feature namely tab-completion. If you press TAB after you have written the first 1 or 2 letters of a folder, file or function, Linux automatically fills out the name. If the name is not unique, Linux will show all names that start with the entered letters. For example, if you enter “le” and then press TAB, you get “lesson”. If you then press TAB again you will see all names that start with lesson:”lesson_1, lesson_2, lesson_3 and so on. This way you will see all folders that start with “lesson”. Add _1 to “lesson” (lesson_1) and press enter. This way you don’t have to write out the whole name in the terminal but only a few letters supplemented with the TAB function. It is important that you understand this and can apply it. In the remainder of this course we will work with long file names and it is inconvenient to type them in completely each time. It is also error prone. The Tab Complement automatically searches for the correct files.
NOTE: If tab-completion doesn’t work, it means that either the file doesn’t exist or you’re working in the wrong directory!
Data inspection using bash
Before we can clean and organize the data we have to know how the data is organized. The first step of the data analysis workflow is data inspection.
Bash: gzip
- Go to directory lesson_2 (with the
cd
command)
Most genomic data are zipped to save space. A zipped file has a .gz extension. Before we can inspect and clean the data file we have to unzip the file. To unzip a data file we use the gzip
command:
- To unzip a file we use
gzip
in combination with the -d option ( = decompress)
gzip -d Helicobacter_pylori_gca_000802455.ASM80245v1.49.gff3.gz
NOTE: use tab-completion to fill out the file name of Helicobacter_pylori_gca_000802455.ASM80245v1.49.gff3.gz. Just enter the first 2-3 letters of the file name and press tab. The rest of the file name will be filled out automatically. Be sure to be in the right folder.
NOTET: if we want to zip a file we use gzip
in combination with the -9 option ( = compress better)
gzip -9 Helicobacter_pylori_gca_000802455.ASM80245v1.49.gff3
Bash: rename a file with mv
To rename a file using the bash terminal we will use the mv
command. The mv
command is used to move a file to another directory. But it can also be used to rename a file by adding a new file name
mv old_filename new_filename
We will rename file Helicobacter_pylori_gca_000802455.ASM80245v1.49.gff3.gz into lesson_02_opdracht5_Hpylori.gff3
mv Helicobacter_pylori_gca_000802455.ASM80245v1.49.gff3 lesson_02_opdracht5_Hpylori.gff3
Bash: word count
How do we inspect a data file on the server? There are a couple of bash commands which are very handy for data inspection
To count the number of lines of a file we use the wc
command (word count). The wc
command counts the number of characters, words and lines of a file. To count the number of lines we use the -l (lines) option:
wc -l lesson_02_opdracht5_Hpylori.gff3
lesson_02_assignment
How many lines are present in data file lesson_02_opdracht5_Hpylori.gff3
Bash: head and tail
The head
command shows by default the first 10 lines of a file:
head lesson_02_opdracht5_Hpylori.gff3
If we want to change the number of lines shown we can add the -n option followed by a number. To show the first 50 lines of a data file:
head -n50 lesson_02_opdracht5_Hpylori.gff3
The tail
command shows by default the last 10 lines of a file. Similar as with the head
command we can use the -n option to change the default setting. To show the last 100 lines of a file:
tail -n100 head lesson_02_opdracht5_Hpylori.gff3
bash: combining commands in a pipeline (part 1)
It is also possible to combine bash commands.
The output of the first command serves as input for the second command
data -> command1 -> output1 -> command2 -> output2
The sequential use of bash commands is called a bash pipe or pipeline and is very common in bash programming. To make a bash pipeline we use the pipe symbol also known as the vertical line (|)
Generally to make a bash pipeline use the following syntax
command1 -options file | command2 -options
To exactly show line 76 of data file lesson_02_opdracht5_Hpylori.gff3 we can first select the first 76 lines with the head
command and use that as an input for the tail
command to only show the last line.
head -n76 lesson_02_opdracht5_Hpylori.gff3 | tail -n1
lesson_02_assignment
What is the bash code to show line 76 and 79 of the data file lesson_02_opdracht5_Hpylori.gff3
lesson_02_assignment
Line 76 to 79 shows the start end end coordinates of the mRNA, exon and CDS of a bacterial gene. Why is there no difference in length between “gene”, “mRNA”, “exon” and “CDS“?
Bash: less
If we want to scroll through a file we can use the less
command. The -N option numbers the lines while scrolling
less -N lesson_02_opdracht5_Hpylori.gff3
- If you press enter the file will move one line
- If you press the space bar or page up/down all lines shown in the window will move
- To leave the
less
command press q
lesson_02_assignment
Use the less
command to scroll down the Helicobacter_pylori data file. The first 67 lines start with a pound sign (# sign) and consist of four columns with genome information. The first line with gene information is line 68. Scroll to line 68
- How many columns are in line 68?
- Which columns contain the start and the end position of the gene?
Data organisation and transformation using bash
After data file inspection we should have a good idea about how the data is organized and how we can clean the data file. How we organize our data file depends on the research question. What do we want to calculate. In assignment_6 we concluded that bacterial genes don’t have introns. Thus genes are much smaller in bacteria compared to eukaryotic genomes. An interesting question is to calculate whether gene transcripts differ in length between pro- and eukaryotic genomes?
“eu- or prokaryote” (= experimental variable) -> genome (model) -> “Transcript length (= outcome variable)
If we want to calculate the average transcript length (mRNA) we only want lines and columns with mRNA information (for example line 69). To select specific lines and columns in a data file we can use the bash grep
and the cut
commands
Bash: grep
The grep
function searches for patterns in each line of a data file and print out the lines containing the pattern on the screen. The grep
commands has many options and can search for very complicated patterns. More on that later in the course
Generally, the grep
command works as follows:
grep "pattern" file
It is also possible to select lines not containing the pattern with the -v option (select non-matching lines)
grep -v "pattern" file
To select lines containing mRNA:
grep -i "mRNA" lesson_02_opdracht5_Hpylori.gff3
We have added the -i option. Because on forehand we don’t know whether the word mRNA is written as mrna, mRNA, MRNA or perhaps MrNa. Because bash is case sensitive and we want to capture all lines containing the word “mRNA” we use the -i option which make the search case insensitive. In other words, it matches all variants of the pattern “mRNA” with or without capital letters
The output is printed to the screen. That is not very useful. If we want the output in a new file we have to re-direct the output to a new file. We can do that with the > sign followed by the name of the new file
grep -i "mRNA" lesson_02_opdracht5_Hpylori.gff3 > lesson_02_opdracht10_Hpylori_filtered_1.gff3
A new file has been created in your working directory (should be lesson_2). NOTE: The file name has changed from opdracht5 to opdracht10 in the file name!
- Use the
head
command to check whether each line contain mRNA.
Next we use the data file lesson_02_opdracht10_Hpylori_filtered_1.gff3 to get rid of columns that will not be used in the JASP analysis
Bash: cut
To select columns of a data file use the cut
command with the -f option ( = field option). The most relevant columns for our data analysis are the first 5 columns. To select the first five columns of our data file:
cut -f1,2,3,4,5 lesson_02_opdracht10_Hpylori_filtered_1.gff3 > lesson_02_opdracht10_Hpylori_filtered_2.gff3
- Use the
head
command to check whether there are only 5 columns left in your data file lesson_02_opdracht10_Hpylori_filtered_2.gff3 (Figure 29)
bash: combining commands in a pipeline (part 2)
We have cleaned the GFF3 data file with the bash grep
and the cut
commands. In the process of data cleaning we produced an intermediate file Hpylori_filtered_1. To remove this file use the bash rm
command
rm lesson_02_opdracht10_Hpylori_filtered_1.gff3
To avoid creating extra files (which we don’t use any further) we can combine the grep
and the cut
commands in a bash pipeline:
grep -i "mRNA" lesson_02_opdracht5_Hpylori.gff3 | cut -f1,2,3,4,5 > lesson_02_opdracht10_Hpylori_filtered.gff3.txt
NOTE: we added txt to the file name. Otherwise it will not be recognized by JASP in the next assignment!
- Also remove file lesson_02_opdracht10_Hpylori_filtered_2.gff3
- Continue with file lesson_02_opdracht10_Hpylori_filtered.gff3.txt
There is one more thing we have to fix. We have to add columns names to the output file. JASP by default expects column names. To add column names to the file lesson_02_opdracht10_Hpylori_filtered.gff3.txt run the following bash command (you don’t have to understand this codes for now)
sed -i '1s/^/contig\tsource\ttype\tstart\tend\n/' lesson_02_opdracht10_Hpylori_filtered.gff3.txt
- Export file lesson_02_opdracht10_Hpylori_filtered.gff3.txt to your computer
- Import the data file in JASP
lesson_02_assignment
- Calculate the average transcript (mRNA) length of genes in the Helicobacter pylori genome
- Compare your result to the average transcript length of human genes (see assignment 3?
- What is your conclusion?
- In JASP, save the file as lesson_02_opdracht10_Hpylori_filtered.gff3.jasp
- Upload the file to your home folder -> lesson_2
Thus, based on the answers of the two above questions we have discovered that the bacterial genome of Helicobacter pylori is differently organised. There is actually a third reason. There are hardly any large regions of non-coding DNA between genes; genes are placed very close together in the bacterial genome.
Bash: script
We have observed that the genome of Helicobacter pylori is differently organized compared to the human genome. But this is only one bacterial genome. The genome of Helicobacter pylori could perhaps not be representative of other bacterial genomes. Here, we provide a bash script to automate the data file cleaning + calculation of the average mRNA length of multiple input files.
In bash, if you want to repeat the same analysis you’ll have to type the bash commands again. To avoid retyping bash command, we write a bash script. A bash script is a file which contains all the different steps you want to execute. The name of a bash script always end with .sh. This way we only have to run the bash script and the terminal will execute all the different bash commands:
IMPORTANT: To record all the steps of your analysis in a script is the essence of reproducible science. Everyone can see which data analysis steps you performed on your data and by sharing your script and the original input data file, other people can reproduce your data analysis!
Generally, to execute a bash script:
bash name_of_bash_script.sh
To store the output of a script in a new file:
bash name_of_bash_script.sh > name_of_file
lesson_02_assignment
In the shared directory lesson_2 are two bash scripts
- Copy these scripts to your home directory
- Run these scripts in the terminal
What is the output of the scripts?
In the shared folder, directory lesson_2 is a sub-directory called assignment_12
- Go to the shared directory and inspect the contents of directory assignment_12
- Copy all files from the shared directory to your home directory (to copy a directory + contents use the -r option)
The script3.sh file contains instructions to
- automatically read a GFF3 file (data import)
- clean the GFF3 file (data cleaning and organisation)
- calculate the average mRNA length (descriptive statistics)
The complete code can be seen in Figure 30.
lesson_02_assignment
- Run script3.sh in the terminal
What is your conclusion?