Alignment
Learning Objectives
Main
◇ Know the differences between alignment and read mapping.
◇ Know how to preform read mapping.
Minor
◇ Know how to predict read coverage.
◇ Get an graphical impression of mapping.
Lecture Notes
⬇︎ Alignment
Challenges
Multiple sequence alignments (MSA)
In our first challenge we will perform a multiple sequence alignment of different SARS-CoV-2 strains.
There are many tools available to align multiple fasta sequences. We are going to use a webtool as there are only few sequences and the genome of the virus is pretty small. Download the fasta file using this link here.
❖ Challenge #1 How many strains do we have?
Suggestion #1
grep ">" -c SARS-CoV-2.fasta
or count ">" in your editor.
We are going to use mafft which runs at EMBL. Just use all default settings and upload the fasta file.
❖ Challenge #2 Have a look at the alignment. What is the difference between an alignment and a fasta file?
❖ Challenge #3 Have a look at the Neighbour-joining tree? What do you think about the topology?
Coverage Calculation
❖ Challenge #4 We are going to generate SNPs data in a non-model species using Illumina short reads. Use the coverage calculator and find out how many samples you can run in 1 SP flow cell (samllest physical unite).
Platform: NovaSeq 6000
Protocol: PE 2X150 bp
Genome size (2n): 0.600 Gb
Read mapping
For our next challenge we going to use a reduced data set containing different strains of the parasite Crithidia bombi infecting bumblebees. Find more information here.
We have 4 samples ERR3418932, ERR341896, ERR3418937 and ERR3418936 from different strains. To speed up the analysis we are going to use only 1 scaffold of the Crithidia bombi genome. Find more information about the genome here.
Login to our server.
ssh guest?@gdc-vserver.ethz.ch
Let's download the data and open the tar file.
cd ${HOME}
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/mapping.tar.gz
tar xvzf mapping.tar.gz && rm mapping.tar.gz
cd mapping
In your folder you will find the following files.
raw reads (e.g. ERR3418932_1.fq.gz and ERR3418932_2.fq.gz)
reference genome (Ref_C_bombi.fasta)
annotation e.g. where genes are located (Ref_C_bombi.gff)
First, we need to index the reference.
bwa-mem2 index Ref_C_bombi.fasta
❖ Challenge #5 Why we need to index the reference?
Suggestion #5
Indexing the reference genome massively improves speed and efficiency, making read mapping feasible for large-scale sequencing data.
In order to make our code easier to read (e.g. more reproducible) we define variables.
sample=ERR3418937
Ref=Ref_C_bombi.fasta
We map reads to the reference using the default settings of bwa-mem2 and add a read group and output a sam file.
bwa-mem2 mem -R "@RG\tID:${sample}\tSM:${sample}\tPL:Illumina" ${Ref} ${sample}_1.fq.gz ${sample}_2.fq.gz > ${sample}.sam
❖ Challenge #6 Why we should never keep sam files? And what is a read group?
Suggestion #6
We compress sam to bam files to reduced disk space. A read group corresponds in many cases to the sample names.
Now, let's sort the sam file and compress it.
samtools sort ${sample}.sam -o ${sample}_sort.bam
❖ Challenge #7 Check out the mapping statistics. How many reads could you map?
samtools flagstat ${sample}_sort.bam
flagstats output
Flagstats | Information |
---|---|
total | All reads |
mapped | All mapped reads |
paired in sequencing | R1 and R2 could be mapped |
properly paired | R1 and R2 could be mapped within insert range |
with itself and mate mapped | R1 and R2 could be mapped also outside insert range |
singletons | Either R1 or R2 could be mapped |
mate mapped to a different chr | R1 or R2 could be mapped on different chromosomes |
Now, we remove reads with mapping quality lower than 20 to get rid of ambiguous alignments.
samtools view -bh -q 20 -F 0x800 -F 0x100 ${sample}_sort.bam > ${sample}_sort_Q20.bam
❖ Challenge #8 What does the flags 0x800 and 0x100 mean? Use the Sam flag decoder.
Suggestion #8
We remove supplementary and not primary alignment, which are low quality hits.
To account possible PCR duplicates we remove them and index the final bam file.
samtools collate -O -u ${sample}_sort_Q20.bam | samtools fixmate -m -u - - | samtools sort -u - | samtools markdup -r -f dup_${sample} - ${sample}_sort_Q20_fix_nodup.bam
samtools index ${sample}_sort_Q20_fix_nodup.bam
❖ Challenge #9 Check again the mapping statistics to find out how many reads could have been mapped? What is the mapping rate?
Let's calculate the mean coverage.
samtools coverage ${sample}_sort_Q20_fix_nodup.bam
If you like to have a more detailed report you can use qualimap.
qualimap bamqc -bam ${sample}_sort_Q20_fix_nodup.bam
❖ Challenge #10 Transfer the qualimap folder to your local computer and open the HTML file in your browser. How meaningful are mean values in this context?
To keep our folder tidy, let's delete the intermediate files.
rm ${sample}.sam ${sample}_sort.bam* ${sample}_sort_Q20.bam*
As we have now 3 more samples to map we are going to use bash loop to process all the samples in one command.
First, we generate a sample list and remove the endings.
ls *_1.fq.gz | sed 's/_1.fq.gz//' > sample.lst
Then we use a loop to map all the samples in sample.lst
#!/bin/bash
##define variable
Ref=Ref_C_bombi.fasta
##bash loop to map reads against the genome
for sample in $(cat sample.lst)
do
bwa-mem2 mem -R "@RG\tID:${sample}\tSM:${sample}\tPL:Illumina" ${Ref} ${sample}_1.fq.gz ${sample}_2.fq.gz > ${sample}.sam
samtools sort ${sample}.sam -o ${sample}_sort.bam
samtools index ${sample}_sort.bam
samtools flagstat ${sample}_sort.bam > ${sample}_sort.stats
samtools view -hb -q 20 -F 0x800 -F 0x100 ${sample}_sort.bam > ${sample}_sort_Q20.bam
samtools index ${sample}_sort_Q20.bam
samtools collate -O -u ${sample}_sort_Q20.bam | samtools fixmate -m -u - - | samtools sort -u - | samtools markdup -r -f dup_${sample} - ${sample}_sort_Q20_fix_nodup.bam
samtools index ${sample}_sort_Q20_fix_nodup.bam
samtools flagstat ${sample}_sort_Q20_fix_nodup.bam > ${sample}_sort_Q20_fix_nodup.stats
samtools coverage ${sample}_sort_Q20_fix_nodup.bam > ${sample}_sort_Q20_fix_nodup.cov
rm ${sample}.sam ${sample}_sort.bam* ${sample}_sort_Q20.bam*
done
Let's copy the script above in a text file and call it mapping.sh
. Add comments to make it reproducible.
We need then to make the file executable.
chmod +x mapping.sh
And execute our bash script.
./mapping.sh
❖ Challenge #11 Have a look at the mapping statistics. What do you think about the mapping rates across the samples (check out _sort.stats
and _sort_Q20_nodup.stats
)?
Suggestion #11
| Sample | total | mapped | % | After filtering | % |Coverage|
|------------|--------|--------|------|-----------------|------|--------|
| ERR3418932 | 130943 | 122626 | 0.94 | 108009 | 0.83 | 90 |
| ERR3418936 | 126253 | 123034 | 0.97 | 110916 | 0.88 | 90 |
| ERR3418937 | 98889 | 77515 | 0.79 | 67488 | 0.68 | 90 |
| ERR3418961 | 106936 | 4755 | 0.04 | 3525 | 0.03 | 15 |
grep "in total (" *_sort.stats
grep "primary mapped (" *_sort.stats
grep "primary mapped (" *_Q20_nodup.stats
awk FNR!=1 *cov|cut -f6 *cov
Visualization
❖ Challenge #12 How noisy are the mappings? Get a graphical impression. Make some print screens.
First, we need to index the fasta file.
samtools faidx Ref_C_bombi.fasta
Download and install IGV on your own computer use the link here.
Transfer the mappings (bams), the reference (fasta and the Index) and the annotation (gff) file.
Open IGV and import the fasta file under "Genomes" > "Load Genomes from file .." Drag and drop the bam and the gff file.