Alignment/Mapping

Learning Objectives

Main
◇ Know the differences between alignment and read mapping.
◇ Know how to preform read mapping.
Minor
◇ Know how to predict read coverage.
◇ Get an graphical impression of mapping.

Challenges

Multiple sequence alignments (MSA)

In our first challenge we will perform a multiple sequence alignment of different SARS-CoV-2 strains.

There are many tools available to align multiple fasta sequences. We are going to use a webtool as there are only few sequences and the genome of the virus is pretty small. Download the fasta file using this link here.

❖ Challenge #1 How many strains do we have?

Suggestion #1

grep ">" -c SARS-CoV-2.fasta

or count ">" in your editor.

We are going to use mafft which runs at EMBL. Just use all default settings and upload the fasta file.

mafft

other MSA tool

❖ Challenge #2 Have a look at the alignment. What is the difference between an alignment and a fasta file?

❖ Challenge #3 Have a look at the Neighbour-joining tree? What do you think about the topology?

Coverage Calculation

❖ Challenge #4 We are going to generate SNPs data in a non-model species using Illumina short reads. Use the coverage calculator and find out how many samples you can run in 1 SP flow cell (samllest physical unite).

Platform: NovaSeq 6000 
Protocol: PE 2X150 bp 
Genome size (2n): 0.600 Gb

Read mapping

For our next challenge we going to use a reduced data set containing different strains of the parasite Crithidia bombi infecting bumblebees. Find more information here.

We have 4 samples ERR3418932, ERR341896, ERR3418937 and ERR3418936 from different strains. To speed up the analysis we are going to use only 1 scaffold of the Crithidia bombi genome. Find more information about the genome here.

Login to our server.

ssh guest?@gdc-vserver.ethz.ch

Let's download the data and open the tar file.

cd ${HOME}  
curl -O https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/mapping.tar.gz 
tar xvzf mapping.tar.gz && rm mapping.tar.gz 
cd mapping

In your folder you will find the following files.

raw reads (e.g. ERR3418932_1.fq.gz and ERR3418932_2.fq.gz)
reference genome (Ref_C_bombi.fasta)
annotation e.g. where genes are located (Ref_C_bombi.gff)

First, we need to index the reference.

bwa-mem2 index Ref_C_bombi.fasta

❖ Challenge #5 Why we need to index the reference?

Suggestion #5

Indexing the reference genome massively improves speed and efficiency, making read mapping feasible for large-scale sequencing data.

In order to make our code easier to read (e.g. more reproducible) we define variables.

SAMPLE=ERR3418937
REF=Ref_C_bombi.fasta

We map reads to the reference using the default settings of bwa-mem2 and add a read group and output a sam file.

bwa-mem2 mem -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:Illumina" ${REF} ${SAMPLE}_1.fq.gz ${SAMPLE}_2.fq.gz > ${SAMPLE}.sam

❖ Challenge #6 Why we should never keep sam files? And what is a read group?

Suggestion #6

We compress sam to bam files to reduced disk space. A read group corresponds in many cases to the sample names.

Now, let's sort the sam file and compress it.

samtools sort ${SAMPLE}.sam -o ${SAMPLE}_sort.bam

❖ Challenge #7 Check out the mapping statistics. How many reads could you map?

samtools flagstat ${SAMPLE}_sort.bam

flagstats output

Flagstats	Information
total	All reads
mapped	All mapped reads
paired in sequencing	R1 and R2 could be mapped
properly paired	R1 and R2 could be mapped within insert range
with itself and mate mapped	R1 and R2 could be mapped also outside insert range
singletons	Either R1 or R2 could be mapped
mate mapped to a different chr	R1 or R2 could be mapped on different chromosomes

Now, we remove reads with mapping quality lower than 20 to get rid of ambiguous alignments.

samtools view -bh -q 20 -F 0x800 -F 0x100 ${SAMPLE}_sort.bam > ${SAMPLE}_sort_Q20.bam

❖ Challenge #8 What does the flags 0x800 and 0x100 mean? Use the Sam flag decoder.

Suggestion #8

We remove supplementary and not primary alignment, which are low quality hits. The numbers here are presented as hexadecimal numbers. Q20 as decimal number.

To account possible PCR duplicates we remove them and index the final bam file.

samtools collate -O -u ${SAMPLE}_sort_Q20.bam | samtools fixmate -m -u - - | samtools sort -u - | samtools markdup -r -f dup_${SAMPLE} - ${SAMPLE}_sort_Q20_fix_nodup.bam 
samtools index ${SAMPLE}_sort_Q20_fix_nodup.bam

❖ Challenge #9 Check again the mapping statistics to find out how many reads could have been mapped? What is the mapping rate?

Let's calculate the mean coverage.

samtools coverage ${SAMPLE}_sort_Q20_fix_nodup.bam

If you like to have a more detailed report you can use qualimap.

qualimap bamqc -bam ${SAMPLE}_sort_Q20_fix_nodup.bam

❖ Challenge #10 Transfer the qualimap folder to your local computer and open the HTML file in your browser. How meaningful are mean values in this context?

To keep our folder tidy, let's delete the intermediate files.

rm ${SAMPLE}.sam ${SAMPLE}_sort.bam* ${SAMPLE}_sort_Q20.bam*

As we have now 3 more samples to map we are going to use bash loop to process all the samples in one command.

First, we generate a sample list and remove the endings.

ls *_1.fq.gz | sed 's/_1.fq.gz//' > sample.lst

Then we use a loop to map all the samples in sample.lst

#!/bin/bash

##define variable
REF=Ref_C_bombi.fasta


##bash loop to map reads against the genome
for SAMPLE in $(cat sample.lst)
do
    bwa-mem2 mem -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:Illumina" ${REF} ${SAMPLE}_1.fq.gz ${SAMPLE}_2.fq.gz > ${SAMPLE}.sam
    samtools sort ${SAMPLE}.sam -o ${SAMPLE}_sort.bam
    samtools flagstat ${SAMPLE}_sort.bam > ${SAMPLE}_sort.stats
    samtools view -hb -q 20 -F 0x800 -F 0x100 ${SAMPLE}_sort.bam > ${SAMPLE}_sort_Q20.bam
    samtools index ${SAMPLE}_sort_Q20.bam
    samtools collate -O -u ${SAMPLE}_sort_Q20.bam | samtools fixmate -m -u - - | samtools sort -u - | samtools markdup -r -f dup_${SAMPLE} - ${SAMPLE}_sort_Q20_fix_nodup.bam 
    samtools index ${SAMPLE}_sort_Q20_fix_nodup.bam
    samtools flagstat ${SAMPLE}_sort_Q20_fix_nodup.bam > ${SAMPLE}_sort_Q20_fix_nodup.stats
    samtools coverage ${SAMPLE}_sort_Q20_fix_nodup.bam > ${SAMPLE}_sort_Q20_fix_nodup.cov
    rm ${SAMPLE}.sam ${SAMPLE}_sort.bam* ${SAMPLE}_sort_Q20.bam*
done

Let's copy the script above in a text file and call it mapping.sh. Add comments to make it reproducible. We need then to make the file executable.

chmod +x mapping.sh

And execute our bash script.

./mapping.sh

❖ Challenge #11 Have a look at the mapping statistics. What do you think about the mapping rates across the samples (check out _sort.stats and _sort_Q20_nodup.stats)?

Suggestion #11

Sample	total	mapped	%	After filtering	%	Coverage
ERR3418932	130943	122626	0.94	108009	0.83	90
ERR3418936	126253	123034	0.97	110916	0.88	90
ERR3418937	98889	77515	0.79	67488	0.68	90
ERR3418961	106936	4755	0.04	3525	0.03	15

To get the total number of reads, we are going to use the unfiltered BAM file to avoid mapping more reads than you have in your FASTQ file.

grep "in total (" *_sort.stats 
grep "primary mapped (" *_sort.stats
grep "primary mapped (" *_Q20_nodup.stats
awk FNR!=1 *cov|cut -f6 *cov

The remapping rate of the last sample is rather low. We will take another look at this sample again in the SNP section.

Visualization

❖ Challenge #12 How noisy are the mappings? Get a graphical impression. Make some print screens.

First, we need to index the fasta file.

samtools faidx Ref_C_bombi.fasta

Download and install IGV on your own computer use the link here.

Transfer the mappings (bams), the reference (fasta and the Index) and the annotation (gff) file.

Open IGV and import the fasta file under "Genomes" > "Load Genomes from file .." Drag and drop the bam and the gff file.

Alignment/Mapping

Challenges

Multiple sequence alignments (MSA)

Coverage Calculation

Read mapping

Visualization

Additional Information

File formats

Alignment