SNPs

Lecture Notes

⬇︎ SNPs

Challenges

Post your comments, questions and ideas directly on the white board.

Multiple sequence alignments (MSA)

In our first challenge we will perform a multiple sequence alignment of different SARS-CoV-2 strains.

There are many tool available to align multiple fasta sequences. We are going to use a webtool as there are only few sequences and the genome of the virus is pretty small. Download the fasta file using this link here.

(1) How many strains do we have?

Suggestion

grep ">" -c SARS-CoV-2.fasta

or count ">" in your editor.

We are going to use mafft which runs at EMBL. Just use all default settings and upload the fasta file.

mafft

other MSA tool

(2) Have a look at the alignment. What is the difference between an alignment and a fasta file?

(3) Have a look at the Neighbour-joining tree? What do you think about the topology?

Variants

Variant types

Below you find the most common used variant types. For most downstream analysis only SNPs can be used.

Type	Ref	Alt
SNP	A	T
Insertion	A	AT
Deletion	AT	A
Indel	Insertion or deletion
MNP	AT	GC
CLUMPED	ATTTT	GTTTC

Find below some putative SNPs/variants. Is it a SNP or not and why?

Suggestion

A clear SNP as sufficient forward and reverse reads support it.

Suggestion

Could be an indel but it is not well supported.

Suggestion

Not a clear SNP as the alternative allele of the SNP is not well supported.

Suggestion

Not clear as the SNPs are supported by only 1 read.

Variant calling

For our next challenge we going to use a reduced data set containing different strains of the parasite Crithidia bombi infecting bumblebees. Find more information here.

We have 4 samples ERR3418932, ERR341896, ERR3418937 and ERR3418936 form different strains. To speed up the analysis we are going to use only 1 scaffold of the Crithidia bombi genome. Find more information about the genome here.

Login to our server.

ssh studentX@gdcsrv2.ethz.ch

Generate a new working folder, open it, download the data and open the tar file.

mkdir SNPs
cd SNPs
wget https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/SNP.tar.gz 
tar xvzf SNP.tar.gz 
ls

In your folder you will find the following files.

raw reads (e.g. ERR3418932_1.fq.gz and ERR3418932_2.fq.gz)
reference genome (Ref_C_bombi.fasta)
annotation e.g. where the genes are located (Ref_C_bombi.gff)

The aim is to first map the reads against the genome, do a variant calling, filter the variants and then check out manually some of the SNPs.

Mapping

First, we need to index the reference.

bwa index Ref_C_bombi.fasta

In order to make our code easier to read (e.g. more reproducible) we define variables.

sample=ERR3418937
Ref=Ref_C_bombi.fasta

We map reads to the reference using the default settings of bwa mem, add a read group and output a bam file.

bwa mem -R "@RG\tID:${sample}\tSM:${sample}\tPL:Illumina" ${Ref} ${sample}_1.fq.gz ${sample}_2.fq.gz |samtools view -b - > ${sample}.raw.bam

Now, we need to sort bam file.

samtools sort ${sample}.raw.bam -o ${sample}.sort.bam

(1) Check out the mapping statistics. How many reads could you map?

samtools flagstat ${sample}.sort.bam

Flagstat-output

Flags	What does that mean?
total	Total reads
mapped	All mapped reads
paired in sequencing	R1 and R2 could be mapped
properly paired	R1 and R2 could be mapped within insert range
with itself and mate mapped	R1 and R2 could be mapped also outside insert range
singletons	Either R1 or R2 could be mapped
mate mapped to a different chr	R1 or R2 could be mapped on different chromosomes

If entire or patial reads can be mapped on different locations the read gets duplicated when using the default settings of BWA. If you have many more reads than in your fastq file then you have a problem with the genome. Such multi-mapping reads will be removed during quality mapping quality filtering. Most of the reads should be properly paired, if not there is a problem with the reference used. If you do a mapping quality filtering, you need to extract the total number of reads from the unfiltered bam file.

Now, we remove reads with mapping quality lower than 20 to get rid of ambiguous alignments.

samtools view -bh -q 20 ${sample}.sort.bam > ${sample}.sort.Q20.bam

To account possible PCR duplicates we remove them and index the final bam file.

samtools rmdup ${sample}.sort.Q20.bam ${sample}.sort.Q20.nodup.bam
samtools index ${sample}.sort.Q20.nodup.bam

(2) Check again the mapping statistics to find out how many reads could have been mapped? What is the mapping rate?

To keep our folder tidy, let's delete the intermediate files.

rm ${sample}.raw.bam ${sample}.sort.bam ${sample}.sort.Q20.bam

As we have now 3 more samples to map we are going to use bash loop to process all the samples in one command.

First, we generate a sample list and remove the endings.

ls *_1.fq.gz|sed 's/_1.fq.gz//' > sample.list

Then we use a loop to map all the samples in sample.list

#!/bin/bash

##define variable
Ref=Ref_C_bombi.fasta

##get sample list
##ls *_1.fq.gz|sed 's/_1.fq.gz//' > sample.list

##bash loop to map reads against the genome
for sample in $(cat sample.list)
do
    bwa mem -R "@RG\tID:${sample}\tSM:${sample}\tPL:Illumina" ${Ref} ${sample}_1.fq.gz ${sample}_2.fq.gz \
    |samtools view -b - > ${sample}.raw.bam
    samtools sort ${sample}.raw.bam -o ${sample}.sort.bam
    samtools flagstat ${sample}.sort.bam > ${sample}.sort.stats
    samtools view -hb -q 20 ${sample}.sort.bam > ${sample}.sort.Q20.bam
    samtools rmdup ${sample}.sort.Q20.bam ${sample}.sort.Q20.nodup.bam
    samtools index ${sample}.sort.Q20.nodup.bam
    samtools flagstat ${sample}.sort.Q20.nodup.bam > ${sample}.sort.Q20.nodup.stats
    rm ${sample}.raw.bam ${sample}.sort.bam ${sample}.sort.Q20.bam
done

Let's copy the script above in a text file and call it mapping.sh. Add comments. We need then to make the file executable.

chomd +x mapping.sh

And execute our bash script.

./mapping.sh

(3) Have a look at the mapping statistics. What do you think about the mapping rates across the samples (check out .sort.stats and .sort.Q20.nodup.stats)?

Suggestion

| Sample     | total  | mapped | %    | After filtering | %    |
|------------|--------|--------|------|-----------------|------|
| ERR3418932 | 130943 | 122767 | 0.94 | 108224          | 0.83 |
| ERR3418936 | 126253 | 123067 | 0.97 | 110527          | 0.88 |
| ERR3418937 | 98889  | 77652  | 0.79 | 67609           | 0.68 |
| ERR3418961 | 106936 | 4798   | 0.04 | 2933            | 0.03 |

grep "in total (" *.sort.stats 
grep "mapped (" *.sort.stats
grep "mapped (" *.Q20.nodup.stats

SNP calling

First, we need to generate a list with all bam files as we like to do a joint calling.

ls -1 *.sort.Q20.nodup.bam > bamfiles.txt

Let's now run FreeBayes using the default settings.

freebayes -p 2 -L bamfiles.txt -v raw.vcf -f ${Ref}

(1) How many SNPs and other variants do you have in total?

We count the number of lines but exclude lines starting with #.

cat raw.vcf| grep -v '^#' | wc -l

(2) Have a look at the SNP table. Try to understand the vcf file format. What are the genotypes of the 4 samples on scaffold3_8_ref0000002_ref0000084 and position 1760.

Suggestions

    scaffold3_8_ref0000002_ref0000084    1760
    ERR3418961   NA
    ERR3418937   TT
    ERR3418936  CC
    ERR3418932  CT

SNP filtering

In order to filter false positive SNPs we are going to apply some hard filters. The filters presented here is only a subset. When we are working with RAD SNPs, we will see more.

Set sites with less than 3 reads to NA and keep only sites with a minimum mapping quality of 20.

vcftools --vcf raw.vcf --minQ 20 --minDP 3  --recode --recode-INFO-all --out raw_Q20_minDP3

(1) How many variants are you retaining?

Remove sites with more than 5% missing sites.

vcftools --vcf raw_Q20_minDP3.recode.vcf --max-missing 0.95 --recode --recode-INFO-all --out raw_Q20_minDP3_NA95

(2) How many variants do you retain?

Let's decompose haplotypes into single SNPs.

vcfallelicprimitives -g -k raw_Q20_minDP3_NA95.recode.vcf > raw_Q20_minDP3_NA95_SNPs.vcf

Finally, we only keep biallelic sites and remove indels.

vcftools --vcf raw_Q20_minDP3_NA95_SNPs.vcf --min-alleles 1 --max-alleles 2  --remove-indels --out raw_Q20_minDP3_NA95_SNPs_noIndels

(3) What is the final numbers of SNPs in your table?

Manual inspection of the SNPs

First, we need to index the fasta file.

samtools faidx Ref_C_bombi.fasta

Download and install IGV on your own computer use the link here.

Transfer the alignments (bams), the raw and the filtered SNP table (vcf), the reference (fasta and the Index) and the annotation (gff) file.

Open IGV and create a new genome.

Drag and drop the bam and vcf files. Can you verify some of the SNPs manually? Do you see differences between the filter and the raw vcf? Make some print screens and post them.

SNPs

Lecture Notes

Challenges

Multiple sequence alignments (MSA)

Variants

Variant types

Variant calling

Mapping

Flagstat-output

SNP calling

SNP filtering

Manual inspection of the SNPs

Additional Information

File formats

Alignments

SNPs