Quality Filtering (QF)

Learning Objectives

Main
◇ Be able to assess the quality of read data.
◇ Know the limitations of fastqc.
Minor
◇ Know how to clean read data.
◇ Know when you need to clean read data.

Challenges

We will run quality control on the fastq files and then filter them accordingly. When you are finished with one example, you can move on to the next.

Log in to our server.

ssh guest??@gdc-vserver.ethz.ch

Download the data, unpack the tar archive, and navigate to the working directory QF.

cd ${HOME}  
curl -O "https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/QF.tar.gz" 
tar xvzf QF.tar.gz && rm QF.tar.gz 
cd QF
ls

First, check the integrity of the files.

md5sum -c md5sums.txt
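
As a minimal sketch of what this check does, the toy example below creates a checksum list and verifies it, then simulates a damaged file. The file name toy.fq and the temporary directory are made up for illustration; only md5sum and its -c option come from the command above.

```shell
# Toy demonstration in a temporary directory; toy.fq is a made-up file name.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/toy.fq"

# Create the checksum list (normally done by whoever publishes the data).
(cd "$tmp" && md5sum toy.fq > md5sums.txt)

# Verify: prints "toy.fq: OK" and exits with status 0 when the file is intact.
intact=$(cd "$tmp" && LC_ALL=C md5sum -c md5sums.txt)
echo "$intact"

# Simulate a corrupted download: the check now fails and exits non-zero.
printf 'N\n' >> "$tmp/toy.fq"
if (cd "$tmp" && LC_ALL=C md5sum -c md5sums.txt >/dev/null 2>&1); then
  status=ok
else
  status=corrupted
fi
echo "$status"
```

If any file is reported as failed in the real data, downloading the archive again is usually the first thing to try.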

Additional information about the datasets:

Example	File	Data Type	Platform
A	example_A.fq.gz	RNA-Seq	HiSeq4000
B	example_B.fq.gz	DNA-Seq	HiSeq4000
C	example_C.fq.gz	ddRAD	HiSeq2500
D	example_D.fq.gz	Amp-Seq	MiSeq
E	example_E.fq.gz	Amp-Seq	HiSeq2000
F	example_F.fq.gz	DNA-Seq	HiSeq2000
G	example_G.fq.gz	DNA-Seq	unknown
H	example_H.fq.gz	Metagenomics	HiSeq2500
I	example_I.fq.gz	DNA-Seq (10X)	NovaSeq
J	example_J.fq.gz	DNA-Seq (reverse reads)	NovaSeq
K	example_K.fq.gz	ancientDNA	MiSeq

Quality control

For the quality control we will use fastqc and fastq screen.

Fastqc provides the familiar read statistics, whereas fastq screen aligns a subset of the reads against standard databases (e.g. ribosomal RNA, adapters, human) to give a quick idea of what you have sequenced. Many sequencing facilities provide both outputs, so it is important to look at them closely.

The following command runs fastqc and generates an html report.

fastqc example_A.fq.gz
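
Alongside the fastqc report, a quick command-line summary can confirm the basic numbers (read count and read lengths). The sketch below is only an illustration: it builds a tiny made-up file (toy.fq.gz) and assumes the usual four-line fastq records without wrapped sequence lines.

```shell
# Build a tiny gzipped fastq file with two reads (made-up data).
tmp=$(mktemp -d)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' | gzip > "$tmp/toy.fq.gz"

# Every 2nd line of each 4-line record is the sequence: count reads and lengths.
summary=$(gzip -dc "$tmp/toy.fq.gz" | awk '
  NR % 4 == 2 {
    n++
    len = length($0)
    sum += len
    if (min == "" || len < min) min = len
    if (len > max) max = len
  }
  END { printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n }')
echo "$summary"
```

On the course data, the same awk program could be pointed at example_A.fq.gz in place of the toy file; the numbers should agree with the "Basic Statistics" section of the fastqc report.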

We have already provided the fastq screen outputs (example_A_screen.html). You can copy the html files to your computer (e.g. with Cyberduck) and open them in your browser.

❖ Challenge #1: What do you think about the sample? Is there a problem?

Quality filtering

❖ Challenge #2: Now filter the reads according to the problem you found. Run fastqc again to see whether you fixed the problem.

Find a list of all Illumina adapters here.

We are going to use BBDuk from the bbmap package. Below you find some important commands. If you need more information about BBDuk, just type bbduk.sh -h.

The following command trims bases with quality below Q15 from both ends and retains only reads that are at least 100 bp long after trimming.

bbduk.sh in=example.fq.gz out=example_trim.fq.gz qtrim=rl trimq=15 minlength=100
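
To get a feeling for what a threshold such as Q15 means, the sketch below converts Phred scores to error probabilities using the standard definition P = 10^(-Q/10). The helper name phred_to_prob is made up for this illustration.

```shell
# Convert a Phred quality score Q into an error probability P = 10^(-Q/10).
# phred_to_prob is a hypothetical helper, not part of any tool used here.
phred_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

for q in 10 15 20 30; do
  echo "Q$q -> $(phred_to_prob "$q")"
done
```

A base with Q15 therefore has roughly a 3% chance of being wrong, whereas Q30 corresponds to 0.1%.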

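You can also filter on an adapter sequence (AGAGCACACGTCTGAACTCCAGTCACTGACCAATC), for example as follows. Note that without ktrim, BBDuk discards the whole read when it matches the adapter rather than trimming it.

bbduk.sh in=example.fq.gz out=example_trim.fq.gz literal=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC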

If you expect only partial hits with the adapter (most often the case), you can set a k-mer length so that the adapter only needs to match partially. Here ktrim=r trims the adapter and everything to its right, and mink=11 allows k-mers down to 11 bp at the read end.

bbduk.sh in=example.fq.gz out=example_trim.fq.gz literal=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC ktrim=r k=23 mink=11
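
To make the k and mink values more concrete, the sketch below prints the first 23 and the first 11 bases of the adapter and counts how many reads in a tiny made-up file contain the 11-base prefix. This is a plain exact-string search for illustration only; it is not how BBDuk matches k-mers internally, and the toy reads are invented.

```shell
adapter=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC

# Prefixes of the adapter corresponding to k=23 and mink=11.
kmer23=$(printf '%s' "$adapter" | cut -c1-23)
kmer11=$(printf '%s' "$adapter" | cut -c1-11)
echo "k=23 prefix: $kmer23"
echo "k=11 prefix: $kmer11"

# Toy fastq: r1 ends with the start of the adapter, r2 is adapter-free.
tmp=$(mktemp -d)
q10=IIIIIIIIII
{
  printf '@r1\nTTTTAGAGCACACGTCTGAACTCCAGTCAC\n+\n%s\n' "$q10$q10$q10"
  printf '@r2\nACGTACGTACGT\n+\nIIIIIIIIIIII\n'
} | gzip > "$tmp/toy.fq.gz"

# Count sequence lines (2nd line of each record) containing the 11-base prefix.
hits=$(gzip -dc "$tmp/toy.fq.gz" |
  awk -v k="$kmer11" 'NR % 4 == 2 && index($0, k) { n++ } END { print n + 0 }')
echo "reads containing the 11-base adapter prefix: $hits"
```

A count like this, taken before and after trimming, gives a rough indication of whether adapter removal worked; fastqc's adapter content plot remains the more thorough check.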

The following command trims partial adapter hits, trims low-quality bases, and keeps only reads of at least 100 bp.

bbduk.sh in=example.fq.gz out=example_trim.fq.gz literal=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC ktrim=r k=23 mink=11 qtrim=rl trimq=15 minlength=100
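
If you want to apply one set of options to several examples, a dry run that only prints the commands is a cautious first step. The sketch below assumes the file naming from the table above (example_A.fq.gz and so on); the function name print_trim_commands is made up, and the same adapter and length cutoff are unlikely to suit every dataset, so treat the printed commands as a starting point.

```shell
adapter=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC

# Print (do not run) one bbduk.sh command per example_?.fq.gz file in a directory.
# print_trim_commands is a hypothetical helper for this illustration.
print_trim_commands() {
  dir=$1
  for f in "$dir"/example_?.fq.gz; do
    [ -e "$f" ] || continue               # skip if the pattern matched nothing
    out="${f%.fq.gz}_trim.fq.gz"          # example_A.fq.gz -> example_A_trim.fq.gz
    echo "bbduk.sh in=$f out=$out literal=$adapter ktrim=r k=23 mink=11 qtrim=rl trimq=15 minlength=100"
  done
}

# Dry run in the current directory (prints nothing if no example files exist).
print_trim_commands .
```

Once the printed commands look right, they can be copied and adjusted per sample.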

❖ Challenge #3: Other tools do not require you to specify whether you want adapter removal or quality trimming; everything is performed in one go. Fastp is still under development but has many powerful options and is really fast.

conda activate /usr/bin/condaenv/fastp
fastp -i example_A.fq.gz -o example_A_trim.fq.gz

Some suggestions will be available here.
