Quality Control & Filtering

Lecture notes

Quality control and filtering (pdf)

Exercise

We are going to run a quality check and then filter the FASTQ files accordingly. Please work in groups of two and choose one of the datasets. When you are finished, you can continue with the next one. During the exercise, enter your comments directly in the GoogleDoc file using the link here.

Log in to our server and go to the folder TU/QC, where you will find all data for this exercise.

ssh studentX@gdcsrv1.ethz.ch

cd TU/QUAL

Additional information on the FASTQ datasets:

A: example_A.fq.gz: RNAseq data, HiSeq4000

B: example_B.fq.gz: DNAseq, HiSeq4000

C: example_C.fq.gz: ddRAD data, HiSeq2500

D: example_D.fq.gz: Amplicon Sequencing, MiSeq

E: example_E.fq.gz: Amplicon Sequencing, HiSeq2000

F: example_F.fq.gz: DNAseq data, HiSeq2000

G: example_G.fq.gz: DNAseq data, unknown

H: example_H.fq.gz: Metagenomics, HiSeq2500

I: example_I.fq.gz: DNAseq (10X), NovaSeq

1. Quality check

With the following command you can run FastQC, which generates an HTML report.

fastqc example_A.fq.gz
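Alongside the FastQC report, a quick look at the raw file can help you judge what you are dealing with. The following is only a minimal sketch: the function name fastq_summary is made up for this exercise, and it assumes the standard four-line FASTQ layout (no wrapped sequence lines).

```shell
# Print read count, shortest and longest read length of a gzipped FASTQ file.
# fastq_summary is a hypothetical helper, not part of FastQC.
# Assumes each record is exactly 4 lines (header, sequence, +, quality).
fastq_summary() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 {                      # every 2nd line of a record is the sequence
            n++
            len = length($0)
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        END { printf "reads=%d min_len=%d max_len=%d\n", n, min, max }'
}
```

For example, fastq_summary example_A.fq.gz prints one line with the number of reads and the range of read lengths.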

We have already provided the FastQ Screen outputs (example_A_screen.html). You can copy the HTML files to your computer (e.g. with Cyberduck) and open them in your browser.

(1) What do you think about the data? Is there a problem?

2. Quality filtering

Depending on the problem you have found in your dataset, you can filter the reads using BBDuk from the BBMap package. Below you will find some important commands. If you need more information about BBDuk, just type bbduk.sh -h.

Find a list of all Illumina adapters here

(1) The following command trims bases with quality below Q15 from both ends and retains only reads that are at least 100 bp long after trimming.

bbduk.sh in=example_fq.gz out=example_trim.fq qtrim=rl trimq=15 minlength=100
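To get a feeling for what a threshold such as Q15 means, recall that a Phred score Q corresponds to an error probability of 10^(-Q/10). The small helper below (phred_to_error is a name invented here, not a BBDuk option) does the conversion as a rough sketch.

```shell
# Convert a Phred quality score into the expected base-call error probability.
# phred_to_error is a hypothetical helper for illustration only.
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_error 15   # about 3 errors per 100 bases
phred_to_error 30   # about 1 error per 1000 bases
```

A base at Q15 is therefore wrong in roughly 3% of cases, which is why low-quality read ends are usually trimmed before downstream analysis.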

And/or you can, for example, filter out adapters (AGAGCACACGTCTGAACTCCAGTCACTGACCAATC) as follows. Without ktrim, BBDuk discards every read that contains the adapter instead of trimming it.

bbduk.sh in=example_fq.gz out=example_trim.fq literal=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC

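If you expect only partial matches to the adapter (which is most often the case), you can trim with shorter kmers: ktrim=r trims the adapter and everything to its right, k sets the kmer length, and mink allows shorter kmers at the read ends.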

bbduk.sh in=example_fq.gz out=example_trim.fq literal=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC ktrim=r k=23 mink=11
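To see why the kmer length matters, you can count how many reads contain the first k bases of the adapter. This is only a crude sketch with a made-up helper name (count_adapter_reads): it looks for exact matches on the forward strand and ignores mismatches and reverse-complement hits, so it is not a replacement for BBDuk.

```shell
# Count reads in a gzipped FASTQ file that contain the first k bases of an adapter.
# count_adapter_reads is a hypothetical helper; exact forward-strand matches only.
# Usage: count_adapter_reads FILE ADAPTER [K]   (K defaults to 23)
count_adapter_reads() {
    car_file=$1
    car_adapter=$2
    car_k=${3:-23}
    car_prefix=$(printf '%s' "$car_adapter" | cut -c1-"$car_k")
    gzip -dc "$car_file" | awk -v p="$car_prefix" '
        NR % 4 == 2 && index($0, p) > 0 { n++ }   # sequence lines only
        END { print n + 0 }'
}
```

Running it once with K=23 and once with K=11 on the same file usually shows more hits for the shorter kmer, because adapters at the read end are often only partially sequenced.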

(2) Now you can rerun FastQC to verify that the filtering worked.
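Besides the FastQC report, it is worth checking how many reads survived the filtering. The sketch below uses two made-up helpers (count_reads and retained_pct) and assumes four-line FASTQ records; it accepts gzipped (.gz) or plain files, since the trimmed output above is written uncompressed.

```shell
# Count reads in a FASTQ file, gzipped (.gz) or plain.
# count_reads and retained_pct are hypothetical helpers for this exercise.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# Percentage of reads that survived filtering: retained_pct BEFORE AFTER
retained_pct() {
    rp_before=$(count_reads "$1")
    rp_after=$(count_reads "$2")
    awk -v b="$rp_before" -v a="$rp_after" 'BEGIN {
        if (b == 0) print "NA"             # avoid division by zero on empty input
        else printf "%.1f\n", 100 * a / b
    }'
}
```

For example, retained_pct example_fq.gz example_trim.fq prints the percentage of reads kept. A very low value suggests that the filter settings are too strict for the dataset.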

(3) Other tools do not require you to specify whether you want to remove adapters or do quality trimming; everything is performed in one go. fastp is still under development but has many powerful options.

fastp -i example2_fq.gz -o example2_trim.fq

Additional information