Quality Control & Filtering
Lecture notes
Quality control and filtering (pdf)
Exercise
We are going to perform quality control and then filter the FASTQ files accordingly. Please work in groups of two and choose one of the datasets. When you are finished, you can move on to the next one. During the exercise, enter your comments directly in the GoogleDoc file using the link here.
Log in to our server and go to the exercise folder, where you will find all data for this exercise.
ssh studentX@gdcsrv1.ethz.ch
cd TU/QUAL
Additional information on the FASTQ datasets:
A: example_A.fq.gz: RNAseq data, HiSeq4000
B: example_B.fq.gz: DNAseq, HiSeq4000
C: example_C.fq.gz: ddRAD data, HiSeq2500
D: example_D.fq.gz: Amplicon Sequencing, MiSeq
E: example_E.fq.gz: Amplicon Sequencing, HiSeq2000
F: example_F.fq.gz: DNAseq data, HiSeq2000
G: example_G.fq.gz: DNAseq data, unknown
H: example_H.fq.gz: Metagenomics, HiSeq2500
I: example_I.fq.gz: DNAseq (10X), NovaSeq
1. Quality check
With the following command you can run FastQC, which generates an HTML report.
fastqc example_A.fq.gz
We have already provided the FastQ Screen outputs (example_A_screen.html). You can copy the HTML files to your computer (e.g. with Cyberduck) and open them in your browser.
(1) What do you think about the data? Is there a problem?
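Alongside the FastQC report, a quick look at the raw numbers can help when judging a dataset. The following is a minimal sketch, assuming a well-formed gzip-compressed FASTQ file with four lines per record; the function name fastq_summary is made up for this illustration and is not part of any tool used in the exercise.

```shell
# fastq_summary FILE: print read count and min/mean/max read length.
# Assumption: FILE is gzip-compressed and has exactly four lines per record.
fastq_summary() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 {                       # line 2 of each record is the sequence
            len = length($0)
            n++
            sum += len
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        END {
            if (n == 0) { print "reads=0"; exit }
            printf "reads=%d min=%d mean=%.1f max=%d\n", n, min, sum / n, max
        }'
}

# Example call on one of the exercise files:
# fastq_summary example_A.fq.gz
```

A wide spread between the minimum and maximum length, or a read length that does not match the sequencing platform, is worth noting in your comments.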
2. Quality filtering
Depending on the problem you found in your dataset, you can filter the reads using BBDuk from the BBMap package. Below you find some important commands. If you need more information about BBDuk, just type bbduk.sh -h.
Find a list of all Illumina adapters here
(1) With the following command you trim bases with quality below Q15 from both ends and retain only reads that are at least 100 bp long after trimming.
bbduk.sh -in=example_fq.gz -out=example_trim.fq qtrim=rl trimq=15 minlength=100
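And/or you can, for example, remove reads that contain an adapter (AGAGCACACGTCTGAACTCCAGTCACTGACCAATC) as follows. Without the ktrim option, BBDuk discards matching reads entirely instead of trimming them.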
bbduk.sh -in=example_fq.gz -out=example_trim.fq literal=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC
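If you expect only partial adapter hits (which is most often the case), you can trim the adapter and everything to its right (ktrim=r) and set the kmer length: k is the kmer size used for matching, and mink allows shorter kmers, here down to 11 bp, at the end of the read.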
bbduk.sh -in=example_fq.gz -out=example_trim.fq literal=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC ktrim=r k=23 mink=11
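To see whether adapter trimming is needed, or whether it worked, you can count how many reads still contain the start of the adapter. This is a rough sketch using only standard tools: it finds exact matches only, the helper names cat_fastq and count_adapter_reads are hypothetical, and the 12-base prefix AGAGCACACGTC is simply the beginning of the adapter given above.

```shell
# cat_fastq FILE: print a FASTQ file, whether gzip-compressed (*.gz) or plain.
cat_fastq() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac
}

# count_adapter_reads FILE PREFIX: number of reads whose sequence line
# contains PREFIX as an exact substring (no mismatches allowed).
count_adapter_reads() {
    cat_fastq "$1" | awk -v a="$2" '
        NR % 4 == 2 && index($0, a) > 0 { n++ }   # only look at sequence lines
        END { print n + 0 }'                      # n + 0 prints 0 if nothing matched
}

# Example calls, before and after trimming:
# count_adapter_reads example_fq.gz AGAGCACACGTC
# count_adapter_reads example_trim.fq AGAGCACACGTC
```

If the count after trimming is much lower than before, the adapter was removed from most reads. A few remaining hits can be genuine sequence rather than adapter.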
(2) Now you can rerun fastqc to verify whether the filtering worked.
(3) With other tools there is no need to specify whether you want to remove adapters or do quality trimming; everything is performed in one go. Fastp is still under development but has many powerful options.
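In addition to the FastQC report, it can be useful to check how many reads survived the filtering. The sketch below compares the read counts of the input and output files; it assumes four lines per FASTQ record, and the helper names cat_fastq, count_reads and retained_percent are invented for this example.

```shell
# cat_fastq FILE: print a FASTQ file, whether gzip-compressed (*.gz) or plain.
cat_fastq() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac
}

# count_reads FILE: number of four-line FASTQ records in FILE.
count_reads() {
    cat_fastq "$1" | awk 'END { print int(NR / 4) }'
}

# retained_percent BEFORE AFTER: share of reads that survived filtering.
retained_percent() {
    before=$(count_reads "$1")
    after=$(count_reads "$2")
    awk -v b="$before" -v a="$after" 'BEGIN {
        if (b == 0) { print "no reads in input"; exit }
        printf "%d of %d reads retained (%.1f%%)\n", a, b, 100 * a / b
    }'
}

# Example call with the file names used above:
# retained_percent example_fq.gz example_trim.fq
```

If only a small fraction of the reads is retained, the filter settings (for example minlength) may be too strict for this dataset.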
fastp -i example2_fq.gz -o example2_trim.fq