Quality control and filtering¶

Lecture notes¶

Exercise¶

Please work in groups of two. During the exercise fill your comments directly in the GoogleDoc file using the link here.

The datasets can be downloaded directly using curl in the terminal. But make sure you are in the working directory.

# This is the place were the data will be downloaded: 
pwd
cd your/working/directory # change this accordingly
# download the sequence examples files
curl -O https://gdc-docs.ethz.ch/UniZH/psc_graduate_course_FS_2019/data/sequence_data_examples.zip
# unzip the example files
unzip sequence_data_examples.zip

Now, also download fastq screen examples located at the same place but called: screen_examples.zip

1. Fastq format¶

In the first part of the exercise we have again a look at the fastq files, one of the most important NGS sequence file format. Have a look at the example1_fq.gz file. With the following command you can directly display a zipped file in the terminal.

less example1_fq.gz

(1) How long are the reads? (Hint: just open the file in the terminal)

(2) What is the quality of the first three bases of the second read?

(3) Which internal barcode has been used for sequencing the present sample?

(4) How many reads do we have? (Hint: use a combination of zcat and grep)

(5) Copy the first two sequence in a new file shorten them to 100bp, set the quality of each base to 30 and replace the unique instrument name by N01006. Run fastqc to check your self-made fastq file.

fastqc example1_self.fq

2. Quality filtering¶

In the second part of the exercise we have a collection of different fastq files. The aim is first to check the quality and find out if there is a problem and then filter the fastq file accordingly.

Additional Information to the fastq datasets:

A: example2_A.fq.gz: RNA-seq data, HiSeq4000

B: example2_B.fq.gz: DNA-seq, HiSeq4000

C: example2_C.fq.gz: ddRAD data, HiSeq2500

D: example2_D.fq.gz: Amplicon Sequencing, MiSeq

E: example2_E.fq.gz: Amplicon Sequencing, HiSeq2000

F: example2_F.fq.gz: DNA-seq data, HiSeq2000

G: example2_G.fq.gz: DNA-seq data, unknown

H: example2_H.fq.gz: Metagenomics, Hiseq2500

(6) With to following command you can run fastqc and a html file will be generated.

fastqc example2_A.fq.gz

We have provided the fastq screen outputs already (example2_A_screen.html). You can open the html files in your browser.

(7) Depending on the problem that you have found in your dataset you can filter the reads using BBDuk from the bbmap. Below you find some important commands. If you need more information about bbduk just type

bbduk.sh -h

Find a list of all Illumina adapters here

With the following command you remove bases with quality below Q15 from both sides and retain only reads that are longer than 100 bp.

bbduk.sh -in=example2_fq.gz -out=example2_trim.fq qtrim=rl trimq=15 minlength=100

And/or you can for example remove adapters (AGAGCACACGTCTGAACTCCAGTCACTGACCAATC) as following.

bbduk.sh -in=example2_fq.gz -out=example2_trim.fq literal=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC

If you expect only partial hits with the adapter (is most often the case) you can set a certain kmer length.

bbduk.sh -in=example2_fq.gz -out=example2_trim.fq literal=AGAGCACACGTCTGAACTCCAGTCACTGACCAATC ktrim=r k=23 mink=11

(8) Now you can rerun fastqc to verify if you could filter your reads. How many reads are retained?

Quality control and filtering¶

Lecture notes¶

Exercise¶

1. Fastq format¶

2. Quality filtering¶

Additional information¶

Fastq Sequence Format¶

Data Submission¶

Data Manipulations¶