NGS

FastQ Exercises

Download some example data for the quality control exercise.

## Login to gdcsrv2
ssh -Y studentXX@gdcsrv2.ethz.ch
# -Y enables trusted X11 forwarding.
# -X enables X11 forwarding.

## Prepare Directories
wd="${HOME}/QC"     # My working directory
mkdir -p "${wd}"    # -p: no error if the directory already exists
cd "${wd}"

## Get the zip file 
# -o saves the download under the file name used in the commands below
curl -o sequence_data_examples.zip https://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/NGS_Data_Examples.zip
## Have a look at the file before unzipping it
zipinfo sequence_data_examples.zip

Table: What you should see plus some information for the fastq datasets:

#   File               Description
01  example1_fq.gz     Fastq sample file
02  example2_A.fq.gz   RNA-Seq (HiSeq4000)
03  example2_B.fq.gz   DNA-Seq (HiSeq4000)
04  example2_C.fq.gz   ddRAD data (HiSeq2500)
05  example2_D.fq.gz   Amplicon Sequencing (MiSeq)
06  example2_E.fq.gz   Amplicon Sequencing (HiSeq2000)
07  example2_F.fq.gz   DNA-Seq data (HiSeq2000)
08  example2_G.fq.gz   DNA-Seq data (?)
09  example2_H.fq.gz   Metagenomics (HiSeq2500)

## Make sure the download worked
md5sum sequence_data_examples.zip 
# You should see: 5e0bb4eaaff470adb951a33446064a4b  sequence_data_examples.zip
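
The comparison of the two checksums can also be left to the shell instead of being done by eye. The following is a minimal sketch, assuming GNU `md5sum` is available on the server; the function name `verify_md5` is made up for this illustration.

```shell
# verify_md5 FILE EXPECTED_MD5   (hypothetical helper, not part of any tool)
# Returns 0 if the MD5 checksum of FILE equals EXPECTED_MD5, 1 otherwise.
verify_md5() {
  actual=$(md5sum "$1" | cut -d ' ' -f 1)   # first field is the checksum
  if [ "$actual" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1 (got ${actual})" >&2
    return 1
  fi
}

# Usage with the file and checksum given above:
#   verify_md5 sequence_data_examples.zip 5e0bb4eaaff470adb951a33446064a4b
```

A non-zero return value makes it easy to stop a script before a corrupted download is unzipped.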

## Unzip the file
unzip sequence_data_examples.zip

FastQ Sequence Format

Let's have a look at the example 1 file. We can use the command 'less' on the gzipped file, but cat, tail, and head do not work directly on compressed data. We can use zcat instead.

less example1_fq.gz # get back with [q]
zcat example1_fq.gz | head -n 4
Lines per Record (Read):
Line 1 Header  : @M01072:197:000000000-G181F:1:1101:13528:2035 1:N:0:GAGTGG
Line 2 Sequence: CATACTTGGTTTTCAGACATGGAGTCTAATTCAGATTGCATGGCTTCATGCCATTGCT...
Line 3 Reserved: +
Line 4 Quality : AAAAADFBBBAFGGGFFFFGBFB4FBGBFGHFB5FGH435D522AFA5F5533DD55D...
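
The four-line layout described above can be checked automatically. This is only a sketch: it assumes the file is gzipped and that records are not line-wrapped (exactly four lines per read, as in Illumina output). The function name `check_fastq` is hypothetical.

```shell
# check_fastq FILE.gz   (hypothetical helper)
# Reports the first line that violates the four-line record layout, if any.
check_fastq() {
  gzip -cd "$1" | awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header does not start with @"; bad = 1; exit }
    NR % 4 == 2 { seqlen = length($0) }   # remember the sequence length
    NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator does not start with +"; bad = 1; exit }
    NR % 4 == 0 && length($0) != seqlen { print "line " NR ": quality length differs from sequence length"; bad = 1; exit }
    END {
      if (bad) exit 1
      if (NR % 4 != 0) { print "incomplete record at end of file"; exit 1 }
      print NR / 4 " records look well-formed"
    }'
}

# Usage:
#   check_fastq example1_fq.gz
```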
## Quality encoding
man ascii
# Example (Phred+33 offset) - Char: A -> Dec: 65 -> Quality: 65-33 = 32 -> error probability p = 10^(-32/10) = 0.000631 -> 99.94% accuracy
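
The conversion in the example can be reproduced on the command line. A minimal sketch, assuming the Phred+33 offset used above; the function names `phred33` and `error_prob` are made up for this illustration.

```shell
# phred33 CHAR -> Phred quality score of a single quality character
phred33() {
  ascii=$(printf '%d' "'$1")   # a leading quote makes printf return the character code
  echo $(( ascii - 33 ))       # subtract the Phred+33 offset
}

# error_prob Q -> probability that a base call with quality Q is wrong
error_prob() {
  LC_ALL=C awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}

phred33 A        # prints 32
error_prob 32    # prints 0.000631
```

Characters that the shell treats specially, such as `!` or `*`, should be quoted when passed to `phred33`.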

Questions

  1. How long are the sequences (reads)?
  2. Are all reads from the same run?
  3. How many reads do we have?
Solutions

## (1) You can count or use infoseq:
gunzip -c example1.fastq.gz > example1.fastq   # -c keeps the .gz file, which the zgrep commands below still need
infoseq -only -length example1.fastq | sort -u
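
If infoseq (part of EMBOSS) is not installed, the read lengths can also be tabulated with awk. This is a sketch under the assumption that every record has exactly four lines; `read_lengths` is a hypothetical name.

```shell
# read_lengths   (hypothetical helper)
# Reads uncompressed FASTQ from standard input and prints one row per
# distinct read length: length <TAB> number of reads.
read_lengths() {
  awk 'NR % 4 == 2 { print length($0) }' |   # line 2 of each record is the sequence
    sort -n | uniq -c |
    awk '{ print $2 "\t" $1 }'
}

# Usage:
#   read_lengths < example1.fastq
```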

## (2) The header will tell us
zgrep "^@" example1.fastq.gz
# There are two different instrument IDs: @M01072 / @M01035
zgrep "^@" example1.fastq.gz | cut -d ":" -f 1 | sort -u
# There are three run IDs: 197, 177, and 201
zgrep "^@" example1.fastq.gz | cut -d ":" -f 1,2,3 | sort -u
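
A pattern such as "^@" can also match quality lines, because "@" is a valid quality character. A more cautious variant selects the header lines by position instead. The sketch below assumes four lines per record and Illumina-style headers whose first three colon-separated fields are instrument, run ID, and flow cell; `fastq_runs` is a hypothetical name.

```shell
# fastq_runs FILE.gz   (hypothetical helper)
# Prints the distinct "instrument:run:flowcell" prefixes found in header lines.
fastq_runs() {
  gzip -cd "$1" |
    awk 'NR % 4 == 1' |      # keep only the first line of each record
    cut -d ':' -f 1-3 |
    sort -u
}

# Usage:
#   fastq_runs example1.fastq.gz
```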

## (3) Careful with the grep: quality lines can also start with "@"
zgrep "^@M0" example1.fastq.gz | wc -l
zcat example1.fastq.gz | wc -l # divide by 4
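
The division by four can be done by the shell as well. A minimal sketch, assuming a gzipped file with exactly four lines per read; `count_reads` is a hypothetical name.

```shell
# count_reads FILE.gz   (hypothetical helper)
# Prints the number of reads, i.e. the number of lines divided by four.
count_reads() {
  echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}

# Usage:
#   count_reads example1.fastq.gz
```

Unlike a grep for "@", this count is not affected by quality strings that happen to begin with "@".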

Quality Control

We are using FastQC to inspect the quality of the fastq data.

## Have a look at the options first
fastqc -h

## Check the installed version
fastqc -v

## Start the FastQC graphical interface
#  fastqc
## This works only if an X11 server is running on your computer and
## the -Y or -X option was used during the ssh login.

## Run FastQC
fastqc example2_*.fq.gz
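
By default the reports are written next to the input files. The following sketch collects them in a separate directory instead, using the FastQC options `--outdir` and `--threads`. The wrapper name `run_fastqc`, the directory name, and the thread count of 2 are assumptions for illustration; the directory is created first because FastQC expects it to exist.

```shell
# run_fastqc OUTDIR FILE...   (hypothetical wrapper)
# Creates OUTDIR and runs FastQC on the given files with two threads.
run_fastqc() {
  outdir="$1"
  shift
  mkdir -p "$outdir"
  if command -v fastqc >/dev/null 2>&1; then
    fastqc --outdir "$outdir" --threads 2 "$@"
  else
    echo "fastqc not found in PATH" >&2
    return 127
  fi
}

# Usage:
#   run_fastqc fastqc_reports example2_*.fq.gz
```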

Have a look at the HTML output files. Either download them (e.g. with FileZilla or Cyberduck) and open them in a local browser or, if you have X11 installed and logged in with the -Y option, open a report directly on the server with the following command.

firefox ./example2_A_fastqc.html

Additional information

NGS Movies

Fastq Sequence Format

Data Submission


v191017 Jean-Claude Walser