NGS
FastQ Exercises
Download the example data for the quality control exercise.
    ## Log in to gdcsrv2
    ssh -Y studentXX@gdcsrv2.ethz.ch
    # -Y enables trusted X11 forwarding.
    # -X enables X11 forwarding.

    ## Prepare directories
    mkdir -p "${HOME}/QC"
    wd="${HOME}/QC"   # My working directory
    cd "${wd}"

    ## Get the zip file (saved under the name used by the following commands)
    curl -o sequence_data_examples.zip https://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/NGS_Data_Examples.zip

    ## Have a look at the file before unzipping it
    zipinfo sequence_data_examples.zip
Table: The files you should see in the archive, with a short description of each fastq dataset:
# | File | Description |
---|---|---|
01 | example1_fq.gz | Fastq sample file |
02 | example2_A.fq.gz | RNA-Seq (HiSeq4000) |
03 | example2_B.fq.gz | DNA-Seq (HiSeq4000) |
04 | example2_C.fq.gz | ddRAD data (HiSeq2500) |
05 | example2_D.fq.gz | Amplicon Sequencing (MiSeq) |
06 | example2_E.fq.gz | Amplicon Sequencing (HiSeq2000) |
07 | example2_F.fq.gz | DNA-Seq data (HiSeq2000) |
08 | example2_G.fq.gz | DNA-Seq data (?) |
09 | example2_H.fq.gz | Metagenomics (HiSeq2500) |
    ## Make sure the download worked
    md5sum sequence_data_examples.zip
    # You should see:
    # 5e0bb4eaaff470adb951a33446064a4b  sequence_data_examples.zip

    ## Unzip the file
    unzip sequence_data_examples.zip
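Comparing the checksum by eye is error-prone, so the check can also be scripted. The following is a minimal sketch, assuming GNU `md5sum` is available on the server; `verify_md5` is a hypothetical helper name, not part of the course material.

```shell
#!/usr/bin/env bash
# Sketch: compare a file's MD5 checksum against an expected value.
# verify_md5 is a hypothetical helper, introduced only for illustration.
verify_md5() {
    local file="$1" expected="$2" actual
    # md5sum prints "<checksum>  <file>"; keep only the checksum.
    actual="$(md5sum "${file}" | cut -d ' ' -f 1)"
    if [ "${actual}" = "${expected}" ]; then
        echo "OK: ${file}"
    else
        echo "MISMATCH: ${file} (got ${actual})" >&2
        return 1
    fi
}

# Usage with the checksum given in this exercise:
# verify_md5 sequence_data_examples.zip 5e0bb4eaaff470adb951a33446064a4b \
#     && unzip sequence_data_examples.zip
```

Chaining with `&&` means the archive is only unpacked when the checksum matches.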
FastQ Sequence Format
Let's have a look at the example 1 file. We can use the command 'less' on the gzipped file, but 'cat', 'tail', and 'head' do not work directly on it. We can use 'zcat' instead.
    less example1_fq.gz   # get back with [q]
    zcat example1_fq.gz | head -n 4
    Lines per Record (Read):
    Line 1 Header  : @M01072:197:000000000-G181F:1:1101:13528:2035 1:N:0:GAGTGG
    Line 2 Sequence: CATACTTGGTTTTCAGACATGGAGTCTAATTCAGATTGCATGGCTTCATGCCATTGCT...
    Line 3 Reserved: +
    Line 4 Quality : AAAAADFBBBAFGGGFFFFGBFB4FBGBFGHFB5FGH435D522AFA5F5533DD55D...
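The header on line 1 packs several fields into one string. The sketch below splits it into labelled parts, assuming the headers follow the usual Illumina layout (instrument, run number, flowcell ID, lane, tile, x, y, then read number, filter flag, control number, and index). The function name `parse_header` and the field labels are illustrative choices, not taken from the course material.

```shell
#!/usr/bin/env bash
# Sketch: label the fields of an Illumina-style FASTQ header.
# Assumption: fields are separated by ":" and one space, in the usual
# Illumina order. parse_header is a hypothetical helper name.
parse_header() {
    echo "$1" | awk -F '[: ]' '{
        sub(/^@/, "", $1)   # drop the leading "@" from the instrument ID
        printf "instrument=%s run=%s flowcell=%s lane=%s tile=%s x=%s y=%s read=%s filtered=%s control=%s index=%s\n",
               $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11
    }'
}

parse_header "@M01072:197:000000000-G181F:1:1101:13528:2035 1:N:0:GAGTGG"
```

The first three fields (instrument, run, flowcell) are the ones that matter when checking whether reads come from the same run.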
    ## Quality encoding
    man ascii
    # Example - Char: A -> Dec: 65 -> Quality: 65-33 = 32 -> p = 10^(-32/10) = 0.000631 -> accuracy: 99.94%
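The conversion in the example (character, ASCII code, quality score, error probability) can be reproduced on the command line. This is a small sketch assuming an ASCII offset of 33, as used in the example; `phred_score` and `error_prob` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: convert a quality character to its score and error probability.
# Assumption: ASCII offset of 33, as in the worked example.
phred_score() {
    local code
    # printf with a leading quote returns the character's ASCII code.
    code=$(printf '%d' "'$1")
    echo $(( code - 33 ))
}

error_prob() {
    # p = 10^(-Q/10)
    awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}

q=$(phred_score A)
echo "Quality: ${q}  Error probability: $(error_prob "${q}")"
```

For the character `A` this gives a quality of 32 and an error probability of 0.000631, matching the example.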
Questions
- How long are the sequences (reads)?
- Are all reads from the same run?
- How many reads do we have?
Solutions
    ## (1) You can count or use infoseq (decompress to a new file and keep the .gz):
    zcat example1_fq.gz > example1.fastq
    infoseq -only -length example1.fastq | sort -u

    ## (2) The header will tell us
    zgrep "^@" example1_fq.gz
    # There are two different instrument IDs: @M01072 / @M01035
    zgrep "^@" example1_fq.gz | cut -d ":" -f 1 | sort -u
    # There are three run IDs: 197, 177, and 201
    zgrep "^@" example1_fq.gz | cut -d ":" -f 1,2,3 | sort -u

    ## (3) Careful with the grep: quality lines can also start with "@"
    zgrep "^@M0" example1_fq.gz | wc -l
    zcat example1_fq.gz | wc -l   # divided by 4
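The read count and read lengths can also be derived from the record structure alone, which avoids the problem of quality lines that start with `@`. The following sketch assumes strict four-line records; `count_reads` and `read_lengths` are hypothetical helper names, and `gzip -dc` is used as a portable equivalent of `zcat`.

```shell
#!/usr/bin/env bash
# Sketch: count reads and tabulate read lengths in a gzipped FASTQ file.
# Assumption: every record has exactly four lines.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

read_lengths() {
    # Line 2 of each record is the sequence; print "<count> <length>".
    gzip -dc "$1" | awk 'NR % 4 == 2 { print length($0) }' | sort -n | uniq -c
}

# Usage:
# count_reads example1_fq.gz
# read_lengths example1_fq.gz
```

Counting lines and dividing by four gives the right answer even when `grep "^@"` would overcount.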
Quality Control
We are using FastQC to inspect the quality of the fastq data.
    ## Have a look at the options first
    fastqc -h

    ## Check the installed version
    fastqc -v

    ## Start FastQC with its graphical interface
    # fastqc
    ## This works only if an X11 display is available and
    ## the -Y or -X option was used during the ssh login.

    ## Run FastQC
    fastqc example2_*.fq.gz
Have a look at the html output file. Either download the files (e.g. with FileZilla or Cyberduck) and open them in a local browser or, if you have X11 installed and logged in with option -Y, use the following command.

    firefox ./example2_A_fastqc.html
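With several input files it is easy to miss a report that was never written. The sketch below derives the expected report name from each input file and lists inputs without a report. It assumes FastQC writes `<name>_fastqc.html` next to the input, as with `example2_A.fq.gz` and `example2_A_fastqc.html` above; `report_name` and `missing_reports` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: check that a FastQC report exists for every input file.
# Assumption: reports are named <name>_fastqc.html and are written
# to the same directory as the input file.
report_name() {
    local base
    base=$(basename "$1")
    base="${base%.gz}"      # strip compression suffix
    base="${base%.fq}"      # strip FASTQ suffix (either spelling)
    base="${base%.fastq}"
    echo "${base}_fastqc.html"
}

missing_reports() {
    local f
    for f in "$@"; do
        [ -e "$(dirname "$f")/$(report_name "$f")" ] || echo "$f"
    done
}

# Usage:
# missing_reports example2_*.fq.gz
```

An empty output means every input has a matching report.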
Additional information
NGS Movies
- Illumina Cluster Sequencing
- PacBio Single-Molecule Sequencing
- Oxford Nanopore Sequencing
- BioNano Genomics
Fastq Sequence Format
Data Submission
v191017 Jean-Claude Walser