Next Generation Sequencing - Quality Control

Electropherogram

Lecture Notes

⬇︎ NGS QC


FASTQ Data Files

The FASTQ file is a text-based format that contains both the nucleotide sequence (line 2) and the corresponding quality scores (line 4). The Quality scores (0-40) are encoded with single ASCII character (!-I).

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI       ➞  Encoding Character
 |        |         |         |         | 
 34       43        53        63        73      ➞  ASCII Code
 ⇣        ⇣         ⇣         ⇣         ⇣ 
 1        10        20        30        40      ➞  Q-Score (ASCII Code - 33)
 ⇣        ⇣         ⇣         ⇣         ⇣
 21%      90%       99%       99.9%     99.99%  ➞  Base call accuracy 

FASTQ sequence files can be big. It might contain millions of entries and can be several gigabytes in size. Best strategy is to work with compressed files.

## Working Directory
cd ${HOME}/MDA20/qc
ls -lh C-1_R[12].fastq.gz

# If the files are missing download them again.
# wget "http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/C-1_R1.fastq.gz" 
# wget "http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/C-1_R2.fastq.gz" 

Basics Steps

Before we analyse the data, we should first explore the read files. A better understanding of the data will help us to optimize data preparation and it will provide more confidence.

Completeness
## Test Completeness
zcat C-1_R[12].fastq.gz | head -n 4 # first 4 entry
zcat C-1_R[12].fastq.gz | tail -n 4 # last 4 entry
Encoding Example:
1 @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1
2 GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGTGC...
3 +
4 >>1>>11>11>>1EC?E?CFBFAGFC0GB/CG1EACFE/BFE///AEG1DF122A...
  | |          |
  | |          └─ C ➞ E ➞ 36 Phred Quality Score (Q) ➞ 99.975 Base call accuracy (P)
  | └─ G ➞ 1 ➞ 16 Phred Quality Score (Q) ➞ 97.488 Base call accuracy (P)
  └─ G ➞ > ➞ 29 Phred Quality Score (Q) ➞ 99.874 Base call accuracy (P)

➜ Pos. #1 | Nuc. G | Character Encoding [>]
Q = ascii -s ">" | awk '{print $2-33}' = 29
P = 100-( 10^{-2.9} *100) = 99.874

➜ Pos. #3 | Nuc. G | Character Encoding [1]
Q = ascii -s 1 | awk '{print $2-33}' = 16
P = 100-( 10^{-1.6} *100) = 97.488

➜ Pos. #14 | Nuc. C | Character Encoding [E]
Q = ascii -s E | awk '{print $2-33}' = 36
P = 100-( 10^{-3.6} *100) = 99.975

Read Count

Next, we would like to count the sequences/reads in the fastq files. Similar to the fasta file example, we could count the number of headers using grep with option c. But, instead of counting lines with > we could look for lines with @.

zcat C-1_R1.fastq.gz | grep "@" -c
# N = 203,083

⇒ There are 203,083 reads in the file.

Do you see a problem with this solution?


  ## -----------------------------------------------------------
  ## Test - N(R1) == N(R2)
  ## We have paired-end reads (R1 and R2) and we would expect
  ## to have equal number or reads in both files.
  ## -----------------------------------------------------------

  zcat C-1_R1.fastq.gz | grep "@" -c
  # N = 203,083
  zcat C-1_R2.fastq.gz | grep "@" -c
  # N = 239,369
  
⇒ The count is different for R1 and R2. Something is wrong!
  ## -----------------------------------------------------------
  ## ASCII Table
  ## The ascii character @ is also used to encode phred score 31
  ## and it is possible to count quality encoding lines too.
  ## -----------------------------------------------------------

  ## R code - ASCII-Code Phred-Score Encoding Table
  coderange <- c(33:74)
  asciitable <- data.frame(coderange, coderange-33, row.names = rawToChar(as.raw(coderange), multiple = TRUE))
  colnames(asciitable) <- c("ASCII","PhredScore")
  asciitable
  

Can you find a better (correct) solution to count the number of records in an fastq file?



  ## -----------------------------------------------------------
  ## Solution A - not really a solution :-(
  ## Idea: Fastq headers start with a @.
  ##       We only count lines that start with a @.
  ## -----------------------------------------------------------

  zcat C-1_R1.fastq.gz | grep "^@" -c  # 145,297
  zcat C-1_R2.fastq.gz | grep "^@" -c  # 146,160
  # ⇒  Not much better!

  ## -----------------------------------------------------------
  ## Solution B - not universal
  ## Idea: We extend the search query
  ## -----------------------------------------------------------

  zcat C-1_R1.fastq.gz | head -n 1 # find unique instrument name
  # @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1

  ## Solution B1
  zcat C-1_R1.fastq.gz | grep "@M01072" -c # N = 144,738
  zcat C-1_R2.fastq.gz | grep "@M01072" -c # N = 144,738

  ## Solution B2
  zgrep -c "@M01072" C-1_R1.fastq.gz       # N = 144,738

  # ⇒  This seems to work. Counts for both files are identical.
  # ☛  This is not a universal solution. 

  ## -----------------------------------------------------------
  ## Solution C
  ## Idea: Fastq files have 4 lines per record. 
  ##       The number of line divided by 4 = number of records
  ## -----------------------------------------------------------

  ## Solution C1
  zcat C-1_R1.fastq.gz | wc -l     # 578,952 / 4 = 144,738
  zgrep -Ec "$" C-1_R1.fastq.gz    # 578,952 / 4 = 144,738
  gzip -cd C-1_R1.fastq.gz | wc -l # 578,952 / 4 = 144,738

  # ⇒  This seems to work too. Counts are identical to solution B.
  # ☛  This is a universal solution but we have to some manual work.

  ## Solution C2
  zcat C-1_R1.fastq.gz | echo $((`wc -l`/4)) # N = 144,738

  # ⇒  This seems to work too. Counts are identical to solution B.
  # ☛  This is a universal solution.

  ## -----------------------------------------------------------
  ## Solution D
  ## Idea: Count lines starting (^) and ending ($) with + sign. 
  ## -----------------------------------------------------------

  zgrep -c "^+$" C-1_R1.fastq.gz # N = 144,738

  # ⇒  This seems to work too and it is in agreement with B and C.
  # ☛  Careful: It only works is the separator line was not altered.

  

We have found possible solutions to count the number of reads of a fastq file but are we sure it really works? In other words, do we really have 144,738 reads in this file?

## Subset-Data
zcat C-1_R1.fastq.gz | head -n 400 | grep "@M01072" -c   # N = 100
zcat C-1_R1.fastq.gz | head -n 400 | echo $((`wc -l`/4)) # N = 100
zcat C-1_R1.fastq.gz | head -n 400 | grep -c "^+$"       # N = 100

Tip

If you are not 100-percent sure about your command line, use a simple dummy data set (e.g. extract a small subsample). You could also verify your results by using alternative commands/solutions. This is also a good way to expand your repertoire of commands.

Extract Records

A good subset is often helpful to increase processing speed while exploring your data. For example, you would like to estimate possible PhiX (an internal Illumina standard) contamination in your dataset (Mukherjee et al. 2015). Screening a subset would give you an approximate estimate good enough for you to decided if a PhiXfiltering step is needed.

First we would like to extract a single read.

# Get first read
zcat C-1_R1.fastq.gz | head -n 4 > First_R1.fq

# Get Last read
zcat C-1_R1.fastq.gz | tail -n 4 > Last_R1.fq

# Get a particular read
zgrep -A 3 "@M01072:41:000000000-A942B:1:1101:8091:3906" C-1_R1.fastq.gz > Exract_R1.fq
# > -A: hit and 3 lines after

# Get a random read
zgrep "@M01072" C-1_R1.fastq.gz | sort -R | head -n 1  # e.g. @M01072:41:000000000-A942B:1:1107:11792:27818 1:N:0:1
# sort: -R for random sorting
zgrep -A 3 "@M01072:41:000000000-A942B:1:1107:11792:27818 1:N:0:1" C-1_R1.fastq.gz > Random_R1.fq

# Get a random subset of reads
zgrep "@M01072" C-1_R1.fastq.gz | sort -R | head -n 10 > random10.list
zgrep -A 3 --no-group-separator -f random10.list C-1_R1.fastq.gz > Random_Subset_N10_R1.fq

This might not be a fast solution. We could use seqtk to extract a random subset faster. This works on the GDC server but might not work on your compuer if seqtk in not installed.

## Extract 10 random sequences
seqtk sample C-1_R1.fastq.gz 10 > C-1_R1_subset_10.fq
Find a way to extract 10 random FASTA seqeunces from C-1_R1.fastq.gz.


  ## -----------------------------------------------------------
  ## Solution A
  ## We only need the first two lines from a fastq record and
  ## we have to replace @ with >.
  ## -----------------------------------------------------------

  ## Get first two lines only (header and sequence)
  zgrep -A 1 --no-group-separator -f random10.list C-1_R1.fastq.gz > Random_Subset_N10_R1.tmp
  ## Replace @ with >
  tr "@" ">" < Random_Subset_N10_R1.tmp > Random_Subset_N10_R1.fa
  ## Double-check
  grep ">" -c Random_Subset_N10_R1.fa

  ## -----------------------------------------------------------
  ## Solution B
  ## seqtk can also convert fq to fa
  ## -----------------------------------------------------------

  seqtk seq -a C-1_R1_subset_10.fq > C-1_R1_subset_10.fa
  grep ">" -c C-1_R1_subset_10.fa
  

Playtime

It might seem a waste of time to explore your data before you run a quality control. The better you know your data the better you can understand the QC reportsand decide on possible filtering step. For example, if you have uneven sequening depth and some samples have only a few reads it will help with the QC report interpretation. Another example, you discover that you have a high level of PhiX among your reads you might filter the data prior to QC.

Quality Control

You can use FastQC for a convenient quality control check on the raw sequence data. Be careful though! It might be convenient but also misleading. FastQC is a context dependend analysis (genome sequencing data) and if you are not using an un-expected dataset the interpretion of the report needs to be adjusted.

## ---------------
## QC with FastQC
## ---------------

## Help
fastqc -help

## Version
fastqc -version

## Run FastQC on R1 and R2
fastqc -t 1 --kmers 8 C-1_R?.fastq.gz
# t number of threads

FastQC produces two output files. There is an interactive HTML document, and a zipped file with various outputs. Use your (S)FTP client (e.g. Cyberduck) to download the HTML files and your web browser to open the report. We can also peek inside the zipped folder and open the text-based reports. 

## Explore the zipped Results
unzip -p C-1_R1_fastqc.zip C-1_R1_fastqc/fastqc_data.txt | less
unzip -p C-1_R1_fastqc.zip C-1_R1_fastqc/summary.txt | less

FastQC is a widespread application to better understand your dataset. It is convenient but not perfect. The traffic lights (pass-green, warning-orange, fail-red) can be misleading depending on your dataset. You expect high levels of duplicates for an e.g.  RADseq or AmpSeq dataset.

QC Facts

➼ Subset - It does not make sense to run QC on the entire dataset. Use a subset instead.
➼ Data Structure - QC is based on data structure analysis. Unstructured is not always good.
➼ Adaptors - Make sure you are looking for the right adaptor sequences.
➼ Quality drops - The quality decreases with read length.
➼ R1 vs. R2 - Average quality is higher for the first (forward) read (R1) compare to the second (reverse) read (R2).

RTFM

It is better to know a few applications well than many superficially. Reading the manual is a good starting point to better understand an application. I agree with you. Some manuals (but not all!) are awful to read. Sentences like “use the setting button to change the settings” are not very helpful but unfortunately wide spread. But at least give each manual a chance because the author(s) did spend some time on it. The FastQC manual (like below) is an example of a good manual. You will learn that some modules (e.g. Duplicate Sequences) use "shortcuts" to increase analysis speed. It is important you know and understand this restrictions.

Additional information

Literature