Skip to content

Sequence Quality Control

Learning Objectives

Main
◇ Understand the fastq file format.
◇ Be able to use bash commands to explore fastq files.
Minor
◇ Understand the connection between bascalling accuracy and quality score.
◇ Understand the context-sensitive output of fastQC.

⬇︎ Sequence Quality Control
⬇︎ Run Quality Control


Base-Calling

Sequencing platforms generate raw signals as they read nucleotide sequences. Base calling algorithms analyse the detected signals to determine the sequence of bases (adenine, thymine, cytosine and guanine) in the original DNA (or RNA) sample. Illumina uses fluorescently labelled nucleotides and optical detection. Base calling involves interpreting the intensity and wavelength of the fluorescence signals. Pacific Biosciences (PacBio) uses real-time detection of fluorescently labelled nucleotides. Base calling involves analysing the fluorescence signals emitted during nucleotide incorporation. Oxford Nanopore measures changes in ionic current as nucleotides pass through a nanopore. Base calling translates these changes in current into nucleotide sequences. The translation of the raw signal into nucleotide bases is error prone and the accuracy depends on the algorithm used. The reliability of the base assignment is expressed as a probability and is encoded to reduce size. The sequence and probability are written together in a text-based file format called fastq.

FASTQ File Format

The FASTQ file format is a fixed data format that contains both the nucleotide sequence (line 2) and the corresponding quality scores (line 4).

1: @m54073_190302_105548/4325680/ccs.........................⇽ sequence header
2: TGGATCACTTGTGCAAGCATCACATCGTAGAGCTTGTCCTTGGTCTGCGA........⇽ sequence 
3: +.........................................................⇽ extra line for notes
4: p~~L~jubw~n~V~2~u~[sygSN~g~K~k~~~o~u}q}|is~~~~~~~k........⇽ quality encoding

Quality Score

Quality scores based on Phred are widely used to represent probabilities and confidence levels in the assignment of a base by basecalling software. Quality scores typically range from 0 to 40, although this range can vary. A higher score indicates a higher probability that a particular base call is correct. The Phred quality score (Q) is logarithmically related to the error probability (E), where

Q = {-10*\log_{10}E} \Rightarrow E = {10^{-{Q\over10}}} \Rightarrow P = 1-E.

Quality Encoding

The Quality scores (0-40) are encoded with single ASCII character (!-I).

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI       ➞  Encoding Character
 |        |         |         |         | 
 34       43        53        63        73      ➞  ASCII Code
 ⇣        ⇣         ⇣         ⇣         ⇣ 
 1        10        20        30        40      ➞  Q-Score (ASCII Code - 33)
 ⇣        ⇣         ⇣         ⇣         ⇣
 21%      90%       99%       99.9%     99.99%  ➞  Base call accuracy 

File Size

FASTQ sequence files can be big. It might contain millions of entries and can be several gigabytes in size. Best strategy is to work with compressed files.

FASTQ Case study

Examining the raw read sequencing data is a critical first step in any sequencing project. It ensures that the data is uncorrupted, of high quality and free from technical artefacts, which is essential for obtaining reliable and accurate results in downstream analyses. By addressing potential issues early, you can optimise your analysis pipeline and make informed decisions, ultimately leading to more robust and credible scientific conclusions.

For this example we will limit ourselves to one sample. It was sequenced using Illumina (paird-end) and the data consists of two archived fastq sequence read files.

Go to the working directory and make sure the zipped fastq file is there.

## Working Directory (in case you already have it)
cd ${HOME}/GDA/qc
ls -lh C-1_R[12].fastq.gz

If the files and/or folder are missing, create the folder and/or download the files.

# Go home
cd ${HOME}
# Create working directory
mkdir qc; cd qc
# Download sequence files
curl -O "https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/C-1_R1.fastq.gz" 
curl -O "https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/C-1_R2.fastq.gz" 

Here is some additional information about the files you have downloaded.

# MD5 hash for the two sequence files:
# 4a2e8876742302fad5bd24cba76c3cc6  C-1_R1.fastq.gz
# e0ce4ae266e8b2cedbc9580a3025712a  C-1_R2.fastq.gz 

Command Not Found

The following exercises will only work if all the necessary applications (e.g. seqtk) are installed. If you do not have the additional applications installed and do not want to install them, you can run the labs on the GDC VM server (gdc-vserver.ethz.ch).

Verification

First of all, we need to make sure that the download has worked. Using md5sum on a file you have downloaded from the internet serves several important purposes related to data integrity and security.

## Linux
md5sum C-1_R1.fastq.gz
md5sum C-1_R2.fastq.gz
## Mac
#md5 C-1_R1.fastq.gz
#md5 C-1_R2.fastq.gz
Compare the calculated MD5 hash with the provided MD5 hash. If they match, the file is probably intact and authentic. If they do not match, the file may be corrupt or tampered with.

Corruption & Integrity

When you download a file from the internet, especially large files or software packages, there's a risk that the file will be corrupted during the download process. Running md5sum on the downloaded file and comparing it with the MD5 hash provided will help to ensure that the file has not been corrupted. In addition, by checking the MD5 hash of the downloaded file against the one provided, you can confirm that the file is indeed the original, unmodified version released by the distributor. This helps prevent tampering and ensures that the file has not been altered or infected with malware.

Exploration

The further steps do not follow a prescribed procedure and are freely selectable and expandable. The important thing is to take a closer look at the files or a selection of them.

We start by looking at the first and last four lines. This may help to verify that the sequence records are complete.

## Test Completeness
zcat C-1_R1.fastq.gz | head -n 4 # first 4 entry
zcat C-1_R1.fastq.gz | tail -n 4 # last 4 entry
zcat C-1_R2.fastq.gz | head -n 4 # first 4 entry
zcat C-1_R2.fastq.gz | tail -n 4 # last 4 entry
# Note: In case zcat does not work, try ...
# zcat < C-1_R1.fastq.gz | head -n 4
# zcat < C-1_R1.fastq.gz | tail -n 4
# or
# gzcat C-1_R2.fastq.gz | head -n 4 # first 4 entry
# gzcat C-1_R2.fastq.gz | tail -n 4 # last 4 entry
Here is a reminder of what you are looking at and what it means.

Here is a reminder of what you are looking at and what it means.

  ⎧the unique instrument name
  |       ⎧the run id
  |       |  ⎧the flowcell id
  |       |  |               ⎧flowcell lane
  |       |  |               |
1 @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1
2 GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGTGC...
3 +
4 >>1>>11>11>>1EC?E?CFBFAGFC0GB/CG1EACFE/BFE///AEG1DF122A...
  | |          |
  | |          └─ C ➞ E ➞ 36 Phred Quality Score (Q) ➞ 99.975 Base call accuracy (P)
  | └─ G ➞ 1 ➞ 16 Phred Quality Score (Q) ➞ 97.488 Base call accuracy (P)
  └─ G ➞ > ➞ 29 Phred Quality Score (Q) ➞ 99.874 Base call accuracy (P)
A few character encoding examples: ◦ Pos. #1 | Nuc. G | Character Encoding [>]
Q = ascii -s ">" | awk '{print $2-33}' = 29
E = $10^{-2.9}$ =  0.001258925 ➝ 0.126%
P = $1-( 10^{-2.9})$ = 0.998741075 ➝ 99.874%
◦ Pos. #3 | Nuc. G | Character Encoding [1]
Q = ascii -s 1 | awk '{print $2-33}' = 16
E = $10^{-1.6}$ = 0.025118864 ➝ 2.511%
P = $1-( 10^{-1.6} *100)$ = 0.974881136 ➝ 97.488%
◦ Pos. #14 | Nuc. C | Character Encoding [E]
Q = ascii -s E | awk '{print $2-33}' = 36
E = $10^{-3.6}$ = 0.000251189 ➝ 0.025%
P = $100-( 10^{-3.6} *100)$ = 0.999748811 ➝ 99.975%

Results:

  • The reads are 251nt in length

  • Based on the header information ...

R1 head: @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1
R2 head: @M01072:41:000000000-A942B:1:1101:11853:2457 2:N:0:1

R1 tail: @M01072:41:000000000-A942B:1:2114:18740:28782 1:N:0:1
R2 tail: @M01072:41:000000000-A942B:1:2114:18740:28782 2:N:0:1

... we conclude

(a) the reads are read-pairs. (b) start and end with the same records. © appear to be from the same instrument, same run and same flow cell.

Read Count

Next, we want to count the sequences/reads in the fastq files. It is also important to check that we have the same number of forward (R1) and reverse (R2) reads.

Similar to the fasta file example, we could count the number of headers using grep with option c. But, instead of counting lines with > we could look for lines with @.

We use zcat (or gzcat) to read the compressed file (fastq.gz), because I don't want to uncompress it. There is also a zgrep for compressed files.

## Count lines with @
zcat C-1_R1.fastq.gz | grep "@" -c
# N = 203,083
## Alternatives
zcat < C-1_R1.fastq.gz | grep "@" -c
zgrep -c "@" C-1_R1.fastq.gz

The file C-1_R1.fastq.gz has 203,083 lines which contain a @. Does this number correspond to the number of sequence headers and can we infer the number of reads from it?

Do you see a problem with this solution?


  ## -----------------------------------------------------------
  ## Test - N(R1) == N(R2)
  ## We have paired-end reads (R1 and R2) and we would expect
  ## to have equal number or reads in both files.
  ## -----------------------------------------------------------

  zgrep -c "@" C-1_R1.fastq.gz # N = 203,083
  zgrep -c "@" C-1_R2.fastq.gz # N = 239,369

  zcat C-1_R1.fastq.gz | grep "@" -c # N = 203,083
  zcat C-1_R2.fastq.gz | grep "@" -c # N = 239,369
  
⇒ The count is different for R1 and R2. Something is wrong!
  ## -----------------------------------------------------------
  ## ASCII Table
  ## The ASCII character @ is also used to encode phred score 31
  ## and it is to be feared that we include quality encoding
  ## lines when counting. 
  ## -----------------------------------------------------------

  ## R code - ASCII-Code Phred-Score Encoding Table
  coderange  <- c(33:74)
  asciitable <- data.frame(coderange, coderange-33, row.names = rawToChar(as.raw(coderange), multiple = TRUE))
  colnames(asciitable) <- c("ASCII","PhredScore")
  asciitable
  
## Examples of records with @ in the quality encoding
  zcat < C-1_R1.fastq.gz | grep "?@" --color -B 3
  

Can you find a better (correct) solution to count the number of records in an fastq file?



  ## -----------------------------------------------------------
  ## Solution A - not really a solution :-(
  ## Idea: Fastq headers start with a @.
  ##       We only count lines that start with a @.
  ## -----------------------------------------------------------

  zgrep -c "^@" C-1_R1.fastq.gz # 145,297
  zgrep -c "^@" C-1_R2.fastq.gz # 146,160

  zcat C-1_R1.fastq.gz | grep "^@" -c  # 145,297
  zcat C-1_R2.fastq.gz | grep "^@" -c  # 146,160

  # ⇒  Not much better!

  ## -----------------------------------------------------------
  ## Solution B - not universal
  ## Idea: We extend the search query
  ## -----------------------------------------------------------

  zcat C-1_R1.fastq.gz | head -n 1 # find unique instrument name
  # @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1

  ## Solution B1
  zcat C-1_R1.fastq.gz | grep "@M01072" -c # N = 144,738
  zcat C-1_R2.fastq.gz | grep "@M01072" -c # N = 144,738

  ## Solution B2
  zgrep -c "@M01072" C-1_R1.fastq.gz       # N = 144,738
  zgrep -c "@M01072" C-1_R2.fastq.gz       # N = 144,738

  # ⇒  This seems to work. Counts for both files are identical.
  # ☛  This is not a universal solution. 

  ## -----------------------------------------------------------
  ## Solution C
  ## Idea: Fastq files have 4 lines per record. 
  ##       The number of line divided by 4 = number of records
  ## -----------------------------------------------------------

  ## Solution C1
  zcat C-1_R1.fastq.gz | wc -l       # 578,952 / 4 = 144,738

  ## A few Alternatives:
  # zcat < C-1_R1.fastq.gz | wc -l   # for zsh
  # zgrep -Ec "$" C-1_R1.fastq.gz    # count lines
  # gzip -cd C-1_R1.fastq.gz | wc -l # redirect gzip stdout

  # ⇒  This seems to work too. Counts are identical to solution B.
  # ☛  This is a universal solution but we have to some manual work.

  ## Solution C2
  echo "$(zcat C-1_R1.fastq.gz | wc -l) / 4" | bc    # N = 144,738

  # echo "$(gzcat C-1_R1.fastq.gz | wc -l) / 4" | bc        # zsh alternative 
  # bc <<< "$(wc -l < <(gzcat C-1_R1.fastq.gz)) / 4"        # another one
  # gzcat C-1_R1.fastq.gz | wc -l | awk '{print int($1/4)}' # one more

  # ⇒  This seems to work too. Counts are identical to solution B.
  # ⇒  This is a universal solution.
  # ☛  It is not a fast way to count sequences.

  ## -----------------------------------------------------------
  ## Solution D
  ## Idea: Count lines starting (^) and ending ($) with + sign. 
  ## -----------------------------------------------------------

  zgrep -c "^+$" C-1_R1.fastq.gz # N = 144,738

  # ⇒  This seems to work too and it is in agreement with B and C.
  # ☛  Careful: It only works is the separator line was not altered.

  

We have found possible solutions to count the number of reads of a fastq file but are we sure it really works? In other words, do we really have 144'738 reads in this file?

Dummy Data

If you are not 100% sure about your command line, use a well-defined dummy dataset (e.g. extract a smaller subsample). You know what to expect, so you can test your code. You can also (double) check your results by using alternative commands/solutions. This is also a good way to broaden your command repertoire.

## Subset-Data (first 100 records only)
zcat C-1_R1.fastq.gz | head -n 400 | grep "@" -c          # N = 142
zcat C-1_R1.fastq.gz | head -n 400 | grep "^@" -c         # N = 100
zcat C-1_R1.fastq.gz | head -n 400 | grep "@M01072" -c    # N = 100
zcat C-1_R1.fastq.gz | head -n 400 | grep -c "^+$"        # N = 100

Extract Records > Subset

A good subset is often helpful to increase processing speed when exploring your data. For example, you would like to estimate possible PhiX (an internal Illumina standard) contamination in your dataset (Mukherjee et al. 2015). Screening a subset would give you an approximate estimate, good enough to decide if a PhiX filtering step is needed.

First we would like to extract a single read.

# Get first read
zcat C-1_R1.fastq.gz | head -n 4 > First_R1.fq

# Get Last read
zcat C-1_R1.fastq.gz | tail -n 4 > Last_R1.fq

# Get a particular read
zgrep -A 3 "@M01072:41:000000000-A942B:1:1101:8091:3906" C-1_R1.fastq.gz > Exract_R1.fq
# > -A: hit and 3 lines after

# Get a random read
zgrep "@M01072" C-1_R1.fastq.gz | sort -R | head -n 1  # e.g. @M01072:41:000000000-A942B:1:1107:11792:27818 1:N:0:1
# sort: -R for random sorting
zgrep -A 3 "@M01072:41:000000000-A942B:1:1107:11792:27818 1:N:0:1" C-1_R1.fastq.gz > Random_R1.fq

# Get a random subset of reads
zgrep "@M01072" C-1_R1.fastq.gz | sort -R | head -n 10 > random10.list
zgrep -A 3 --no-group-separator -f random10.list C-1_R1.fastq.gz > Random_Subset_N10_R1.fq

However, this may not be a very fast solution. A possible faster alternative is to use seqtk to extract a random subset. Note that this will work on the GDC server, but will not work on your machine if you do not have seqtk installed.

## Extract 10 random sequences
seqtk sample C-1_R1.fastq.gz 10 > C-1_R1_subset_10.fq
seqtk sample C-1_R2.fastq.gz 10 > C-1_R2_subset_10.fq

Compare the two subsets and pay attention to the read pairing. Can you improve the subsetting and extract the same reads from the forward (R1) and the reverse (R2) sequence file?

Can you find a way to extract a random subset of read pairs?

We use the seed option (-s[[digit]]) to keep read-pairs.


  ## Extract 10 random sequences with seed
  seqtk sample -s123 C-1_R1.fastq.gz 10 > C-1_R1_subset_10_sync.fq
  seqtk sample -s123 C-1_R2.fastq.gz 10 > C-1_R2_subset_10_sync.fq
  

  ## Compare subsets
  grep "@" C-1_R1_subset_10_sync.fq | sort > R1_header.txt
  grep "@" C-1_R2_subset_10_sync.fq | sort > R2_header.txt
  diff R1_header.txt R2_header.txt
  

Find a way to extract 10 random FAST[A] sequences from C-1_R1.fastq.gz.


  ## -----------------------------------------------------------
  ## Solution A
  ## We only need the first two lines from a fastq record and
  ## we have to replace @ with >.
  ## -----------------------------------------------------------

  ## Get first two lines only (header and sequence)
  zgrep -A 1 --no-group-separator -f random10.list C-1_R1.fastq.gz > Random_Subset_N10_R1.tmp
  ## Replace @ with >
  tr "@" ">" < Random_Subset_N10_R1.tmp > Random_Subset_N10_R1.fa
  ## Double-check
  grep ">" -c Random_Subset_N10_R1.fa

  ## -----------------------------------------------------------
  ## Solution B
  ## seqtk can also convert fq to fa
  ## -----------------------------------------------------------

  seqtk seq -a C-1_R1_subset_10.fq > C-1_R1_subset_10.fa
  grep ">" -c C-1_R1_subset_10.fa
  

Playtime - A waste of time or worth your while?

It may seem like a waste of time to explore your data before you perform quality control (QC). I disagree. QC is an important step before filtering or processing data, but it has its limits. The better you understand your data from the start, the more control you have over it. For example, you have uneven sequencing depth and some samples have few reads or failed. This will help you interpret the QC report. Another example is that you notice an uneven number of reads in the R1 and R2 files. This would indicate that your download was incomplete. You discover a high level of PhiX in your subset. You are running a filter before QC. There are many good reasons to play with your data before jumping headlong into analysis.

Quality Control

For qualitative assessment of raw data we use FastQC. It is easy to use and produces a graphically attractive report. However, care should be taken when interpreting the results. FastQC is a contextual analysis and is designed for genomic data. When assessing quality you need to consider the type of data (e.g. RNA-Seq, Amp-Seq, RAD-Seq) and remember that a red light is not necessarily bad and a green light does not mean all is well.

## Working Directory
if [ ! -d "${HOME}/MDA/qc" ]; then
  mkdir -p "${HOME}/MDA/qc"
fi
cd "${HOME}/MDA/qc"

## ---------------
## QC with FastQC
## ---------------

## Help
fastqc -help

## Version
fastqc -version

## Run FastQC on R1 and R2
fastqc -t 1 --kmers 8 C-1_R?.fastq.gz
# t number of threads
What are k-mers?

FastQC produces two output files for each sequence file. There is an interactive HTML document, and a zipped file with various outputs.

C-1_R1_fastqc.html
C-1_R1_fastqc.zip
C-1_R2_fastqc.html
C-1_R2_fastqc.zip

Use your (S)FTP client (e.g. Cyberduck) to download the HTML files and your web browser to open the report. An alternative would be to download using a terminal and the scp command. You will need to open a new (local) terminal window for this. For more information, see the "File Exchange" chapter on the Biocomputing page.

scp guest??@gdc-vserver.ethz.ch:/home/guest??/qc/C-1_R*_fastqc.html ${HOME}/qc
scp guest??@gdc-vserver.ethz.ch:/home/guest??/qc/C-1_R*_fastqc.zip ${HOME}/qc

We can also peek inside the zipped folder and open the text-based reports. 

## Explore the zipped Results
less C-1_R1_fastqc.zip
unzip -p C-1_R1_fastqc.zip C-1_R1_fastqc/summary.txt | less

FastQC has a range of interesting options. On the one hand, you can load zipped Fastq files, but other data formats can also be used (e.g. BAM, fast5). You can also search for costum adapters or target sequences (e.g. contaminants).

QC Facts & Considerations

There are a few things you might consider bevore running a quality control.

➼ Subset - Does it make sense to run QC on the entire dataset? Would a subset also work?
➼ Data Structure - QC is based on data structure analysis. You have to know the structure of your data.
➼ Adaptors - Make sure you are looking for the right adaptor sequences.
➼ Quality Drop - The quality decreases with read length.
➼ R1 vs. R2 - Average quality is higher for the first (forward) read (R1) compare to the second (reverse) read (R2).

RTFM (Read The Fine Manual)

Like all applications, FastQC has limitations and settings that the user needs to be aware of. For this reason I think it is better to know a few applications well than many superficially. Reading the manual is a good starting point to understand an application better. I agree with you. Some manuals (but not all!) are terrible to read. Sentences like "use the settings button to change the settings" are not very helpful, but unfortunately common. But at least give each manual a chance, because the author(s) have spent some time on it. The FastQC manual (link below) is an example of a helpful manual. You will learn that some modules (e.g. Duplicate Sequences) use "shortcuts" to increase the speed of analysis. It is important that you know and understand these limitations.

FastQC with R

Windows and FastQC in R

If you use a Linux emulator FastQC for R does not work. Rqc would be a possible alternative (see below). If you have Ubuntu installed you have to specify the path to the installation.

## Install / Load fastqcr
#install.packages("fastqcr")
library("fastqcr")

## FastQC
fastqc(fq.dir = "${HOME}/qc/", qc.dir = "${HOME}/qc/"))

## FastQC for Windows with Ubuntu 
# fastqcr::fastqc("seq.fq.gz", fastqc.path = "wsl.exe /path_to_wsl_fastqc_install/fastqc")

## QC-Plots
qc.C01.R1 <- qc_read(file = "~/qc/C-1_R1_fastqc.zip", modules = "all", verbose = TRUE)

qc_plot(qc.C01.R1, "Per sequence GC content")
qc_plot(qc.C01.R1, "Per base sequence quality")
qc_plot(qc.C01.R1, "Per sequence quality scores")
qc_plot(qc.C01.R1, "Per base sequence content")
qc_plot(qc.C01.R1, "Sequence duplication levels")

## Aggregate FastQC Reports for Multiple Samples
qc <- qc_aggregate(qc.dir = "~/qc/")

## Summary Report
library(dplyr)
qc %>%
  select(sample, module, status) %>%    
  filter(status %in% c("WARN", "FAIL")) %>%
  arrange(sample)

## Generate Multi QC Report
qc_report(qc.path = "~/qc/", result.file = "/Users/jwalser/qc",
          experiment = "GDA - Test")

More R Packages for QC

Rqc is an optimized tool designed for quality control and assessment of high-throughput sequencing data. It performs parallel processing of entire files and produces a report which contains a set of high-resolution graphics that can be used for quality assessment.

library(Rqc)
# Depends on: BiocParallel, ShortRead, ggplot2
library(ggplot2)
library(BiocParallel)
library(ShortRead)

Rqc::rqc(path = "/path/to/workfolder", pattern = ".fastq.gz")

# Basic Usage:
# rqc(path = folder, pattern = ".fastq.gz")
# folder:  folder with squences
# pattern: regular expression that identifies all files of interest

QC Summary

With MultiQC it is possible to combine individual FastQC reports into one interactive report in HTML format.

QC Screening

FastQ Screen allows you to compare fastq sequences to a number of sequence databases to better understand library composition.

Additional information

Literature