Sequence Quality Control
Learning Objectives
Main
◇ Understand the fastq file format.
◇ Be able to use bash commands to explore fastq files.
Minor
◇ Understand the connection between bascalling accuracy and quality score.
◇ Understand the context-sensitive output of fastQC.
⬇︎ Sequence Quality Control
⬇︎ Run Quality Control
Base-Calling
Sequencing platforms generate raw signals as they read nucleotide sequences. Base calling algorithms analyse the detected signals to determine the sequence of bases (adenine, thymine, cytosine and guanine) in the original DNA (or RNA) sample. Illumina uses fluorescently labelled nucleotides and optical detection. Base calling involves interpreting the intensity and wavelength of the fluorescence signals. Pacific Biosciences (PacBio) uses real-time detection of fluorescently labelled nucleotides. Base calling involves analysing the fluorescence signals emitted during nucleotide incorporation. Oxford Nanopore measures changes in ionic current as nucleotides pass through a nanopore. Base calling translates these changes in current into nucleotide sequences. The translation of the raw signal into nucleotide bases is error prone and the accuracy depends on the algorithm used. The reliability of the base assignment is expressed as a probability and is encoded to reduce size. The sequence and probability are written together in a text-based file format called fastq.
FASTQ File Format
The FASTQ file format is a fixed data format that contains both the nucleotide sequence (line 2) and the corresponding quality scores (line 4).
1: @m54073_190302_105548/4325680/ccs.........................⇽ sequence header
2: TGGATCACTTGTGCAAGCATCACATCGTAGAGCTTGTCCTTGGTCTGCGA........⇽ sequence
3: +.........................................................⇽ extra line for notes
4: p~~L~jubw~n~V~2~u~[sygSN~g~K~k~~~o~u}q}|is~~~~~~~k........⇽ quality encoding
Quality Score
Quality scores based on Phred are widely used to represent probabilities and confidence levels in the assignment of a base by basecalling software. Quality scores typically range from 0 to 40, although this range can vary. A higher score indicates a higher probability that a particular base call is correct. The Phred quality score (Q) is logarithmically related to the error probability (E), where
Quality Encoding
The Quality scores (0-40) are encoded with single ASCII character (!-I).
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI ➞ Encoding Character
| | | | |
34 43 53 63 73 ➞ ASCII Code
⇣ ⇣ ⇣ ⇣ ⇣
1 10 20 30 40 ➞ Q-Score (ASCII Code - 33)
⇣ ⇣ ⇣ ⇣ ⇣
21% 90% 99% 99.9% 99.99% ➞ Base call accuracy
File Size
FASTQ sequence files can be big. It might contain millions of entries and can be several gigabytes in size. Best strategy is to work with compressed files.
FASTQ Case study
Examining the raw read sequencing data is a critical first step in any sequencing project. It ensures that the data is uncorrupted, of high quality and free from technical artefacts, which is essential for obtaining reliable and accurate results in downstream analyses. By addressing potential issues early, you can optimise your analysis pipeline and make informed decisions, ultimately leading to more robust and credible scientific conclusions.
For this example we will limit ourselves to one sample. It was sequenced using Illumina (paird-end) and the data consists of two archived fastq sequence read files.
Go to the working directory and make sure the zipped fastq file is there.
## Working Directory (in case you already have it)
cd ${HOME}/GDA/qc
ls -lh C-1_R[12].fastq.gz
If the files and/or folder are missing, create the folder and/or download the files.
# Go home
cd ${HOME}
# Create working directory
mkdir qc; cd qc
# Download sequence files
curl -O "https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/C-1_R1.fastq.gz"
curl -O "https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/C-1_R2.fastq.gz"
Here is some additional information about the files you have downloaded.
# MD5 hash for the two sequence files:
# 4a2e8876742302fad5bd24cba76c3cc6 C-1_R1.fastq.gz
# e0ce4ae266e8b2cedbc9580a3025712a C-1_R2.fastq.gz
Command Not Found
The following exercises will only work if all the necessary applications (e.g. seqtk) are installed. If you do not have the additional applications installed and do not want to install them, you can run the labs on the GDC VM server (gdc-vserver.ethz.ch).
Verification
First of all, we need to make sure that the download has worked. Using md5sum on a file you have downloaded from the internet serves several important purposes related to data integrity and security.
## Linux
md5sum C-1_R1.fastq.gz
md5sum C-1_R2.fastq.gz
## Mac
#md5 C-1_R1.fastq.gz
#md5 C-1_R2.fastq.gz
Corruption & Integrity
When you download a file from the internet, especially large files or software packages, there's a risk that the file will be corrupted during the download process. Running md5sum on the downloaded file and comparing it with the MD5 hash provided will help to ensure that the file has not been corrupted. In addition, by checking the MD5 hash of the downloaded file against the one provided, you can confirm that the file is indeed the original, unmodified version released by the distributor. This helps prevent tampering and ensures that the file has not been altered or infected with malware.
Exploration
The further steps do not follow a prescribed procedure and are freely selectable and expandable. The important thing is to take a closer look at the files or a selection of them.
We start by looking at the first and last four lines. This may help to verify that the sequence records are complete.
## Test Completeness
zcat C-1_R1.fastq.gz | head -n 4 # first 4 entry
zcat C-1_R1.fastq.gz | tail -n 4 # last 4 entry
zcat C-1_R2.fastq.gz | head -n 4 # first 4 entry
zcat C-1_R2.fastq.gz | tail -n 4 # last 4 entry
# Note: In case zcat does not work, try ...
# zcat < C-1_R1.fastq.gz | head -n 4
# zcat < C-1_R1.fastq.gz | tail -n 4
# or
# gzcat C-1_R2.fastq.gz | head -n 4 # first 4 entry
# gzcat C-1_R2.fastq.gz | tail -n 4 # last 4 entry
Here is a reminder of what you are looking at and what it means.
Here is a reminder of what you are looking at and what it means.
⎧the unique instrument name
| ⎧the run id
| | ⎧the flowcell id
| | | ⎧flowcell lane
| | | |
1 @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1
2 GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGTGC...
3 +
4 >>1>>11>11>>1EC?E?CFBFAGFC0GB/CG1EACFE/BFE///AEG1DF122A...
| | |
| | └─ C ➞ E ➞ 36 Phred Quality Score (Q) ➞ 99.975 Base call accuracy (P)
| └─ G ➞ 1 ➞ 16 Phred Quality Score (Q) ➞ 97.488 Base call accuracy (P)
└─ G ➞ > ➞ 29 Phred Quality Score (Q) ➞ 99.874 Base call accuracy (P)
Q = ascii -s ">" | awk '{print $2-33}' = 29
E = $10^{-2.9}$ = 0.001258925 ➝ 0.126%
P = $1-( 10^{-2.9})$ = 0.998741075 ➝ 99.874%
Q = ascii -s 1 | awk '{print $2-33}' = 16
E = $10^{-1.6}$ = 0.025118864 ➝ 2.511%
P = $1-( 10^{-1.6} *100)$ = 0.974881136 ➝ 97.488%
Q = ascii -s E | awk '{print $2-33}' = 36
E = $10^{-3.6}$ = 0.000251189 ➝ 0.025%
P = $100-( 10^{-3.6} *100)$ = 0.999748811 ➝ 99.975%
Results:
-
The reads are
251nt
in length -
Based on the header information ...
R1 head: @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1
R2 head: @M01072:41:000000000-A942B:1:1101:11853:2457 2:N:0:1
R1 tail: @M01072:41:000000000-A942B:1:2114:18740:28782 1:N:0:1
R2 tail: @M01072:41:000000000-A942B:1:2114:18740:28782 2:N:0:1
... we conclude
(a) the reads are read-pairs. (b) start and end with the same records. © appear to be from the same instrument, same run and same flow cell.
Read Count
Next, we want to count the sequences/reads in the fastq files. It is also important to check that we have the same number of forward (R1) and reverse (R2) reads.
Similar to the fasta file example, we could count the number of headers using grep
with option c
. But, instead of counting lines with >
we could look for lines with @
.
We use zcat
(or gzcat
) to read the compressed file (fastq.gz), because I don't want to uncompress it. There is also a zgrep
for compressed files.
## Count lines with @
zcat C-1_R1.fastq.gz | grep "@" -c
# N = 203,083
## Alternatives
zcat < C-1_R1.fastq.gz | grep "@" -c
zgrep -c "@" C-1_R1.fastq.gz
The file C-1_R1.fastq.gz
has 203,083 lines which contain a @
. Does this number correspond to the number of sequence headers and can we infer the number of reads from it?
Do you see a problem with this solution?
## -----------------------------------------------------------
## Test - N(R1) == N(R2)
## We have paired-end reads (R1 and R2) and we would expect
## to have equal number or reads in both files.
## -----------------------------------------------------------
zgrep -c "@" C-1_R1.fastq.gz # N = 203,083
zgrep -c "@" C-1_R2.fastq.gz # N = 239,369
zcat C-1_R1.fastq.gz | grep "@" -c # N = 203,083
zcat C-1_R2.fastq.gz | grep "@" -c # N = 239,369
⇒ The count is different for R1 and R2. Something is wrong!
## -----------------------------------------------------------
## ASCII Table
## The ASCII character @ is also used to encode phred score 31
## and it is to be feared that we include quality encoding
## lines when counting.
## -----------------------------------------------------------
## R code - ASCII-Code Phred-Score Encoding Table
coderange <- c(33:74)
asciitable <- data.frame(coderange, coderange-33, row.names = rawToChar(as.raw(coderange), multiple = TRUE))
colnames(asciitable) <- c("ASCII","PhredScore")
asciitable
## Examples of records with @
in the quality encoding
zcat < C-1_R1.fastq.gz | grep "?@" --color -B 3
Can you find a better (correct) solution to count the number of records in an fastq file?
## -----------------------------------------------------------
## Solution A - not really a solution :-(
## Idea: Fastq headers start with a @.
## We only count lines that start with a @.
## -----------------------------------------------------------
zgrep -c "^@" C-1_R1.fastq.gz # 145,297
zgrep -c "^@" C-1_R2.fastq.gz # 146,160
zcat C-1_R1.fastq.gz | grep "^@" -c # 145,297
zcat C-1_R2.fastq.gz | grep "^@" -c # 146,160
# ⇒ Not much better!
## -----------------------------------------------------------
## Solution B - not universal
## Idea: We extend the search query
## -----------------------------------------------------------
zcat C-1_R1.fastq.gz | head -n 1 # find unique instrument name
# @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1
## Solution B1
zcat C-1_R1.fastq.gz | grep "@M01072" -c # N = 144,738
zcat C-1_R2.fastq.gz | grep "@M01072" -c # N = 144,738
## Solution B2
zgrep -c "@M01072" C-1_R1.fastq.gz # N = 144,738
zgrep -c "@M01072" C-1_R2.fastq.gz # N = 144,738
# ⇒ This seems to work. Counts for both files are identical.
# ☛ This is not a universal solution.
## -----------------------------------------------------------
## Solution C
## Idea: Fastq files have 4 lines per record.
## The number of line divided by 4 = number of records
## -----------------------------------------------------------
## Solution C1
zcat C-1_R1.fastq.gz | wc -l # 578,952 / 4 = 144,738
## A few Alternatives:
# zcat < C-1_R1.fastq.gz | wc -l # for zsh
# zgrep -Ec "$" C-1_R1.fastq.gz # count lines
# gzip -cd C-1_R1.fastq.gz | wc -l # redirect gzip stdout
# ⇒ This seems to work too. Counts are identical to solution B.
# ☛ This is a universal solution but we have to some manual work.
## Solution C2
echo "$(zcat C-1_R1.fastq.gz | wc -l) / 4" | bc # N = 144,738
# echo "$(gzcat C-1_R1.fastq.gz | wc -l) / 4" | bc # zsh alternative
# bc <<< "$(wc -l < <(gzcat C-1_R1.fastq.gz)) / 4" # another one
# gzcat C-1_R1.fastq.gz | wc -l | awk '{print int($1/4)}' # one more
# ⇒ This seems to work too. Counts are identical to solution B.
# ⇒ This is a universal solution.
# ☛ It is not a fast way to count sequences.
## -----------------------------------------------------------
## Solution D
## Idea: Count lines starting (^) and ending ($) with + sign.
## -----------------------------------------------------------
zgrep -c "^+$" C-1_R1.fastq.gz # N = 144,738
# ⇒ This seems to work too and it is in agreement with B and C.
# ☛ Careful: It only works is the separator line was not altered.
We have found possible solutions to count the number of reads of a fastq file but are we sure it really works? In other words, do we really have 144'738 reads in this file?
Dummy Data
If you are not 100% sure about your command line, use a well-defined dummy dataset (e.g. extract a smaller subsample). You know what to expect, so you can test your code. You can also (double) check your results by using alternative commands/solutions. This is also a good way to broaden your command repertoire.
## Subset-Data (first 100 records only)
zcat C-1_R1.fastq.gz | head -n 400 | grep "@" -c # N = 142
zcat C-1_R1.fastq.gz | head -n 400 | grep "^@" -c # N = 100
zcat C-1_R1.fastq.gz | head -n 400 | grep "@M01072" -c # N = 100
zcat C-1_R1.fastq.gz | head -n 400 | grep -c "^+$" # N = 100
Extract Records > Subset
A good subset is often helpful to increase processing speed when exploring your data. For example, you would like to estimate possible PhiX (an internal Illumina standard) contamination in your dataset (Mukherjee et al. 2015). Screening a subset would give you an approximate estimate, good enough to decide if a PhiX filtering step is needed.
First we would like to extract a single read.
# Get first read
zcat C-1_R1.fastq.gz | head -n 4 > First_R1.fq
# Get Last read
zcat C-1_R1.fastq.gz | tail -n 4 > Last_R1.fq
# Get a particular read
zgrep -A 3 "@M01072:41:000000000-A942B:1:1101:8091:3906" C-1_R1.fastq.gz > Exract_R1.fq
# > -A: hit and 3 lines after
# Get a random read
zgrep "@M01072" C-1_R1.fastq.gz | sort -R | head -n 1 # e.g. @M01072:41:000000000-A942B:1:1107:11792:27818 1:N:0:1
# sort: -R for random sorting
zgrep -A 3 "@M01072:41:000000000-A942B:1:1107:11792:27818 1:N:0:1" C-1_R1.fastq.gz > Random_R1.fq
# Get a random subset of reads
zgrep "@M01072" C-1_R1.fastq.gz | sort -R | head -n 10 > random10.list
zgrep -A 3 --no-group-separator -f random10.list C-1_R1.fastq.gz > Random_Subset_N10_R1.fq
However, this may not be a very fast solution. A possible faster alternative is to use seqtk
to extract a random subset. Note that this will work on the GDC server, but will not work on your machine if you do not have seqtk
installed.
## Extract 10 random sequences
seqtk sample C-1_R1.fastq.gz 10 > C-1_R1_subset_10.fq
seqtk sample C-1_R2.fastq.gz 10 > C-1_R2_subset_10.fq
Compare the two subsets and pay attention to the read pairing. Can you improve the subsetting and extract the same reads from the forward (R1) and the reverse (R2) sequence file?
Can you find a way to extract a random subset of read pairs?
We use the seed option (-s[[digit]]
) to keep read-pairs.
## Extract 10 random sequences with seed
seqtk sample -s123 C-1_R1.fastq.gz 10 > C-1_R1_subset_10_sync.fq
seqtk sample -s123 C-1_R2.fastq.gz 10 > C-1_R2_subset_10_sync.fq
## Compare subsets
grep "@" C-1_R1_subset_10_sync.fq | sort > R1_header.txt
grep "@" C-1_R2_subset_10_sync.fq | sort > R2_header.txt
diff R1_header.txt R2_header.txt
Find a way to extract 10 random FAST[A] sequences from C-1_R1.fastq.gz.
## -----------------------------------------------------------
## Solution A
## We only need the first two lines from a fastq record and
## we have to replace @ with >.
## -----------------------------------------------------------
## Get first two lines only (header and sequence)
zgrep -A 1 --no-group-separator -f random10.list C-1_R1.fastq.gz > Random_Subset_N10_R1.tmp
## Replace @ with >
tr "@" ">" < Random_Subset_N10_R1.tmp > Random_Subset_N10_R1.fa
## Double-check
grep ">" -c Random_Subset_N10_R1.fa
## -----------------------------------------------------------
## Solution B
## seqtk can also convert fq to fa
## -----------------------------------------------------------
seqtk seq -a C-1_R1_subset_10.fq > C-1_R1_subset_10.fa
grep ">" -c C-1_R1_subset_10.fa
Playtime - A waste of time or worth your while?
It may seem like a waste of time to explore your data before you perform quality control (QC). I disagree. QC is an important step before filtering or processing data, but it has its limits. The better you understand your data from the start, the more control you have over it. For example, you have uneven sequencing depth and some samples have few reads or failed. This will help you interpret the QC report. Another example is that you notice an uneven number of reads in the R1 and R2 files. This would indicate that your download was incomplete. You discover a high level of PhiX in your subset. You are running a filter before QC. There are many good reasons to play with your data before jumping headlong into analysis.
Quality Control
For qualitative assessment of raw data we use FastQC. It is easy to use and produces a graphically attractive report. However, care should be taken when interpreting the results. FastQC is a contextual analysis and is designed for genomic data. When assessing quality you need to consider the type of data (e.g. RNA-Seq, Amp-Seq, RAD-Seq) and remember that a red light is not necessarily bad and a green light does not mean all is well.
## Working Directory
if [ ! -d "${HOME}/MDA/qc" ]; then
mkdir -p "${HOME}/MDA/qc"
fi
cd "${HOME}/MDA/qc"
## ---------------
## QC with FastQC
## ---------------
## Help
fastqc -help
## Version
fastqc -version
## Run FastQC on R1 and R2
fastqc -t 1 --kmers 8 C-1_R?.fastq.gz
# t number of threads
FastQC produces two output files for each sequence file. There is an interactive HTML document, and a zipped file with various outputs.
C-1_R1_fastqc.html
C-1_R1_fastqc.zip
C-1_R2_fastqc.html
C-1_R2_fastqc.zip
Use your (S)FTP client (e.g. Cyberduck) to download the HTML files and your web browser to open the report. An alternative would be to download using a terminal and the scp
command. You will need to open a new (local) terminal window for this. For more information, see the "File Exchange" chapter on the Biocomputing page.
scp guest??@gdc-vserver.ethz.ch:/home/guest??/qc/C-1_R*_fastqc.html ${HOME}/qc
scp guest??@gdc-vserver.ethz.ch:/home/guest??/qc/C-1_R*_fastqc.zip ${HOME}/qc
We can also peek inside the zipped folder and open the text-based reports.
## Explore the zipped Results
less C-1_R1_fastqc.zip
unzip -p C-1_R1_fastqc.zip C-1_R1_fastqc/summary.txt | less
FastQC
has a range of interesting options. On the one hand, you can load zipped Fastq files, but other data formats can also be used (e.g. BAM, fast5). You can also search for costum adapters or target sequences (e.g. contaminants).
QC Facts & Considerations
There are a few things you might consider bevore running a quality control.
➼ Subset - Does it make sense to run QC on the entire dataset? Would a subset also work?
➼ Data Structure - QC is based on data structure analysis. You have to know the structure of your data.
➼ Adaptors - Make sure you are looking for the right adaptor sequences.
➼ Quality Drop - The quality decreases with read length.
➼ R1 vs. R2 - Average quality is higher for the first (forward) read (R1) compare to the second (reverse) read (R2).
RTFM (Read The Fine Manual)
Like all applications, FastQC has limitations and settings that the user needs to be aware of. For this reason I think it is better to know a few applications well than many superficially. Reading the manual is a good starting point to understand an application better. I agree with you. Some manuals (but not all!) are terrible to read. Sentences like "use the settings button to change the settings" are not very helpful, but unfortunately common. But at least give each manual a chance, because the author(s) have spent some time on it. The FastQC manual (link below) is an example of a helpful manual. You will learn that some modules (e.g. Duplicate Sequences) use "shortcuts" to increase the speed of analysis. It is important that you know and understand these limitations.
FastQC with R
Windows and FastQC in R
If you use a Linux emulator FastQC for R does not work. Rqc
would be a possible alternative (see below). If you have Ubuntu installed you have to specify the path to the installation.
## Install / Load fastqcr
#install.packages("fastqcr")
library("fastqcr")
## FastQC
fastqc(fq.dir = "${HOME}/qc/", qc.dir = "${HOME}/qc/"))
## FastQC for Windows with Ubuntu
# fastqcr::fastqc("seq.fq.gz", fastqc.path = "wsl.exe /path_to_wsl_fastqc_install/fastqc")
## QC-Plots
qc.C01.R1 <- qc_read(file = "~/qc/C-1_R1_fastqc.zip", modules = "all", verbose = TRUE)
qc_plot(qc.C01.R1, "Per sequence GC content")
qc_plot(qc.C01.R1, "Per base sequence quality")
qc_plot(qc.C01.R1, "Per sequence quality scores")
qc_plot(qc.C01.R1, "Per base sequence content")
qc_plot(qc.C01.R1, "Sequence duplication levels")
## Aggregate FastQC Reports for Multiple Samples
qc <- qc_aggregate(qc.dir = "~/qc/")
## Summary Report
library(dplyr)
qc %>%
select(sample, module, status) %>%
filter(status %in% c("WARN", "FAIL")) %>%
arrange(sample)
## Generate Multi QC Report
qc_report(qc.path = "~/qc/", result.file = "/Users/jwalser/qc",
experiment = "GDA - Test")
More R Packages for QC
Rqc
is an optimized tool designed for quality control and assessment of high-throughput sequencing data. It performs parallel processing of entire files and produces a report which contains a set of high-resolution graphics that can be used for quality assessment.
library(Rqc)
# Depends on: BiocParallel, ShortRead, ggplot2
library(ggplot2)
library(BiocParallel)
library(ShortRead)
Rqc::rqc(path = "/path/to/workfolder", pattern = ".fastq.gz")
# Basic Usage:
# rqc(path = folder, pattern = ".fastq.gz")
# folder: folder with squences
# pattern: regular expression that identifies all files of interest
QC Summary
With MultiQC it is possible to combine individual FastQC reports into one interactive report in HTML format.
QC Screening
FastQ Screen allows you to compare fastq sequences to a number of sequence databases to better understand library composition.
Additional information
Literature
- Mukherjee et al. (2015) Large-scale contamination of microbial isolate genomes by Illumina PhiX control
- Understanding Illumina Quality Scores