Sequence Quality Control

Electropherogram

Lecture Notes

⬇︎ QC

Base-Calling

The measured signals (e.g. fluorescent light, electrical current changes) must be assigned to nucleotide bases. This step is called base calling. The accuracy of base calling is crucial for the quality of DNA sequencing. There are several base-calling methods that differ in accuracy and speed. With Illumina, base calling is (usually) a part of the sequencing workflow and image analysis, base calling, and base call quality is done automatically (e.g. MSR - MiSeq Reporter Software). The results are fasta sequences with quality information (fastq). PacBio is providing (base-called) BAM files that can be converted into fastq files. Oxford Nanopore Technology (ONT) is different. For example, MinKNOW produces FAST5 (HDF5) files and real-time base-calling (online) can be enabled for immediate sequence access if needed. The produced FAST5 files contain raw signal data that can be used for onboard base calling and demultiplexing with e.g. Guppy. The translation of the raw signal into nucleotide bases is error-prone. The reliability of the assignment is expressed as a probability and coded to save space. The sequence and the probability are wept together in a text-based file format called fastq.

FASTQ File Format

The FASTQ file format is a fixed data format that contains both the nucleotide sequence (line 2) and the corresponding quality scores (line 4).

@m54073_190302_105548/4325680/ccs.........................⇽ sequence header
TGGATCACTTGTGCAAGCATCACATCGTAGAGCTTGTCCTTGGTCTGCGA........⇽ sequence 
+.........................................................⇽ extra line for notes
p~~L~jubw~n~V~2~u~[sygSN~g~K~k~~~o~u}q}|is~~~~~~~k........⇽ quality encoding

Quality Score

Phred based quality score are widely used to represent probabilities and confidence scores in the assignment of a base by the base-caller software. The quality scores range from 0 (2) to 40, with some variations in the range. A higher score shows a higher probability that a particular base call is correct. The Phred quality score (Q) is logarithmically related to the error probability (E).

$Q = {-10*\log_{10}E} \Rightarrow E = {10^{-{Q\over10}}}.$

Quality Encoding

The Quality scores (0-40) are encoded with single ASCII character (!-I).

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI       ➞  Encoding Character
 |        |         |         |         | 
 34       43        53        63        73      ➞  ASCII Code
 ⇣        ⇣         ⇣         ⇣         ⇣ 
 1        10        20        30        40      ➞  Q-Score (ASCII Code - 33)
 ⇣        ⇣         ⇣         ⇣         ⇣
 21%      90%       99%       99.9%     99.99%  ➞  Base call accuracy

Get Ready

File Size

FASTQ sequence files can be big. It might contain millions of entries and can be several gigabytes in size. Best strategy is to work with compressed files.

Go to the working directory and make sure the zipped fastq file is there.

## Working Directory
cd ${HOME}/MDA21/qc
ls -lh C-1_R[12].fastq.gz
# If the files and/or folder are missing create the folder and/or download the files.
# # Go home
# cd ${HOME}
# # Create working directory
# mkdir qc; cd qc
# # Download sequence files
# wget "http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA21/data/C-1_R1.fastq.gz" 
# wget "http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA21/data/C-1_R2.fastq.gz"

Basics Steps

Before we analyse the data, we should first explore the read files. A better understanding of the data will help us to optimize data preparation and it will provide more confidence.

Completeness

Have a look at the first and last four lines.

## Test Completeness
zcat C-1_R[12].fastq.gz | head -n 4 # first 4 entry
zcat C-1_R[12].fastq.gz | tail -n 4 # last 4 entry

Encoding Example:

1 @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1
2 GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGTGC...
3 +
4 >>1>>11>11>>1EC?E?CFBFAGFC0GB/CG1EACFE/BFE///AEG1DF122A...
  | |          |
  | |          └─ C ➞ E ➞ 36 Phred Quality Score (Q) ➞ 99.975 Base call accuracy (P)
  | └─ G ➞ 1 ➞ 16 Phred Quality Score (Q) ➞ 97.488 Base call accuracy (P)
  └─ G ➞ > ➞ 29 Phred Quality Score (Q) ➞ 99.874 Base call accuracy (P)

➜ Pos. #1 | Nuc. G | Character Encoding [>]
Q = ascii -s ">" | awk '{print $2-33}' = 29
P = $100-( 10^{-2.9} *100)$ = 99.874

➜ Pos. #3 | Nuc. G | Character Encoding [1]
Q = ascii -s 1 | awk '{print $2-33}' = 16
P = $100-( 10^{-1.6} *100)$ = 97.488

➜ Pos. #14 | Nuc. C | Character Encoding [E]
Q = ascii -s E | awk '{print $2-33}' = 36
P = $100-( 10^{-3.6} *100)$ = 99.975

Read Count

Next, we would like to count the sequences/reads in the fastq files. It is also important to check if we have equal number of forward (R1) and reverse reads (R2). Similar to the fasta file example, we could count the number of headers using grep with option c. But, instead of counting lines with > we could look for lines with @.

zcat C-1_R1.fastq.gz | grep "@" -c
# N = 203,083

⇒ There are 203,083 reads in the file.

Do you see a problem with this solution?


  ## -----------------------------------------------------------
  ## Test - N(R1) == N(R2)
  ## We have paired-end reads (R1 and R2) and we would expect
  ## to have equal number or reads in both files.
  ## -----------------------------------------------------------

  zcat C-1_R1.fastq.gz | grep "@" -c
  # N = 203,083
  zcat C-1_R2.fastq.gz | grep "@" -c
  # N = 239,369

⇒ The count is different for R1 and R2. Something is wrong!

  ## -----------------------------------------------------------
  ## ASCII Table
  ## The ascii character @ is also used to encode phred score 31
  ## and it is possible to count quality encoding lines too.
  ## -----------------------------------------------------------

  ## R code - ASCII-Code Phred-Score Encoding Table
  coderange <- c(33:74)
  asciitable <- data.frame(coderange, coderange-33, row.names = rawToChar(as.raw(coderange), multiple = TRUE))
  colnames(asciitable) <- c("ASCII","PhredScore")
  asciitable

Can you find a better (correct) solution to count the number of records in an fastq file?



  ## -----------------------------------------------------------
  ## Solution A - not really a solution :-(
  ## Idea: Fastq headers start with a @.
  ##       We only count lines that start with a @.
  ## -----------------------------------------------------------

  zcat C-1_R1.fastq.gz | grep "^@" -c  # 145,297
  zcat C-1_R2.fastq.gz | grep "^@" -c  # 146,160
  # ⇒  Not much better!

  ## -----------------------------------------------------------
  ## Solution B - not universal
  ## Idea: We extend the search query
  ## -----------------------------------------------------------

  zcat C-1_R1.fastq.gz | head -n 1 # find unique instrument name
  # @M01072:41:000000000-A942B:1:1101:11853:2457 1:N:0:1

  ## Solution B1
  zcat C-1_R1.fastq.gz | grep "@M01072" -c # N = 144,738
  zcat C-1_R2.fastq.gz | grep "@M01072" -c # N = 144,738

  ## Solution B2
  zgrep -c "@M01072" C-1_R1.fastq.gz       # N = 144,738

  # ⇒  This seems to work. Counts for both files are identical.
  # ☛  This is not a universal solution. 

  ## -----------------------------------------------------------
  ## Solution C
  ## Idea: Fastq files have 4 lines per record. 
  ##       The number of line divided by 4 = number of records
  ## -----------------------------------------------------------

  ## Solution C1
  zcat C-1_R1.fastq.gz | wc -l     # 578,952 / 4 = 144,738
  zgrep -Ec "$" C-1_R1.fastq.gz    # 578,952 / 4 = 144,738
  gzip -cd C-1_R1.fastq.gz | wc -l # 578,952 / 4 = 144,738

  # ⇒  This seems to work too. Counts are identical to solution B.
  # ☛  This is a universal solution but we have to some manual work.

  ## Solution C2
  zcat C-1_R1.fastq.gz | echo $((`wc -l`/4)) # N = 144,738

  # ⇒  This seems to work too. Counts are identical to solution B.
  # ☛  This is a universal solution.

  ## -----------------------------------------------------------
  ## Solution D
  ## Idea: Count lines starting (^) and ending ($) with + sign. 
  ## -----------------------------------------------------------

  zgrep -c "^+$" C-1_R1.fastq.gz # N = 144,738

  # ⇒  This seems to work too and it is in agreement with B and C.
  # ☛  Careful: It only works is the separator line was not altered.

We have found possible solutions to count the number of reads of a fastq file but are we sure it really works? In other words, do we really have 144,738 reads in this file?

Tip

If you are not 100-percent sure about your command line, use a well defined dummy data set (e.g. extract a smaller subsample). You could also verify your results by using alternative commands/solutions. This is also a good way to expand your repertoire of commands.

## Subset-Data
zcat C-1_R1.fastq.gz | head -n 400 | grep "@" -c
zcat C-1_R1.fastq.gz | head -n 400 | grep "^@" -c
zcat C-1_R1.fastq.gz | head -n 400 | grep "@M01072" -c   # N = 100
zcat C-1_R1.fastq.gz | head -n 400 | echo $((`wc -l`/4)) # N = 100
zcat C-1_R1.fastq.gz | head -n 400 | grep -c "^+$"       # N = 100

Extract Records

A good subset is often helpful to increase processing speed while exploring your data. For example, you would like to estimate possible PhiX (an internal Illumina standard) contamination in your dataset (Mukherjee et al. 2015). Screening a subset would give you an approximate estimate good enough for you to decided if a PhiX-filtering step is needed.

First we would like to extract a single read.

# Get first read
zcat C-1_R1.fastq.gz | head -n 4 > First_R1.fq

# Get Last read
zcat C-1_R1.fastq.gz | tail -n 4 > Last_R1.fq

# Get a particular read
zgrep -A 3 "@M01072:41:000000000-A942B:1:1101:8091:3906" C-1_R1.fastq.gz > Exract_R1.fq
# > -A: hit and 3 lines after

# Get a random read
zgrep "@M01072" C-1_R1.fastq.gz | sort -R | head -n 1  # e.g. @M01072:41:000000000-A942B:1:1107:11792:27818 1:N:0:1
# sort: -R for random sorting
zgrep -A 3 "@M01072:41:000000000-A942B:1:1107:11792:27818 1:N:0:1" C-1_R1.fastq.gz > Random_R1.fq

# Get a random subset of reads
zgrep "@M01072" C-1_R1.fastq.gz | sort -R | head -n 10 > random10.list
zgrep -A 3 --no-group-separator -f random10.list C-1_R1.fastq.gz > Random_Subset_N10_R1.fq

This might not be a fast solution. We could use seqtk to extract a random subset faster. This works on the GDC server but will not work on your computer if seqtk in not installed.

## Extract 10 random sequences
seqtk sample C-1_R1.fastq.gz 10 > C-1_R1_subset_10.fq
seqtk sample C-1_R2.fastq.gz 10 > C-1_R2_subset_10.fq

Compare the two subsets and pay attention to the read pairing. Can you improve the subsetting and extract the same reads from the forward (R1) and the reverse R2 file?

Can you find a way to extract a random subset of read pairs?

We use the seed option (`-s[[digit]]`) to keep read-pairs.


  ## Extract 10 random sequences with seed
  seqtk sample -s123 C-1_R1.fastq.gz 10 > C-1_R1_subset_10_sync.fq
  seqtk sample -s123 C-1_R2.fastq.gz 10 > C-1_R2_subset_10_sync.fq

Find a way to extract 10 random FAST[A] sequences from C-1_R1.fastq.gz.


  ## -----------------------------------------------------------
  ## Solution A
  ## We only need the first two lines from a fastq record and
  ## we have to replace @ with >.
  ## -----------------------------------------------------------

  ## Get first two lines only (header and sequence)
  zgrep -A 1 --no-group-separator -f random10.list C-1_R1.fastq.gz > Random_Subset_N10_R1.tmp
  ## Replace @ with >
  tr "@" ">" < Random_Subset_N10_R1.tmp > Random_Subset_N10_R1.fa
  ## Double-check
  grep ">" -c Random_Subset_N10_R1.fa

  ## -----------------------------------------------------------
  ## Solution B
  ## seqtk can also convert fq to fa
  ## -----------------------------------------------------------

  seqtk seq -a C-1_R1_subset_10.fq > C-1_R1_subset_10.fa
  grep ">" -c C-1_R1_subset_10.fa

Playtime - A waste of time or worth your while?

It might seem a waste of time to explore your data before you run quality control (QC). I beg to differ. QC is an important step prior to the data filtering or processing but it has limits. The better you understand your data from the beginning, the more control you have over your data. For example, you have uneven sequencing depth and some samples have only a few reads or failed. This will help you with the QC report interpretation. Another example, you notice an uneven number of reads in the R1 and R2 files. This would show that your download was incomplete. You discover a high level of PhiX in your subset. You run a filter prior to QC. There are more good reasons to play with your data.

Quality Control

For the qualitative assessment of the raw data we use FastQC. It is easy to use and produces a graphically appealing report. However, one should be careful when interpreting the results. FastQC is a contextual analysis and is designed for genomic data. When assessing the quality, one has to consider the data type (e.g. RNASeq, AmpSeq, RADseq). A red traffic light is not necessarily bad and a green traffic light is not necessarily good.

## ---------------
## QC with FastQC
## ---------------

## Help
fastqc -help

## Version
fastqc -version

## Run FastQC on R1 and R2
fastqc -t 1 --kmers 8 C-1_R?.fastq.gz
# t number of threads

FastQC produces two output files per sequence file. There is an interactive HTML document, and a zipped file with various outputs.

C-1_R1_fastqc.html
C-1_R1_fastqc.zip
C-1_R2_fastqc.html
C-1_R2_fastqc.zip

Use your (S)FTP client (e.g. Cyberduck) to download the HTML files and your web browser to open the report. An alternative would be a download with the terminal and the scp command. For this you have to open a new (local) terminal window.

scp student??@gdcsrv2.ethz.ch:/gdc_home/student??/qc/C-1_R*_fastqc.html ${HOME}/qc
scp student??@gdcsrv2.ethz.ch:/gdc_home/student??/qc/C-1_R*_fastqc.zip ${HOME}/qc

We can also peek inside the zipped folder and open the text-based reports.

## Explore the zipped Results
less C-1_R1_fastqc.zip
unzip -p C-1_R1_fastqc.zip C-1_R1_fastqc/summary.txt | less

FastQC has a range of interesting options. On the one hand, you can load zipped Fastq files, but other data formats are also read (e.g. BAM, fast5). One can search specifically for known sequences (e.g. impurities or adpators).

QC Facts & Considerations

There are a few things you might consider bevore running a quality control.

➼ Subset - Does it make sense to run QC on the entire dataset? Would a subset also work?
➼ Data Structure - QC is based on data structure analysis. You have to know the struture of you data.
➼ Adaptors - Make sure you are looking for the right adaptor sequences.
➼ Quality Drop - The quality decreases with read length.
➼ R1 vs. R2 - Average quality is higher for the first (forward) read (R1) compare to the second (reverse) read (R2).

RTFM

Like all applications, FastQC has limits and setting a user has to be aware of. For this reason, i think it is better to know a few applications well than many superficially. Reading the manual is a good starting point to better understand an application. I agree with you. Some manuals (but not all!) are awful to read. Sentences like “use the setting button to change the settings” are not very helpful but unfortunately wide spread. But at least give each manual a chance because the author(s) did spend some time on it. The FastQC manual (like below) is an example of a good manual. You will learn that some modules (e.g. Duplicate Sequences) use "shortcuts" to increase analysis speed. It is important you know and understand this restrictions.

FASTQC with R

## Install / Load fastqcr
#install.packages("fastqcr")
library("fastqcr")

## QC-Plots
qc.C01.R1 <- qc_read(file = "~/qc/C-1_R1_fastqc.zip", modules = "all", verbose = TRUE)

qc_plot(qc.C01.R1, "Per sequence GC content")
qc_plot(qc.C01.R1, "Per base sequence quality")
qc_plot(qc.C01.R1, "Per sequence quality scores")
qc_plot(qc.C01.R1, "Per base sequence content")
qc_plot(qc.C01.R1, "Sequence duplication levels")

## Aggregate FastQC Reports for Multiple Samples
qc <- qc_aggregate(qc.dir = "~/qc/")

## Summary Report
library(dplyr)
qc %>%
  select(sample, module, status) %>%    
  filter(status %in% c("WARN", "FAIL")) %>%
  arrange(sample)

## Generate Multi QC Report
qc_report(qc.path = "~/qc/", result.file = "/Users/jwalser/qc",
          experiment = "GDA21 - Test")

Additional information

Literature

Mukherjee et al. (2015) Large-scale contamination of microbial isolate genomes by Illumina PhiX control

ReadTheFine Manual