NGS Introduction

Lecture notes¶

NGS Sequence File Format (Raw Data)¶

Illumina sequence reads are usually provided as de-multiplexed fastq (i.e. fastq.gz) files. PacBio sequences reads are either provided as fastq, circular consensus sequences (ccs) fasta or BAM files. CCS files can be considered error-corrected sequence reads. Shorter sequence fragments are read multiple time (circled) during a PacBio run and with each pass the accuracy increases. The BAM file can be converted to fastq format using the PacBio BAM recipes. Oxford Nanopore (ONT) raw data are provided as fast5 files and need to be basecalled using e.g. Guppy to obtain fastq files.

FASTQ has emerged as a common file format for sharing sequencing read data combining both the sequence and an associated per base quality score. For each sample you should have the corresponding FASTQ file (Illumina single-end, PacBio and Nanopore data), and, for a Illumina paired-end run, you should have two FASTQ files (i.e. R1 and R2) per sample.

# Single Reads (SR)
sample1.fastq.gz

# Paired-end Reads (PE)
sample1_R1.fastq.gz
sample1_R2.fastq.gz

Compressed Fastq Files - A Word of Advise¶

Fastq sequences should be compressed to reduce storage space (cost)! It does not take too much time to decompress it if needed. You also need compressed files to submit to the Nucleotide Sequence Archive. Remember, never decompress the read data in the project directory. Use scratch space or your working directory which are not backed up. Please also consider that many program (including FASTQC) can handle compressed files. So no need to decompress the files after all.

Linux include various commands for compressing and de-compressing files.

## Get the file
wget "http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA19/data/C-1_R1.fastq.gz" 
wget "http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA19/data/C-1_R2.fastq.gz" 

## Create a test file
zcat C-1_R1.fastq.gz | head -n 40 > subset_C-1_R1.fq
head -n 4 subset_C-1_R1.fq
zcat C-1_R2.fastq.gz | head -n 40 > subset_C-1_R2.fq
tail -n 4 subset_C-1_R2.fq

## Using GNU zip
gzip subset_C-1_R1.fq      # creates subset_C-1_R1.fq.gz
gunzip subset_C-1_R1.fq.gz # unzip file

## Learn more about gzip
man gzip. # get manual for gzip

## tar-Archiv (tar archives - tarball)
tar cfvz subset_C-1_R12.fq.tar.gz subset_C-1_R1.fq subset_C-1_R2.fq
tar xfvz subset_C-1_R12.fq.tar.gz

# tar: uncompressed archive file
# gz : file (archive or not) compressed using gzip

Q1 What terminal command(s) can you use to peek inside a compressed sequence file without decompressing the file?

Solution #1 - Please think before you click!

## Solution #1a - 
zcat C-1_R1.fastq.gz | head -n 4
## Solution #1b - 
gzip -cd C-1_R1.fastq.gz | head -n 4

Q2 What is the compression rate for a gz file compared to the unzipped file?

Solution #2 - Please think before you click!

## File size of gz file => 25M
ll -h C-1_R1.fastq.gz

## File size unzipped => 72M (compression rate ~3x)
gunzip -c C-1_R1.fastq.gz > C-1_R1.fastq
ll -h C-1_R1.fastq # human readable file size

## Get some information about the archive file
gzip -l C-1_R1.fastq.gz
## Calculate size of uncompressed file
gzip -cd C-1_R1.fastq.gz | wc -c

Warning

Be careful with unzipping file(s). Make sure there is enough disk space (e.g. df -h) and ask yourself if you still need the archive file(s) once the data is extracted.

FASTQ Data Format¶

Each entry (read/sequence) in a FASTQ files consists of 4 lines:

A sequence identifier with information about the sequencing run and the cluster.
The sequence (the base calls; A, C, T, G and N).
A separator, which is simply a plus (+) sign.
The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.

FASTQ files can contain millions of entries and can be several gigabytes in size, which often makes them too large to open in text editor. Generally, it is not necessary to view FASTQ files, because they are intermediate output files used as input for tools that perform downstream analysis, such as alignment to a reference or de novo assembly. We can use the terminal to manage FASTQ files without open them.

head -n 4 subset_C-1_R1.fq    # displays the first four lines - first sequence
tail -n 4 subset_C-1_R1.fq    # displays last sequence record
grep "^@" -c subset_C-1_R1.fq # counts all lines starting with @

Q3: What could be the problem if you use the grep command above to count the number of records?

Solution #3

The ascii character **@** is also used to encode phred score 31 and it is possible to count quality encoding lines too.

## R code - ASCII-Code Phred-Score Encoding Table
coderange <- c(33:74)
asciitable <- data.frame(coderange, coderange-33, row.names=rawToChar(as.raw(coderange), multiple=TRUE))
colnames(asciitable) <- c("ASCII","PhredScore")
asciitable

Q4: Can you find a better solution to count the number of records in an fastq file?

Solution #4

## Count number of records - !- NOT correct -!
grep "^@" -c C-1_R1.fastq # n = 145297

## Solution #4a
wc -l C-1_R1.fastq # 578952 / 4 = 144738 (number of line divided by 4 = number of records)

## Solution #4b
cat C-1_R1.fastq | echo $((`wc -l`/4)) # n = 144738

## Solution #4c
head -n 1 sequence.f*q # find unique instrument name
# **@M01072**:41:000000000-A942B:1:1101:11853:2457 1:N:0:1
grep "^@M01072" -c C-1_R1.fastq # n=144738

## Solution #4d
grep "^+" -cw C-1_R1.fastq # start and ends with a + # n = 144738

Tip

If you are not 100-percent sure about your command line use a simple dummy data set, extract a small subsample and double check by hand. You could also verify your results by using alternative commands. This is also a good way to expand your repertoire of commands.

Q5 Find a solution to extract a particular read from the fastq file and safe it as a fastq text file.

Solution #5

# Get: @M01072:41:000000000-A942B:1:1101:8091:3906
grep "@M01072:41:000000000-A942B:1:1101:8091:3906" -w C-1_R1.fastq -A 3 > read.fastq

Q6 Find a solution to extract 10 random reads from the fastq file.

Solution #6

grep "^@" C-1_R1.fastq | sort -R | head -n 10 > r10.list
grep -f r10.list -A 3 --no-group-separator C-1_R1.fastq > r10.fq
# sort -R for random sorting
# after the list grep, records are spaced by --, I use second grep to remove them

Tip

There are faster ways (e.g. UEARCH:fastx_getseqs, seqtk:subseq) to extract random subsamples from fastx files.

Nucleotide Read Archive Submission¶

In order to publish your study you have to give people access to your raw data. Meaning you have to upload the (archived) raw reads to a sequence read archive (e.g. ENA or NCBI) to make it available to the public. In order to verify the data transfer a fingerprint (checksum) has to be generated for each file submitted.

Exercise #1

Generate md5sum for both sequence files. The command md5sum calculates a message-digest fingerprint (checksum) for a file.

md5sum subset_C-1_R1.fq subset_C-1_R2.fq

Exercise #2

Create a copy of the fastq file and generate a fingerprint for both copies.

cp subset_C-1_R1.fq subset_C-1_R1_copy.fq
md5sum subset_C-1_R1.fq subset_C-1_R1_copy.fq

Exercise #3

Make a small change (e.g. delete the last character or introduce a new line (\n) at the end) in the ex2_B.fq and compare the fingerprints again.

# e.g. remove last character of the file 
cat subset_C-1_R1_copy.fq | sed '$ s/.$//' > subset_C-1_R1_copy_new.fq
tail -n 1 subset_C-1_R1_copy.fq
tail -n 1 subset_C-1_R1_copy_new.fq
md5sum subset_C-1_R1.fq subset_C-1_R1_copy.fq subset_C-1_R1_copy_new.fq

Note

md5sum is not universal for all OS! For Mac bash terminal see man md5.

NGS Introduction

Lecture notes¶

NGS Sequence File Format (Raw Data)¶

Compressed Fastq Files - A Word of Advise¶

FASTQ Data Format¶

Nucleotide Read Archive Submission¶

Links¶

Software Links¶

Nucleotide Archives¶

Sequencing Technology Links¶