Next Generation Sequencing - Introduction

biocomicals

Lecture Notes

⬇︎ NGS Introduction


Rapid advancements in modern DNA sequencing technologies have (and still does) revolutionize biological and medical science.

"The major factors, such as increasing applications of NGS, speed, cost, and accuracy, efficient replacement for traditional technologies, and drug discovery applications demanding NGS technology are expected to drive the growth of the overall market." (ResearchAndMarkets.com)

The ease to produce data has shifted the principal research focus from loci to genomes, increased sample numbers, requires data management strategies, and changed data analysis. As a result, nucleotide sequence archives (ENA, NCBI) are growing daily.

ENA - Reads Growth (2020)

ENA regularly collectis several statistics regarding data growth.

Unfortunately, the high-throughput-sequencing-hype has also removed the research question and researchers have started to sequences everything just because we can. It is paramount to understand the pros and cons of each technology to optimise the sequencing design according to the question(s) and resources at hand.

Sequencing Technology Explained

The suggested movies below will help you better understand the different sequencing technologies. The selection is by no means comprehensive, but it will cover the basics. 

Comparison

There are many detailed reviews available that compare the different NGS technologies. The problem, they are outmoded the moment they are published. Sequence technology is moving fast and it is difficult to keep up with the latest development. Following a more general and personal comparison as an approximate guide to NGS.

Platfom Technology Data Template DNA Lib-Prep Indexing Length Error
Illumina established lots low-high all easy easy short moderate
PacBio established less high high-mol difficult difficult long high-low
ONT experimental less higher high-mol easy difficult long high

The table does not include all technonogies but only the ones the GDC most often handels. The table also ignores solutions for specific applications. We at the GDC believe that the question and all the resources available (not only money) should determin the sequencing technology. There is no point of aiming for long reads if you have not enough and/or highly fragmented starting DNA.

NGS Sequence File Format (Raw Data)

Illumina sequence reads are usually provided as de-multiplexed fastq (i.e. fastq.gz) files. PacBio sequences reads are either provided as fastq, circular consensus sequences (ccs) fasta or BAM files. CCS files can be considered error-corrected sequence reads. Shorter sequence fragments are read multiple time (circled) during a PacBio run and with each pass the accuracy increases. The BAM file can be converted to fastq format using the PacBio BAM recipes. Oxford Nanopore (ONT) raw data are provided as fast5 files and need to be basecalled using e.g. Guppy to obtain fastq files.

PacBio CCS

FASTQ has emerged as a common file format for sharing sequencing read data combining both the sequence and an associated per base quality score. For each sample you should have the corresponding FASTQ file (Illumina single-end (SR), PacBio and Nanopore data). For a Illumina paired-end (PE) run, you should have two FASTQ files (i.e. R1 and R2) per sample.

Nucleotide Archives

It is important that you share your data and maybe some other resources with the scientific community. Submit your sequencing raw data to any of the three data repositories. The European Nucleotide Archive (ENA), the GenBank / Sequence Read Archive (SRA) at NCBI, and the DNA DataBank of Japan (DDBJ) are all part of the International Nucleotide Sequence Database Collaboration and exchange data daily. There are other non-profit resource collection such as e.g. DYRAD to promote data publishing, curation, and preservation of all data types.

Challenges

For the following challenges, I assume you are familiar with (NCBI) Blast searches. There are NCBI blast tutorials and NCBI webinars available to get started or to learn more.

First, we generate two 29nt-long random DNA sequences with a GC content of 50% and 20%. For this purpose you could use either the Random DNA generator provided by the Maduro lab at UCR or simply do it by hand.


Random DNA Generator


Possible sequence example:

>Seq1_GC50
CGTAAGGCCATTGCGAATACCAGGTATCG
>Seq2_GC20
AAACTGTTAAAAAATCGTGTCTTTACAAT

Challenge #1: What is the text-based format representing the two nucleotide sequences called?

Solution #1

This is a sequence in FASTA format. Files containing nucleotide sequences carry the suffix fa (seq.fa) or fasta (seq.fasta). If the sequences are aligned the file ending is afa (alignment.afa). See help below to read more about sequence fasta format.

Challenge #2: How many possibilities are there to build a 29nt long sequences? In other words, what is the likelihood that we create the exact same sequences twice (with the same CG content)?

Solution #2

We have 4 possibilities (nucleotides) for each of the 29 sites, resulting in 4^29 = 2.88e17 possible nucleotide sequences. The likelihood is pretty slim!

Add the following "unknown" sequence to the previous two randomly generated sequences.

>Seq3_MysterySequence
AATGATACGGCGACCACCGAGATCTACAC

Challenge #3: What is the GC content of the unknown mystery sequences?

Solution #3

It is ~52% (15/29) and similar to random sequence #1. You could also use comseq from the EMBOSS package with word size 1 to calcualte nucleotides frequencies. See help below for more information.

Next, we search the largest available nucleotide sequence collection to look for similar known sequences. For this purpose, we blast all three sequences against the NCBI nt BLAST-Database. What are your expectations?

My expectations are ...

Seq1: It is a well balanced (GC conten 0.5) random sequence and therfore we expect no perfect hits. The sequence is, however, short and the database is rather large and one-hit wonders cannot be fully excluded. Seq2: There is a good chance that we get some good hits because of the lower complexity. Complexity is an important factor in alinments. Seq3: Not sure. It seems that random blast hits depend on sequence length and complexit (among other factors). If it is a radom sequences, I would also expect no prefect hits similar to radom sequence #1. If there are multiple hits the sequences might not be random at all.

Interpret your Blast results carefully and compare it with your expectations. Do you have any idea what the unknown sequence #3 could be?

Blast results for sequences #3?

The blast result shows multiple perfect hits (100% similarity and 100% coverage). According to the blast results the unknown sequence could be a barn owl, a brown snake, a carp, or a bacterium and even a virus.

Do you have any good explanation for the blast results? Do you agree that it is not a bonanza but something real? Could it be an ultra-conserved genome element (UCE) or is it an artefacts? Maybe the internet can help? The sequence is short enough to be used in a Google search.

What is sequences #3?

It is not an UCE but more likely the result of careless NGS data preparation. The sequences is a perfect match with oligonucleotides used in Illumina (Nextera) library preparation. It is part of an adaptor-sequences you add to your target DNA fragments in order to sequence it. NGS quality reports should indicate adaptor contaminations and there are many tools freely available to remove such artificially introduced sequences prior to assembly.

Have a look at the Illumina adaptor sequences (link below) and blast a few more. You will see that the adaptor-contamination problem is widespread.

Tip

For the correct data preparation and analysis of your NGS data, it is important to know the detail about sample preparation and the sequence technology used. The more you know, the better you understand your data and possible artefacts.

Help

Nucleotide Archives

Reading / Watching