
Short on Length, Long on Insight

While longer reads can enhance taxonomic resolution in amplicon sequencing, their effectiveness depends on the chosen marker gene or region. Additionally, longer reads tend to have higher error rates, which can affect both clustering and annotation.

Longer Reads Improve Resolution - But Not Always

Long reads (e.g., PacBio or Oxford Nanopore full-length 16S rRNA) generally provide better resolution than short reads (e.g., Illumina V3-V4). However, some taxa remain indistinguishable even with full-length sequencing. For instance, the Shigella-Escherichia complex has nearly identical 16S rRNA sequences, making it unresolvable. Similarly, certain fungal ITS regions may lack sufficient variation to differentiate closely related species.

Higher Error Rates Impact Clustering and Annotation

While long reads capture more sequence information, they also introduce higher error rates, especially with nanopore sequencing. This can lead to errors in clustering operational taxonomic units (OTUs) or amplicon sequence variants (ASVs), ultimately affecting annotation accuracy.

Consider Amplicon Coverage and Database Limitations

Reference databases are often gene-specific. For example, ITS databases typically focus on the ITS1-ITS2 region and exclude LSU (28S) or SSU (18S) sequences. If sequencing captures flanking non-target regions, trimming or extracting only the relevant region can improve annotation accuracy.

Kinnex: Boosting Long-Read Amplicon Sequencing

PacBio's Kinnex enhances long-read amplicon sequencing by concatenating multiple amplicons into a single sequencing molecule, multiplying the number of HiFi reads obtained per SMRT Cell. Combined with a streamlined library preparation, this makes it easier to generate high-quality data from complex microbial communities at scale, supporting high-resolution analysis of target regions for applications requiring precise taxonomic identification and variant detection. By improving the scalability and cost-efficiency of long-read sequencing, Kinnex delivers better results for metagenomics and other amplicon-based studies.


Wobble Bases: The Good, The Bad, and The Degenerate

In primer design for amplicon-based sequencing, wobble bases (or degenerate sites) are often used to increase the range of species to which a given primer can bind. These bases allow for the incorporation of multiple nucleotide options at a given position in the primer, which can accommodate genetic variability between different species or strains.

The main advantage of using wobble bases is that they can increase the universality of primers. This is particularly valuable when working with complex or diverse microbial communities, where the target sequence may vary slightly between species. Wobble bases help ensure that primers bind to as many target sequences as possible, making the PCR process more inclusive and increasing the chances of successful amplification from a wide range of organisms.
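In code, a degenerate primer is simply a compact encoding of a set of concrete oligos. The sketch below expands a primer written with IUPAC codes into every variant it represents, using the widely used 515F 16S primer (GTGYCAGCMGCCGCGGTAA) as an example:

```python
from itertools import product

# IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_primer(primer: str) -> list[str]:
    """Enumerate every concrete oligo encoded by a degenerate primer."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in primer))]

# 515F carries two degenerate positions (Y and M), so the mix
# contains 2 x 2 = 4 distinct oligos:
variants = expand_primer("GTGYCAGCMGCCGCGGTAA")
```

The number of variants is the product of the options at each degenerate position, which is why adding wobble bases quickly inflates the complexity of the primer mix.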

However, the use of wobble bases can present challenges. One of the main issues is that degenerate sites reduce primer specificity: if multiple bases are allowed at a given position, the primer may bind to non-target sequences or to several homologous loci with slightly different sequences. This can lead to the amplification of non-specific products that complicate downstream analysis, and these non-specific amplicons may cause misinterpretation of results or require additional steps to remove them.

Another issue is that the use of wobble bases can affect primer efficiency. The melting temperature (Tm) of the primer may be less predictable when using degenerate bases, as each base combination may affect the stability of the primer-template duplex differently. This may result in less efficient binding, reduced PCR amplification or even failure to amplify.
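The spread in Tm across a degenerate primer's variants can be estimated by enumerating them and applying a simple rule of thumb. The sketch below uses the crude Wallace rule (2 °C per A/T, 4 °C per G/C) on a hypothetical primer; real designs would use nearest-neighbour thermodynamics instead.

```python
from itertools import product

# IUPAC codes needed for this example
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "N": "ACGT"}

def wallace_tm(oligo: str) -> int:
    """Rough Wallace-rule Tm: 2 degC per A/T, 4 degC per G/C."""
    return sum(2 if b in "AT" else 4 for b in oligo)

def tm_range(primer: str) -> tuple[int, int]:
    """Min and max Tm over all concrete variants of a degenerate primer."""
    tms = [wallace_tm("".join(v))
           for v in product(*(IUPAC[b] for b in primer))]
    return min(tms), max(tms)

# A hypothetical primer with three degenerate sites; only N changes
# the A/T vs G/C balance, so the Tm spreads over a 2 degC window:
low, high = tm_range("ACWGTSCANGT")
```

Even this toy calculation shows that different variants of the same primer can have different optimal annealing conditions, so a single annealing temperature is always a compromise across the mix.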

In addition, the frequency of nucleotides at wobble base sites can vary between primer batches, which can lead to inconsistent primer performance. This variability can introduce bias in certain applications, such as metagenomic studies, where the relative abundance of different taxa may be distorted due to primer composition.

Moreover, if too many wobble bases are included, the number of primer variants increases significantly, making the overall primer design more complex. This can lead to challenges in managing primer consistency and efficiency across experiments, as well as complicating the optimisation process.


PCR: The Fine Art of Perfecting Your Amplification

The "garbage in, garbage out" (GIGO) principle is particularly relevant to amplicon-based sequencing. A poorly prepared library cannot be rescued by sequencing, making every step of the process critical. The first targeted PCR is often crucial for success - perfect primers alone won't guarantee good results if the PCR conditions are not optimised. Annealing temperature, cycle number, reagent concentrations and polymerase choice all influence the outcome. Careful in silico planning is valuable, but at some point optimisation must move into the lab. Only by empirically testing and fine-tuning conditions can you achieve the best possible sequencing results.

Choosing the Right DNA Polymerase for Amplicon Sequencing

Selecting the right DNA polymerase is essential for accurate and efficient amplification. Fidelity is critical, as high-error polymerases can introduce sequencing artefacts, especially in applications such as variant detection. Processivity and speed affect yield and efficiency, while tolerance to inhibitors is important when working with complex environmental samples.

Polymerases also differ in their exonuclease activities, which affect error correction and primer-template interactions. Those with 3'→5' exonuclease (proofreading) activity improve accuracy by removing mismatched bases, but may reduce amplification efficiency for certain targets. The same proofreading activity can also chew back mismatched primer 3' ends, potentially enabling extension from otherwise non-matching sites (false priming) and inefficient amplification. Finally, whether a polymerase produces blunt or A-tailed (sticky) ends affects downstream applications such as cloning or library preparation. Careful consideration of these factors will ensure reliable, high-quality sequencing results.

Why Optimizing PCR Conditions Matters

Optimising PCR isn't just about getting a product - it's about getting the right product. While in silico PCR predictions help set expectations and guide primer design, the actual amplification conditions must be tailored to your specific setup. There's no universal protocol that works for every experiment; success depends on how PCR reagents, thermocycler performance and cycling parameters interact.

Two key factors - annealing temperature and extension time - play a critical role in specificity and accuracy. Lower annealing temperatures increase the risk of non-specific binding, as primers may anneal to partially matching sequences. Long extension times, especially with proofreading (3'→5' exonuclease) polymerases, can increase the likelihood of false priming by allowing mismatched primer 3' ends to be chewed back and then extended. To minimise artefacts and obtain high-quality, reproducible results, it's important to experiment with variables such as cycle number, Mg²⁺ concentration and polymerase selection.

Why More PCR Cycles Can Be Problematic

Running extra PCR cycles to increase product yield may seem like a simple solution, but it carries several risks. One major issue is amplification bias - some fragments amplify more efficiently than others, leading to overrepresentation of certain sequences while others are lost. This is particularly problematic in amplicon sequencing, where maintaining an unbiased representation of the original template is crucial.
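The compounding effect of per-cycle efficiency differences can be made concrete with a one-line model. Assuming two templates start at equal abundance and amplify with the illustrative (made-up) efficiencies below, their ratio after n cycles is ((1 + e1) / (1 + e2))^n:

```python
def fold_distortion(eff_a: float, eff_b: float, cycles: int) -> float:
    """Ratio of template A copies to template B copies after `cycles`
    rounds of PCR, assuming both start at the same copy number and
    amplify with constant per-cycle efficiencies eff_a and eff_b."""
    return ((1 + eff_a) / (1 + eff_b)) ** cycles

# 95% vs 80% per-cycle efficiency: a modest difference, but after
# 25 cycles a 1:1 input is skewed to roughly 7:1
skew = fold_distortion(0.95, 0.80, 25)
```

This exponential drift is why fewer cycles (and, as discussed below, pooling several low-cycle reactions) preserve the relative abundances of the original template far better than pushing a single reaction to high cycle numbers.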

Another concern is the accumulation of PCR errors. Even high-fidelity polymerases introduce occasional errors, and as the number of cycles increases, these errors become more frequent. Excessive cycling also increases the risk of chimera formation, where incomplete DNA strands anneal to unrelated fragments, creating artificial sequences that were not present in the original sample.

A better approach is to use fewer PCR cycles and pool several independent reactions. This strategy minimises amplification bias, reduces errors and preserves low abundance sequences. Running multiple lower-cycle reactions and combining them ensures a more accurate representation of the original DNA, while still producing enough product for downstream applications.

Identifying and Addressing False Priming in Amplicon Sequencing: The Importance of Pilot Studies and Primer Mismatch Analysis

False priming in amplicon sequencing can be a challenge, and at the GDC we strongly recommend conducting pilot studies to gain a deeper understanding of your samples before committing to large-scale sequencing projects. Once a pilot study has been completed, there are several ways to investigate false priming. One of the most effective and straightforward methods is to examine mismatches at primer binding sites in the raw reads and compare them to other regions within the same reads. For example, primer mismatches can often be observed as follows:

R1 Primer-Site Mis-Matches (top 10):

 ...
 0.3% 5'-.............S...-3'
 0.3% 5'-.......G.........-3'
 0.3% 5'-............W.CA.-3'
 0.4% 5'-.....G...........-3'
 0.4% 5'-...............A.-3'
 0.5% 5'-................G-3'
 0.5% 5'-......G..........-3'
 0.5% 5'-..............CAG-3'
 0.6% 5'-...............AG-3'
65.1% 5'-.................-3' (no mismatches)

R2 Primer-Site Mis-Matches (top 10):

 ...
 0.4% 5'-..........G.........-3'
 0.4% 5'-.................AA.-3'
 0.4% 5'-................TAA.-3'
 0.5% 5'-...........G........-3'
 0.5% 5'-................TA.T-3'
 0.5% 5'-.....A..............-3'
 0.7% 5'-...................T-3'
 0.9% 5'-................T.AT-3'
 1.3% 5'-.................AAT-3'
63.5% 5'-....................-3' (no mismatches)

Firstly, the frequency of perfect primer sites (i.e. no mismatches) is relatively low (~65%). Secondly, the sequence diversity within the primer binding site is unexpectedly high, further indicating potential problems. In addition, the mismatches cluster at the 3' end of the primers, a possible footprint of proofreading (3'→5' exonuclease) activity chewing back and re-extending mismatched primers, which increases the likelihood of false priming. Investigating these mismatches provides valuable insight into PCR-related problems, enabling more informed decisions and refinement of sequencing approaches.
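A diagnostic like the listings above can be produced with a few lines of code. The sketch below compares a read's primer region against the (possibly degenerate) primer and returns a dot for every IUPAC-compatible position; aggregating the resulting patterns with a counter across all reads yields frequency tables of the kind shown.

```python
# IUPAC nucleotide ambiguity codes
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def mismatch_pattern(primer: str, read: str) -> str:
    """Dot for each position where the read base is compatible with the
    (degenerate) primer base, the read base itself otherwise."""
    return "".join(
        "." if r in IUPAC[p] else r
        for p, r in zip(primer, read)
    )

# A read with a single terminal mismatch against the 515F primer:
pattern = mismatch_pattern("GTGYCAGCMGCCGCGGTAA",
                           "GTGTCAGCAGCCGCGGTAG")
# -> "..................G"
```

Feeding every read's pattern into collections.Counter and sorting by frequency reproduces the mismatch tables above, making 3'-clustered mismatches immediately visible.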

OTUs, ASVs, and the Database Dilemma

Metabarcode or amplicon sequencing groups sequences into Operational Taxonomic Units (OTUs), based on similarity, or Amplicon Sequence Variants (ASVs, also known as ZOTUs), based on exact variant identity. Once OTUs or ASVs have been identified, they are typically annotated by assigning a taxonomic label (such as genus or species) based on comparison with reference databases. However, these annotations are predictions, not definitive answers, and several factors can influence their accuracy.

Factors Affecting Annotation Accuracy

  • Database quality: The accuracy of taxonomic predictions depends on the completeness and quality of the reference databases. If an OTU or ASV does not have a closely matching sequence in the database, the annotation may be uncertain or incorrect. This is particularly problematic for underrepresented or poorly studied organisms.

  • Sequence similarity: The more similar an OTU or ASV is to known sequences in the database, the more reliable the annotation. However, sequencing errors, biological variability or the presence of closely related species can make predictions difficult.

  • Clustering threshold (for OTUs): OTUs are clustered based on a certain similarity threshold (e.g. 97%), which can influence how fine or broad the taxonomic groups are. ASVs, on the other hand, represent exact sequence variants, which can provide higher resolution but may also be more susceptible to sequencing errors.

  • Environmental Influence: Factors like rare or novel species in the environment can affect annotation accuracy, especially when databases lack references for these organisms.
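To make the role of the clustering threshold concrete, here is a deliberately naive greedy centroid clustering sketch: each sequence joins the first centroid it matches at or above the threshold, otherwise it seeds a new OTU. Real pipelines (e.g. VSEARCH, DADA2) use proper alignments, abundance sorting and error models; this toy version assumes equal-length, ungapped sequences.

```python
def identity(a: str, b: str) -> float:
    """Naive identity over the shorter sequence length (no gaps)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_cluster(seqs: list[str], threshold: float = 0.97) -> list[list[str]]:
    """Greedy centroid clustering: the first member of each cluster
    acts as its centroid; input should be sorted by abundance."""
    clusters: list[list[str]] = []
    for s in seqs:
        for c in clusters:
            if identity(c[0], s) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

With a 97% threshold and 50 bp sequences, a single mismatch (98% identity) is absorbed into an existing OTU while two mismatches (96%) found a new one, illustrating how the threshold directly sets the granularity of the resulting groups.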

Taxonomic Prediction Approaches and Tools

Different software applications and algorithms use different methods to assign taxonomic labels to OTUs or ASVs. Some tools rely on simple BLAST-based methods, while others use more advanced machine learning or statistical models. For example, SINTAX (a fast k-mer-based classifier available in USEARCH and VSEARCH) is widely used, but other approaches may perform better depending on the dataset and research question.

  • SINTAX: This classifier assigns taxonomy by k-mer matching of query sequences against a reference database (often SILVA or Greengenes), using bootstrapping to estimate confidence. It is fast and memory-efficient, making it particularly effective for high-throughput sequencing data, although it can be less accurate for low-abundance taxa or in environments with a high number of novel sequences.

  • RDP Classifier: Another popular tool that uses a naive Bayesian classifier to assign taxonomy. It is widely used and generally reliable, but its performance can degrade with sequences that are very distantly related to reference sequences or if the database is incomplete.

  • QIIME’s Taxonomy Assignment: This method often uses the BLAST algorithm for sequence matching. While QIIME's taxonomy assignment is reliable for well-characterised taxa, its performance may be inconsistent for low abundance or novel sequences.

OTU and ASV annotation are essential but predictive steps in metabarcode and amplicon sequencing analysis. The annotation process is influenced by the quality of reference databases, sequence similarity and the method used for taxonomic classification. It's important to remember that these annotations are based on comparisons with known sequences and errors can occur. The choice of classifier - such as SINTAX, RDP or QIIME's method - depends on the specific needs of the study, including speed, accuracy and handling of novel sequences. Researchers should always interpret taxonomic predictions with caution and, where possible, combine them with additional experimental methods to validate their results.

Reference Databases in Taxonomic Annotation

In metabarcode or amplicon sequencing, the accuracy of OTU or ASV annotation is highly dependent on the reference database used for comparison. There are several off-the-shelf reference databases available for taxonomic assignment, each designed for specific taxonomic groups or sequencing approaches. Some commonly used databases are:

SILVA: A comprehensive and widely used database primarily for ribosomal RNA (rRNA) sequences, particularly useful for sequencing markers such as 16S and 18S. SILVA is highly curated and regularly updated, but is most effective for studies focusing on bacterial, archaeal and eukaryotic microbes.

UNITE: Specialises in fungal ITS (Internal Transcribed Spacer) sequences, making it a valuable resource for studies focusing on fungal diversity. However, its applicability is limited to studies specifically targeting fungal taxa.

PR2: Another database tailored to eukaryotic microbes, particularly protists. It is a good choice for environmental studies focusing on protist diversity, but may not be appropriate for other microbial groups.

MIDORI: A curated database of eukaryote mitochondrial sequences (such as COI) derived from GenBank, widely used for animal metabarcoding in environmental studies. Like other specialised databases, it is very useful for specific research but may not cover all taxa or regions of interest.

While these reference databases are valuable resources, they often have specific limitations:

  • Narrow focus: Many databases, such as UNITE or PR2, specialise in particular taxa (fungi or protists) and are not suitable for studies involving other groups, such as bacteria or archaea. This can limit their usefulness if your project involves a wider range of organisms.

  • Primer bias: The effectiveness of these databases also depends on the primers used for sequencing. If your primers do not match the regions or taxonomic groups well represented in the database, the annotations may be less accurate or incomplete. This is particularly an issue with databases that focus on specific regions, such as SILVA, which may not be ideal for primers designed for specific taxonomic groups that are not well represented in the database.

  • Incomplete taxonomic coverage: Some databases may not include certain species or strains, especially from understudied environments. This can lead to misidentifications or missed taxa, especially for rare or novel species.

Reference databases are essential for taxonomic annotation in metabarcode and amplicon sequencing, but they have certain limitations, such as their specific taxonomic focus or primer compatibility. While well-established databases such as SILVA, UNITE, PR2 and MIDORI are valuable resources, they may not be the best fit for every project. In such cases, building a custom reference database using NCBI sequences provides a flexible alternative that can be tailored to the specific needs of your project.

Alternative: Building a Custom Reference Database

An alternative approach is to build a custom reference database using sequences from larger, more comprehensive resources such as NCBI. This approach is particularly useful if you're using a classifier such as SINTAX, which can tolerate some mismatches or problematic references, allowing greater flexibility in dealing with diverse or less well characterised taxa.

Use the NCBI database: NCBI hosts a large number of publicly available nucleotide sequences from various organisms, which can be extracted and used to build a custom reference database for your project. This method allows you to target specific taxa of interest and filter sequences based on your experimental design, such as focusing on particular regions, taxonomic groups or primer binding sites.

Build the database: You can use tools such as BLAST or vsearch to extract relevant sequences from NCBI based on the focus of your project, and then filter these sequences to create a database that best suits your needs. This approach can be particularly useful if you're targeting taxonomic groups that are poorly represented in existing databases.
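As a minimal sketch of the filtering step, the code below keeps only candidate reference sequences that contain a given primer binding site and fall within a plausible amplicon length range. The FASTA records and primer site here are hypothetical placeholders; in practice, in silico PCR tools or vsearch would handle degenerate primers and reverse complements as well.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Minimal FASTA parser: header line -> concatenated sequence."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line.strip().upper()
    return records

def filter_refs(fasta_text: str, primer_site: str,
                min_len: int = 100, max_len: int = 2000) -> dict[str, str]:
    """Keep references that contain the primer binding site and have a
    plausible length for the targeted amplicon."""
    return {h: s for h, s in read_fasta(fasta_text).items()
            if min_len <= len(s) <= max_len and primer_site in s}
```

Filtered records can then be re-labelled with the header format your classifier expects (e.g. SINTAX-style taxonomy strings) before building the final database.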

Advantages of a custom database: A custom database allows you to create a more relevant reference collection based on the specific primers used in your sequencing project. It can also help overcome the limitations of general purpose databases by including novel or underrepresented sequences, thereby improving annotation accuracy.

Blackman et al. (2023). General principles for assignments of communities from eDNA: Open versus closed taxonomic databases. Environmental DNA, 5, 326–342.

Challenges of Heteroduplex Formation in PacBio Long-Read Sequencing

PacBio long-read sequencing has transformed genomics by delivering highly accurate, full-length reads through Circular Consensus Sequencing (CCS). However, a persistent challenge with this technology is the formation of heteroduplexes, particularly in applications such as amplicon sequencing.

Heteroduplexes occur when the forward and reverse strands of a DNA molecule contain mismatches, often resulting from genetic variation, PCR errors, or incomplete strand separation. When these mismatched strands are sequenced, they can lead to ambiguous base calls and reduce the accuracy of variant detection.

On previous PacBio platforms, such as the Sequel, if there were significant sequence differences between the forward and reverse strands, the system would output a separate CCS read for each strand.

With Revio, however, the consensus process has changed. Instead of generating separate forward and reverse reads, the system now generates a single 'consensus' sequence that represents the most likely sequence. While this improves the overall read quality for many applications, it can cause problems when heteroduplexes are present.

For some genomic applications, this shift to a single consensus read is beneficial. However, for amplicon sequencing, where precise allele resolution is critical, this approach can mask true genetic variation.

While PacBio's long-read sequencing continues to set new standards for accuracy and completeness, the handling of heteroduplexes remains a nuanced challenge, especially in applications that require precise variant detection.