Data Prep Short-Reads (PE) - GDC First Aid Kit

The following describes a typical Illumina paired-end based amplicon sequence data processing workflow for us. That is, we assume a two-step PCR library preparation approach for paired-end amplicon-based data with overlapping reads (amplicon length < number of cycles).

The data preparation workflow is a constant construction site and we try to integrate (risky) new methods without eliminating (established) older steps. This results in similar files (e.g. OTU and zOTU) and often undocumented supporting files. Do not hesitate to ask if you have trouble understanding any of the files.

Data Processing Steps - Overview

We have divided the data processing from raw data to annotated count tables into processing steps to make better use of computer resources and speed up processing.

A. Quality Control, Data and Parameter Evaluation

ls -lh y_help/A_*.report

B. Filtering & Read Merging

less y_help/B_*_CTM.report

C. Primer Site Trimming

less y_help/C_*_PCR.report

D. Filtering

# Report
less y_help/D_*_SQF.report
# Read Stats
y_help/p*_ReadStats.report

E. Clustering

# Clustering
less y_help/E1_*_OTU.report
# Mapping
less y_help/E2-1_*_MAP.report #  OTU
less y_help/E2-2_*_MAP.report # zOTU
# Alignments and Stats
less y_help/E3_*_TRE.report

F. Tax Predictions

less y_help/F_*_TAX.report

G. Extra - e.g. ITSx for ITS amplicons

ls -lh y_help/G_*.report

Data Processing Steps - Explained

Step A1 - Quality Control

FastQC and multiQC provide a first and convenient impression of the quality of the run. In fact, many sequencing centres make use of these applications. In addition, we use USEARCH and our own scripts to better understand the data structure and optimise the data processing parameters accordingly.

QC Reports

Quality control applications are easy to use and intuitive, but are often context sensitive. In addition, these applications often use abbreviations that are essential to know. The best report is of no use if you can't understand it. So, RTFM!

Andrews (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Ewels et al. (2016) MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics

Step A2 - Data and Parameter Evaluation

For more detailed data analysis, we use Usearch::eestats and self-written scripts to help us determine the optimal parameters for processing the reads. In this step only the relevant samples are included and the controls are ignored as far as possible. Numerous controls (e.g. unique sample identification, minimum number of reads, etc.) are a very important part of this step.

Step B - Filtering & Read Merging

It is often useful to start by cleaning up the raw data and removing problematic sequences. First we remove all PhiX related reads. PhiX was added to the library to increase complexity. The MiSeq Control Software (MCS) should have already filtered out these reads, but it is better to be safe than sorry. We also use a low complexity filter which helps to clean the data further quickly and easily.

Once we have removed the worst of the data, we merge the read pairs. Often it is worth trimming the 3' end of the reads to increase the efficiency (reduce mismatches) of the merge. Trimming depends on the quality profile of the reads, the expected amplicon length and the overlap. Data loss of approximately 5% is normal. Be careful with negative controls, which often have a much higher data loss rate and can negatively affect the mean.

Illumina PhiX Support Bulletins
Seqtk
Edgar (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19):2460-2461.
Magoc and Salzberg (2011) FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27(21):2957-2963.

Step C - Primer Site Trimming

The primer regions are not part of the amplicon and must be removed. In addition, we can ensure that specific amplicons are further used. For us, it is important that we find the complete primer site and that they have as few mis-matches as possible. Normally we allow 1-2 mis-matches which must not be at the 3'-end. However, the parameters for this step depend on the primer sequences (e.g. length and number of wobble bases). A data loss below 5% is expected and everything above needs further evaluation. Special attention with sequencing data where the expected amplicon length is shorter than the read length (staggered reads).

Sample carryover implications for high sensitivity applications

The MiSeq uses a fixed template loading fluidic pathway to load the sequencing template onto the flow cell prior to a run. It is possible that a small number of template molecules may remain in this instrument fluidic pathway after loading the flow cell. These remaining molecules may be washed onto the flow cell in a subsequent run and may appear as clusters. MiSeq instruments maintained according to Illumina recommendations typically have sample carryover rates of less than 0.1% (one read per thousand). For most applications, these rare reads are discarded and do not pose a problem for analysis.

Martin (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17(1):10-12.
Edgar (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19):2460-2461.

Step D - Filtering

Before we cluster the amplicons, we filter the merged reads. We use an average quality filter (e.g. 20), a size range and a gc constant range selection. Data loss in this step is typically less than 3%.

Schmieder and Edwards (2011) Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27:863-864. (http://prinseq.sourceforge.net)
Edgar (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26:2460-2461.
BBTool - A suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data.

Step E - Clustering

Two different methods are currently in use: UPARSE and UNOISE. Both methods are implemented in USEARCH. In UPARSE, the de-replicated amplicons are clustered with 97% identity and singletons (abundance threshold: 2) and chimeras are removed. In UNOISE, the de-replicated amplicons are first error corrected. Each amplicon is a potential OTU (zero-radius OTU) if it occurs frequently enough (abundance threshold). This approach often results in a high number of rare OTUs and may therefore overestimate diversity. To get a better understanding of the data structure of the clustered sequences, we additionally cluster the zOTUs with 99%, 98% and 97% identity.

To obtain count/OTU tables, the sequences (amplicons) have to be mapped back to the clusters (zOTUs). Sequence identity and coverage play a crucial role in balancing efficiency and accuracy.

Mapping Algorithms

A heuristic approach was used to map the amplicons to the (Z)OTUs to find the alignment with the highest score and generate count tables. The heuristic can be replaced by a full dynamic programming algorithm. This algorithm (Needleman-Wunsch for global alignments) is much slower (days instead of hours) than the heuristic algorithms used by default in USEARCH, but it will find alignments with the highest possible score.

Edgar (2013) UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nature Methods 10:996-998.
Edgar (2016) UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing. bioRxiv 081257; doi: https://doi.org/10.1101/081257.

Step F - Predict Taxonomic Associations

To predict taxonomic assignments, we use SINTAX and an appropriate reference. SINTAX calculates probabilities (confidence values) for each taxonomic level. We usually use cut-offs between 70-90% to control sensitivity. Choosing the right reference is not always easy. It is not important to have a reference that is as large as possible, but one that is adapted to the data. For bacterial 16S amplicon data we often use SILVA and for ITS fungi UNITE is the preferred choice.

Edgar (2016) SINTAX, a simple non-Bayesian taxonomy classifier for 16S and ITS sequences. bioRxiv 074161; doi: https://doi.org/10.1101/074161
Blackmann et al. (2022) General principles for assignments of communities from eDNA: Open versus closed taxonomic databases. Environmental DNA 5(2):326-342.

Prediction

The automatic annotation with e.g. Sintax is simply great and the only way to add tax labels to thousands of zOTUs. Nevertheless, one should be very careful with the results. zOTUs that appear to be important during the analysis should definitely be checked.

Step G - Extra(s)

For some datasets there are additional steps. For example, we use ITSx to find possible non-fungal clusters in ITS amplicon data.

Bengtsson-Palme et al. (2013) ITSx: Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for use in environmental sequencing. Methods in Ecology and Evolution 4:914-919.