Clustering

Overview

The cleaned reads are dereplicated prior to clustering. I clustered the reads with two different approaches: the classical 97% identity clustering as implemented in USEARCH::UPARSE, and the more recent zero-radius clustering as implemented in USEARCH::UNOISE. I used a minimum abundance threshold of 2 for UPARSE and 10 for UNOISE. This is an important step because it removes singletons and rare ZOTUs that are likely artifacts and therefore untrustworthy. The filtering removes only a small percentage of the data, and most reads (98.9%) can be mapped back to the (Z)OTUs to produce count tables.
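
To make the dereplication and abundance filtering concrete, here is a minimal Python sketch of the idea. It is not the USEARCH implementation, and the function names (dereplicate, filter_by_abundance) and the toy reads are assumptions made up for illustration. Only the two thresholds (2 and 10) come from the text above.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into unique sequences with their abundances."""
    return Counter(reads)


def filter_by_abundance(uniques, min_size):
    """Keep only unique sequences observed at least min_size times."""
    return {seq: n for seq, n in uniques.items() if n >= min_size}


# Toy reads, invented for illustration only (not data from this analysis).
reads = ["ACGT"] * 12 + ["ACGA"] * 3 + ["TTGA"]

uniques = dereplicate(reads)

# Threshold of 2 (used here for UPARSE): drops the singleton "TTGA".
uparse_input = filter_by_abundance(uniques, min_size=2)

# Threshold of 10 (used here for UNOISE): also drops the rare "ACGA".
unoise_input = filter_by_abundance(uniques, min_size=10)
```

In this toy case the stricter threshold keeps a single unique sequence, while the lenient one keeps two. The discarded reads are not necessarily lost, since reads can still be mapped back to the retained (Z)OTUs when the count tables are built.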

Data Clustering

A quick look at the raw read counts shows that most of the negative samples failed (as expected), all the positive samples have counts, and, with a few exceptions, most of the remaining samples do as well. The raw read counts per sample range from 8 to 93,441, with a median of 20,450.
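
A range and median like the ones quoted above can be derived from per-sample totals roughly as follows. This is a sketch under assumptions: the sample names and counts are hypothetical placeholders, not values from this data set, and summarize_counts is an illustrative helper rather than part of the original pipeline.

```python
from statistics import median


def summarize_counts(sample_counts):
    """Return the minimum, maximum and median of per-sample read totals."""
    totals = sorted(sample_counts.values())
    return {"min": totals[0], "max": totals[-1], "median": median(totals)}


# Hypothetical per-sample totals, invented for illustration only.
sample_counts = {"neg1": 12, "s1": 15000, "s2": 22000, "pos1": 40000}

summary = summarize_counts(sample_counts)
```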

Raw Read Counts