OSU-Mapping

Overview

  • OSUs: In a first step, I de-replicate the references to get unique sequences, which I call operational sequence units (OSUs).
  • Mapping: Next, I map the cleaned reads to the OSUs using different identity (ID) and query-coverage (QC) thresholds.

OSU-Mapping

BAFU 12S Reference Sequences ➜ OSUs

There are 902 FASTA sequences in the BAFU 12S MiFish reference. After de-replication, 107 (11.9%) unique sequences remain. We call these unique references OSUs (operational sequence units) and use only them for the mapping.

We adjusted the sequence headers of the OSUs as follows. If all merged sequences belong to the same species, the species name is kept. The number of merged sequences is recorded as the suffix n at the end of the header. In addition, the header records which of the species' clusters the sequence represents and how many such clusters exist.

Here is an example for Alburnoides bipunctatus:

>Alburnoides_bipunctatus_1of2_n9

Alburnoides bipunctatus has two groups (clusters) of unique sequences. This header denotes the first of them (1of2), which contains 9 sequences (n9).

Clusters that contain a mix of species, as in Salmo, are designated sp:

>Salmo_sp_1of3_n47
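
The de-replication and header scheme described above can be sketched as follows. This is a minimal illustration, not the pipeline that was actually used: the function name `dereplicate`, the input format, the numbering of clusters in input order, and the `mixed_sp` fallback for clusters spanning several genera are all assumptions.

```python
from collections import defaultdict

def dereplicate(records):
    """Collapse identical sequences and build OSU-style headers.

    records: list of (species, sequence) tuples, e.g. ("Salmo_trutta", "ACGT").
    Returns a list of (header, sequence) tuples.
    """
    # Group species labels by identical sequence (case-insensitive).
    # A plain dict keeps insertion order, so clusters follow input order.
    clusters = {}
    for species, seq in records:
        clusters.setdefault(seq.upper(), []).append(species)

    # Name each cluster: the species if all members agree, else "<Genus>_sp".
    named = []
    for seq, members in clusters.items():
        if len(set(members)) == 1:
            name = members[0]
        else:
            genera = {m.split("_")[0] for m in members}
            # "mixed_sp" is a hypothetical fallback for multi-genus clusters.
            name = f"{genera.pop()}_sp" if len(genera) == 1 else "mixed_sp"
        named.append((name, seq, len(members)))

    # Count clusters per name so each header can carry "<i>of<k>".
    totals = defaultdict(int)
    for name, _, _ in named:
        totals[name] += 1

    seen = defaultdict(int)
    result = []
    for name, seq, n in named:
        seen[name] += 1
        result.append((f"{name}_{seen[name]}of{totals[name]}_n{n}", seq))
    return result

# Toy input (made-up sequences, for illustration only).
toy = [
    ("Alburnoides_bipunctatus", "ACGT"),
    ("Alburnoides_bipunctatus", "acgt"),
    ("Alburnoides_bipunctatus", "ACGA"),
    ("Salmo_trutta", "GGGG"),
    ("Salmo_salar", "GGGG"),
]
for header, seq in dereplicate(toy):
    print(f">{header}\n{seq}")
```

With the toy input, the two identical Alburnoides sequences collapse into one cluster and the shared Salmo sequence receives an `sp` label.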

Mapping

For the read mapping, two parameters matter: sequence identity (ID) and query coverage (QC). We aim for highly accurate (1% ∼ 1 mismatch) and long (1% ∼ 2 nt) alignments. Both parameters influence the number of reads that can be mapped to the OSUs: stringent mapping criteria reduce mapping efficiency, while relaxed ones cause false assignments.
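
To make the two thresholds concrete, here is a small sketch of how a single alignment could be tested against them. The definitions are assumptions (identity as matches per alignment column, query coverage as aligned query bases per query length); mapping tools differ in the details, and the 170 nt read length in the example is hypothetical.

```python
def passes_thresholds(matches, alignment_columns, aligned_query_bases,
                      query_length, min_id=0.97, min_qc=1.00):
    """Return True if an alignment satisfies both mapping thresholds.

    Assumed definitions (tools may differ):
      identity       = matches / alignment columns
      query coverage = aligned query bases / query length
    """
    identity = matches / alignment_columns
    query_coverage = aligned_query_bases / query_length
    return identity >= min_id and query_coverage >= min_qc

# Hypothetical 170 nt read aligned end to end with 2 mismatches:
# identity is 168/170, roughly 0.988.
READ_LENGTH = 170
print(passes_thresholds(168, 170, 170, READ_LENGTH, min_id=0.98))  # True
print(passes_thresholds(168, 170, 170, READ_LENGTH, min_id=0.99))  # False
```

At this read length, each percentage point of identity or coverage corresponds to one or two nucleotides, which is why small threshold changes can move many reads between mapped and unmapped.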

Special Cases

If there is a tie between two or more OSUs, the read count goes to the first sequence (drip down). OSUs without counts are removed. Furthermore, the positions of mismatches are not taken into account: a potentially more informative nucleotide position counts as just another mismatch.

We test different identity (ID) and query coverage (QC) thresholds to find good mapping parameters.
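
The tie rule and the removal of empty OSUs can be sketched as below. This is an illustration under assumptions: "first" is taken to mean first in reference order, each read is represented by a dictionary of its passing hits, and the names `assign_reads`, `hits_per_read`, and `osu_order` are hypothetical.

```python
from collections import Counter

def assign_reads(hits_per_read, osu_order):
    """Count reads per OSU, resolving ties by reference order.

    hits_per_read: list of dicts {osu_name: identity}, one dict per read,
                   holding only hits that passed the ID/QC thresholds.
    osu_order:     OSU names in the order they appear in the reference.
    """
    rank = {name: i for i, name in enumerate(osu_order)}
    counts = Counter()
    for hits in hits_per_read:
        if not hits:
            continue  # read could not be mapped
        best = max(hits.values())
        tied = [name for name, score in hits.items() if score == best]
        # Tie: the read goes to the OSU that comes first in the reference.
        winner = min(tied, key=rank.__getitem__)
        counts[winner] += 1
    # OSUs without counts are dropped from the result.
    return {name: counts[name] for name in osu_order if counts[name] > 0}

reads = [
    {"B": 0.99, "A": 0.99},  # tie between A and B, goes to A
    {"C": 1.00, "A": 0.98},  # C is the single best hit
    {},                      # unmapped read
    {"B": 1.00},
]
print(assign_reads(reads, ["A", "B", "C", "D"]))
```

Note that only the score decides the winner: where the mismatches sit in the alignment plays no role, which mirrors the limitation described above.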

ID QC Reads Mapped % Mapped N(OSU)
1.00 1.00 4,592,035 51.7 87
1.00 0.99 4,592,105 51.7 87
1.00 0.98 4,592,107 51.7 87
1.00 0.97 4,592,107 51.7 87
0.99 1.00 6,817,133 76.7 91
0.99 0.99 6,817,272 76.7 91
0.99 0.98 6,817,276 76.7 91
0.99 0.97 6,817,278 76.7 91
0.98 1.00 7,540,240 84.8 92
0.98 0.99 7,540,417 84.8 92
0.98 0.98 7,540,425 84.8 92
0.98 0.97 7,540,429 84.8 92
0.97 1.00 7,624,965 85.8 92
0.97 0.99 7,625,190 85.8 92
0.97 0.98 7,625,210 85.8 92
0.97 0.97 7,625,214 85.8 92

The identity (ID) threshold has a much bigger influence on the mapping rate than the query coverage (QC) threshold. The best results were obtained with the default values (ID 0.97, QC 1.00).

Here are the 15 OSUs to which no read could be mapped. With one exception (Alburnus arborella), these OSUs correspond to species that are missing from this dataset:
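
The difference in influence can be read directly from the corner values of the table. The following lines are only a worked arithmetic check on four rows copied from the table; the variable names are illustrative.

```python
# (ID, QC) -> reads mapped, copied from four rows of the table above.
reads_mapped = {
    (1.00, 1.00): 4_592_035,
    (1.00, 0.97): 4_592_107,
    (0.97, 1.00): 7_624_965,
    (0.97, 0.97): 7_625_214,
}

# Relaxing ID from 1.00 to 0.97 while keeping QC at 1.00.
id_effect = reads_mapped[(0.97, 1.00)] - reads_mapped[(1.00, 1.00)]

# Relaxing QC from 1.00 to 0.97 while keeping ID at 0.97.
qc_effect = reads_mapped[(0.97, 0.97)] - reads_mapped[(0.97, 1.00)]

print(f"ID effect: {id_effect:,} reads")  # 3,032,930
print(f"QC effect: {qc_effect:,} reads")  # 249
```

Relaxing identity gains about three million reads, whereas relaxing query coverage gains a few hundred.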

- Alburnus_arborella_2of2_n1
- Alosa_sp_1of1_n8
- Ameiurus_melas_1of1_n5
- Cobitis_bilineata_1of1_n18
- Lampetra_planeri_1of1_n4
- Micropterus_salmoides_1of1_n5
- Neogobius_kessleri_1of1_n2
- Padogobius_bonelli_1of1_n9
- Rhodeus_amarus_1of3_n3
- Rhodeus_amarus_2of3_n11
- Rhodeus_amarus_3of3_n1
- Rutilus_pigus_1of2_n1
- Rutilus_pigus_2of2_n9
- Sabanejewia_larvata_1of1_n1
- Silurus_glanis_1of1_n12

OSU Extension

About 15% of the reads cannot be mapped to the OSUs. Several reasons could explain this observation. One is that species are missing from the reference. For example, the dataset contains positive controls with cod, which is missing from the reference. We therefore extend our OSUs with consensus sequences of Gadus chalcogrammus and Gadus morhua. We also know from the OTU annotation that some sequences do not come from fish. For this reason, we add further potential target sequences to the OSU mapping reference:

- Homo sapiens
- Sus scrofa
- Sturnus vulgaris
- Rupicapra rupicapra
- Ichthyosaura alpestris
- Bos taurus
- Turdus merula

With these additional species, the OSU reference grows to 116 sequences.

ID QC Reads Mapped % Mapped N(OSU)
1.00 1.00 4,683,758 52.7 95
1.00 0.99 4,684,009 52.7 95
1.00 0.98 4,684,012 52.7 95
1.00 0.97 4,684,012 52.7 95
0.99 1.00 6,991,312 78.6 99
0.99 0.99 6,991,768 78.6 99
0.99 0.98 6,991,773 78.6 99
0.99 0.97 6,991,775 78.6 99
0.98 1.00 7,753,236 87.2 101
0.98 0.99 7,753,792 87.2 101
0.98 0.98 7,753,802 87.2 101
0.98 0.97 7,753,806 87.2 101
0.97 1.00 7,877,395 88.6 101
0.97 0.99 7,878,060 88.6 101
0.97 0.98 7,878,082 88.6 101
0.97 0.97 7,878,086 88.6 101

The mapping results changed, albeit only slightly. The number of OSUs with read counts increased for all settings, and so did the mapping efficiency. As expected, the OSUs without counts (n=15) did not change.
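
As a quick consistency check, the numbers for the default setting (ID 0.97, QC 1.00) can be compared between the original and the extended reference. The values are copied from the two tables and the reference sizes stated in the text; the dictionary layout is only illustrative.

```python
# Values for ID 0.97 / QC 1.00, copied from the two mapping tables.
base = {"reference_size": 107, "reads": 7_624_965, "percent": 85.8, "osus_hit": 92}
extended = {"reference_size": 116, "reads": 7_877_395, "percent": 88.6, "osus_hit": 101}

# OSUs without any mapped read, before and after the extension.
unmapped_base = base["reference_size"] - base["osus_hit"]
unmapped_extended = extended["reference_size"] - extended["osus_hit"]

# Additional reads and percentage points gained by the extension.
extra_reads = extended["reads"] - base["reads"]
extra_points = round(extended["percent"] - base["percent"], 1)

print(unmapped_base, unmapped_extended)  # 15 15
print(f"{extra_reads:,} additional reads (+{extra_points} percentage points)")
```

All nine added sequences receive reads at this setting, so the set of 15 OSUs without counts stays the same.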

Mapping Analysis

Positive Control Samples (PCP)

The dominant species in the positive control samples (PCP) is Atlantic cod (Gadus morhua). Keep in mind that the reference contains only two cod species (Gadus morhua and Gadus chalcogrammus). Although the positive control contains only one species, we also find others.

Heat Map PCP

Negative Control Samples (FCN and PCN)

Only two (filter) negative control samples have a substantial number of total counts (>50). The composition of the two samples differs. Lake trout (Salvelinus namaycush) is unique to sample FCN-1A, but the zOTU approach with MitoFish-Plus annotation shows about 10% bacterial contamination. Negative sample FCN-1C contains Squalius cephalus and Salmo carpio in equal parts. Again, the zOTU approach with MitoFish-Plus annotation shows bacterial contamination.

Heat Map FCN

>ZOTU78 (Alpha proteobacterium - Roseomonas?)
AGCCGCGGTAATACGAAGGGAGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGCGTAGGCGGCGGCGTAAGTCAGA
TGTGAAATTCCTGGGCTTAACCTGGGGGCTGCATTTGAGACTGCGTTGCTAGAGGACGGAAGAGGCTCGTGGAATTCCCA
GTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCGAGCTGGTCCATTACTGACGCTGAGGC
GCGATAGCGTGGGGAG
>ZOTU81 (Sediminicoccus rosea - Acetobacteraceae?)
AGCCGCGGTAATACGAAGGGGGCTAGCGTTGCTCGGAATGACTGGGCGTAAAGGGCGCGTAGGCGGCACAACTCGTCAGG
CGTGAAATTCCTGGGCTTAACCTGGGGGCTGCGTTTGATACGGTTGAGCTAGAGGATGGAAGAGGCTCGTGGAATTCCCA
GTGTAGAGGTGAAATTCGTAGATATTGGGAAGAACACTGGTGGCGAAGGCGGCGAGCTGGTCCATTACTGACGCTGAGGC
GCGAAAGCGTGGGGAG

Samples

Pairwise sample comparisons show that the composition is similar for the zOTU and the OSU approach. Differences exist for the rare zOTUs and OSUs, respectively. OSUs always carry a species label, but this can give a false sense of security. In addition, the reference restricts which species the OSU approach can detect, whereas the zOTU approach is not restricted and has the potential to discover more species (e.g. 1% diatoms).

Heat Map Example #1

Heat Map Example #2

Top zOTUs and OSU Tables

MitoFish vs MitoFishPlus vs MIDORI
zOTUs vs OSU