Data Preparation

Experimental / Sampling Design

The sampling design is simple. There are 48 samples in total. The study compares apples from two different management groups (treatment). For each treatment, they took 6 tissue samples from four apples.

N(samples): 2 treatments x 4 replicates x 6 tissues = 48

The experimental design is well desribed in the paper.

Four apples, weighing 190 ± 5 g, were selected from each of the two management groups and each apple was divided into six tissues with the following weights: stem: 0.2 g, stem end: 2 g, peel: 9 g, fruit pulp: 12 g, seeds: 0.2 g, and calyx end: 3 g. Thus, each tissue was represented by four replicates, where each replicate consists of the respective tissue of one apple. source: Wassermann et al. 2019

The way I understand it, there are 4 data points for each treatments and tissue. This is important because my understanding does not correspond with the results of the diversity analysis as shown in figure 2 (see alpha diversity). The design also indicates that the different tissue samples are not independent samples since they derived from the same apple. This is important for the comparison between treatment where all samples are grouped together.

Heterogeneous Sample Preparation

There are three major concerns I have about pooling samples from different tissues to study treatments effects. (i) The tissue sample weight differs by a factor of 60. This could be problematic for comparisons since sample collection (e.g., genomic DNA concentration) can influence the variability in microbial communities (Multinu et al. 2018). (ii) Data obtained from samples using different DNA extraction protocols is problematic. DNA extraction method can influence the observed bacterial diversity (e.g., Teng et al. 2018) and bias comparisons. (iii) Another good reason for not combining (exterior and interior) tissue samples is the post-harvest treatment of non-organic apples (see below).

Management Groups

The studies focused on Arlet (Swiss Gourmet) apples from two different management groups: organic and conventional. The comparison between treatment is a central part of the study. The fact that the conventional grown apples underwent further treatment after the harvest while the organic apple did not is clearly described in materials and methods but ignored in the interpretation.

In contrast to the organically produced apples, they underwent the following post-harvest treatments: directly after harvest, apples were short-term stored under controlled atmosphere (1–2◦C, 1.5–2% CO2), washed and wrapped in polythene sheets for sale. Both apple management groups (“organic” and “conventional”) were transported to laboratory immediately and processed under sterile conditions. source: Wassermann et al. 2019

Treatment Bias

Based on a paper by Buchholz et al. (2018) it might be fair to assume that the post-harvest handling of the non-organic apples could have influenced their microbiota composition at least regarding communities associated with exterior tissue samples like peel, calyx, or stem. The authors, however, completely ignore this possibility and conclude that the difference they found between treatments is caused by the management practise. I think it is not legitimate to simplify the comparison to organic versus non-organic apples!

Raw Data

The authors stated that the raw sequence files to support their study are available from the ENA.

The raw sequence files supporting the findings of this manuscript are available from the European Nucleotide Archive (ENA) at the study Accession Number: PRJEB32455. source: Wassermann et al. 2019

Because Illumina MiSeq paired-end 250nt was used for sequencing, I expected to get two files (R1 and R2) per sample. To my surprise, there was only one sequence file per sample available. I asked the authors for help and got the following answer: 

We receive our reads from the sequencing company in form of one forward file and one reverse file containing all reads of the whole pool, which, in this case, contained other samples as well. Data are demultiplexed after joining reads and removing barcodes. Those sequences are than provided in ENA. Thus, I cannot sent you the raw files from this pool. I´m sorry for that. source: personal communication with Brigit Wassermann 

Ihe raw MiSeq reads are not available. The authors provide merged reads but provide little detail about the demultiplexing or merging process.  

Raw sequence data preparation and data analysis was performed using QIIME 1.9.1 (Caporaso et al., 2010). After paired reads were joined and quality filtered (phred q20), chimeric sequences were identified using usearch7 (Edgar, 2010) and removed. source: Wassermann et al. 2019

While we can discuss the meaning of “raw data” but I am missing clarity here.

What are raw data

Raw data from an Illumina MiSeq paired-end data run would comprise a forward (R1) and a reverse (R2) read if standard Illumina adaptors were used and the samples are de-muliplex by the system. Customisation with barcoded primers might not alow an automated demuliplexing. Uploading demultiplexed reads are therfore justified but this needs to be clearly indicated and the process clearly descripted.  

What does this matter? Merging forward (R1) and reverse (R2) read can have an influence on the data removed. It is not uncommon that removing the qualitatively bad ends of reads increases the merging efficiency. Here an exmaple from a 16S amplicon sequenceing data set produced at the GDC:

# R1:  0nt; R2:  0nt => 50.68% (merging efficiency)
# R1:  5nt; R2: 10nt => 95.62%
# R1: 10nt; R2: 20nt => 98.55%
# R1: 15nt; R2: 30nt => 90.75%

The read-merging step is unclear and therfore the adjusted quality scores of the merged data not reproducible.

Data Prepartion

I believe that we should keep as much data during the data process as possible. If we remove data during the data processing steps (e.g., filtering, merging) we should document it. I have no troubles to remove data as long as I have good reasons for doing so.

Negative and Positive Controls

The study does not include any negative controls. Therefore, we cannot estimate contamination. This is especially important because the samples are most likely processed by different experimentator (i.e. students) with different lab experiences. There are also no positive controls included in the study.  

Data Preparation

I downloaded 48 fastq.gz files from the European Nucleotide Archive (ENA) using study number PRJEB32455. In total, there are 8,248,796 amplicon sequences (merged reads). I compared the amplicon length distribution with expected distribution obtain from in-silico PCR results based on SILVA 16S database (version 128) and NCBI bacteria genomes (version July2019).

While the mean amplicon length is similar for the expected (292nt) and the observed (289nt) the distribution is not. There are many more shorter fragments present in the apple data.

Primer Trimming

The downloaded sequences still contained the primer region. I did not find any information about primer removal in the paper. I therefore assume, the primer region was not trimmed prior to clustering.

For culture-independent Illumina MiSeq v2 (250 bp paired end) amplicon sequencing, the primers 515f – 806r (Caporaso et al., 2010) were used to amplify the 16S rRNA gene using three technical replicates per sample. source: Wassermann et al. 2019

I assume the original primer 515F and 806R have been used for the amplification of the V4 region of the 16S SSU rRNA (Caporaso et al. (2011))

>515f
GTGCCAGCMGCCGCGGTAA
>806r
GGACTACHVGGGTWTCTAAT

I used usearch to remove the primer region allowing maximum 1 mismatch but not at primer end.

usearch v11.0.667_i86linux64
Amplicon range: 100-2000
Number of mis-matches: 1 / not at the end
Coverage: full-length
Wildcards enabled: IUPAC codes

I allowed one mismatch to reduce the data loss at this step. 

# Mismatches: 0 =>  7,713,726
# Mismatches: 1 =>    335,973
# Mismatches: 2 =>     16,135
# Mismatches: 3 =>      7,082

Quality Filtering

In a next step, I quality filtered the data using Prinseq.

PRINSEQ-lite 0.20.4
Size Range: 100-600
GC Range: 30-70
Min Q Mean: 20
Number of Ns: 0
Low Complexity: dust / 30

OTU Clustering / Amplicon Sequence Variants

I used two different approaches to get operational taxonomic units (OTUs): UPARSE (Edgar 2016a) and UNOISE (Edgar 2016b). I also did some additional cluster for the zero OTUs (ZOTUs) to account for possible early PCR errors. 

After removing chimeric, mitochondrial and chloroplast sequences, the overall bacterial community of all apple samples, assessed by 16S rRNA gene amplicon sequencing, contained 6,711,159 sequences that were assigned to 92,365 operational taxonomic units (OTUs). source: Wassermann et al. 2019

Number of OTUs     : 3,124 (308 Mitochondria / 52 Chloroplast)
Number of ZOTUs    : 3,182 (210 Mitochondria / 70 Chloroplast)

Although the amplicon sequence variant approach (ZOTU) has a tendency to overestimate the number of OTUs, I still found 30 times fewer OTUs than the authors. I cannot explain the difference because we both used USEARCH.

Representative sequences were aligned, open reference database SILVA (ver128_97_01.12.17) was used to pick operational taxonomic units (OTUs) and de novo clustering of OTUs was performed using usearch. source: Wassermann et al. 2019 

The data processing of the provided merged reads removed on average 2.5% of the data. The data loss and the average amplicon size were steady between samples. The data processing of the provided merged reads removed on average 2.5% of the data. The data loss and the average amplicon size were steady among the samples. Sequence depth ranged from 18k to 428k with fruit pulp samples being at the lower range of the spectrum and calyx samples at the top.

Data-Prep Statistic

Sample Merged Primer Clean MeanLength D%
H-Stem-a 205945 201311 201311 251.7 97.7
H-Stem-b 231409 223661 223661 251.7 96.7
H-Stem-c 217307 211969 211969 250.1 97.5
H-Stem-d 218388 214743 214743 251.5 98.3
C-Stem-a 181279 174667 174667 252.1 96.4
C-Stem-b 144427 141113 141113 252.4 97.7
C-Stem-c 372408 363099 363099 249.9 97.5
C-Stem-d 385316 375813 375813 252.5 97.5
H-StemEnd-a 221403 206834 206834 253.0 93.4
H-StemEnd-b 320614 310926 310926 251.9 97.0
H-StemEnd-c 86899 84075 84075 252.3 96.8
H-StemEnd-d 197981 194176 194176 251.9 98.1
C-StemEnd-a 197662 193093 193093 251.6 97.7
C-StemEnd-b 311501 303738 303738 252.3 97.5
C-StemEnd-c 251718 245561 245561 251.0 97.6
C-StemEnd-d 180369 176591 176591 251.4 97.9
H-Seeds-a 137752 135032 135032 252.3 98.0
H-Seeds-b 91782 89809 89808 252.3 97.8
H-Seeds-c 132468 130086 130086 252.2 98.2
H-Seeds-d 95599 93835 93834 252.2 98.2
C-Seeds-a 54411 52680 52680 252.6 96.8
C-Seeds-b 163095 159090 159090 252.1 97.5
C-Seeds-c 81263 79258 79258 251.8 97.5
C-Seeds-d 95157 92992 92992 252.0 97.7
H-Peel-a 157919 155094 155094 238.6 98.2
H-Peel-b 242554 238273 238273 239.4 98.2
H-Peel-c 111213 108852 108852 245.5 97.9
H-Peel-d 76360 74889 74889 238.7 98.1
C-Peel-a 18693 18266 18266 252.3 97.7
C-Peel-b 39117 38176 38176 252.4 97.6
C-Peel-c 68716 67016 67016 252.3 97.5
C-Peel-d 75548 73492 73492 249.1 97.3
H-FruitPulp-a 39123 38461 38461 250.8 98.3
H-FruitPulp-b 33287 31395 31395 247.9 94.3
H-FruitPulp-c 51723 50754 50754 250.6 98.1
H-FruitPulp-d 30431 29913 29913 248.7 98.3
C-FruitPulp-a 50188 48984 48984 252.1 97.6
C-FruitPulp-b 33734 32922 32922 251.7 97.6
C-FruitPulp-c 84835 82503 82503 252.1 97.3
C-FruitPulp-d 44167 42822 42822 252.3 97.0
H-CalyxEnd-a 225921 222124 222124 242.7 98.3
H-CalyxEnd-b 273104 268632 268632 239.9 98.4
H-CalyxEnd-c 349714 342698 342698 246.0 98.0
H-CalyxEnd-d 158404 155816 155816 250.5 98.4
C-CalyxEnd-a 289295 283139 283139 251.7 97.9
C-CalyxEnd-b 427153 414207 414207 252.6 97.0
C-CalyxEnd-c 350489 340059 340059 248.3 97.0
C-CalyxEnd-d 440955 428514 428514 251.7 97.2
Total 8248796 8041153 8041151
Mean 171850 167524 167524 250.1 97.5

Literature

Multinu et al. (2018). Systematic Bias Introduced by Genomic DNA Template Dilution in 16S rRNA Gene-Targeted Microbiota Profiling in Human Stool Homogenates. mSphere, 3(2).

Buchholz et al. (2018) The potential of plant microbiota in reducing postharvest food loss. Microbial Biotechnology, 11(6).

Caporaso, et al. (2011). Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci USA 108, 4516–4522.

Krzywinski & Altman (2014) Visualizing samples with box plots. Nature Methods. Vol.11 No.2.

McDonald (2014) Handbook of Biological Statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland.

Edgar (2016a) UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing, https://doi.org/10.1101/081257

Edgar (2016b), UCHIME2: improved chimera prediction for amplicon sequencing, https://doi.org/10.1101/074252

Teng et al. (2018) Impact of DNA extraction method and targeted 16S-rRNA hypervariable region on oral microbiota profiling. Scientific Reports 8, 16321.