Data Preparation

Experimental / Sampling Design

The sampling design is simple. There are 48 samples in total. The study compares apples from two different management groups (treatments). For each treatment, six tissue samples were taken from each of four apples.

N(samples): 2 treatments x 4 replicates x 6 tissues = 48
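The count above can be reproduced by enumerating the design. This is a minimal sketch; the sample labels follow the naming used in the statistics table at the end of this page (prefix H or C, tissue name, replicate letter a to d), and the variable names are my own.

```python
from itertools import product

# Labels copied from the sample names in the data-prep table on this page.
TREATMENTS = ["H", "C"]            # management-group prefixes
TISSUES = ["Stem", "StemEnd", "Seeds", "Peel", "FruitPulp", "CalyxEnd"]
REPLICATES = ["a", "b", "c", "d"]  # one letter per apple

# One sample per combination of treatment, tissue and apple.
samples = [
    f"{treatment}-{tissue}-{replicate}"
    for tissue, treatment, replicate in product(TISSUES, TREATMENTS, REPLICATES)
]

# Number of data points per treatment/tissue combination.
per_group = {}
for name in samples:
    treatment, tissue, _ = name.split("-")
    per_group[(treatment, tissue)] = per_group.get((treatment, tissue), 0) + 1

print(len(samples))             # total number of samples
print(set(per_group.values()))  # data points per treatment and tissue
```

Each treatment/tissue combination ends up with four entries, one per apple, which is the number of data points I would expect behind each group in a diversity plot.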

The experimental design is well described in the paper.

Four apples, weighing 190 ± 5 g, were selected from each of the two management groups and each apple was divided into six tissues with the following weights: stem: 0.2 g, stem end: 2 g, peel: 9 g, fruit pulp: 12 g, seeds: 0.2 g, and calyx end: 3 g. Thus, each tissue was represented by four replicates, where each replicate consists of the respective tissue of one apple. source: Wassermann et al. 2019

As I understand it, there are four data points for each combination of treatment and tissue. This is important because my understanding does not correspond with the results of the diversity analysis shown in figure 2 (see alpha diversity). The design also implies that the different tissue samples are not independent, since the six tissues of a replicate are derived from the same apple. This matters for the comparison between treatments, where all samples are grouped together.

Heterogeneous Sample Preparation

I have three major concerns about pooling samples from different tissues to study treatment effects. (i) The tissue sample weight differs by a factor of 60 (0.2 g for stem or seeds versus 12 g for fruit pulp). This could be problematic for comparisons since sample collection (e.g., genomic DNA concentration) can influence the variability in microbial communities (Multinu et al. 2018). (ii) Data obtained from samples using different DNA extraction protocols is problematic. The DNA extraction method can influence the observed bacterial diversity (e.g., Teng et al. 2018) and bias comparisons. (iii) Another good reason for not combining exterior and interior tissue samples is the post-harvest treatment of non-organic apples (see below).

Management Groups

The study focused on Arlet (Swiss Gourmet) apples from two different management groups: organic and conventional. The comparison between treatments is a central part of the study. The fact that the conventionally grown apples underwent further treatment after harvest while the organic apples did not is clearly described in the materials and methods but ignored in the interpretation.

In contrast to the organically produced apples, they underwent the following post-harvest treatments: directly after harvest, apples were short-term stored under controlled atmosphere (1–2 °C, 1.5–2% CO2), washed and wrapped in polythene sheets for sale. Both apple management groups (“organic” and “conventional”) were transported to laboratory immediately and processed under sterile conditions. source: Wassermann et al. 2019

Treatment Bias

Based on a paper by Buchholz et al. (2018), it might be fair to assume that the post-harvest handling of the non-organic apples could have influenced their microbiota composition, at least for communities associated with exterior tissue samples such as peel, calyx, or stem. The authors, however, completely ignore this possibility and conclude that the difference they found between treatments is caused by the management practice. I think it is not legitimate to simplify the comparison to organic versus non-organic apples!

Raw Data

The authors stated that the raw sequence files to support their study are available from the ENA.

The raw sequence files supporting the findings of this manuscript are available from the European Nucleotide Archive (ENA) at the study Accession Number: PRJEB32455. source: Wassermann et al. 2019

Because Illumina MiSeq paired-end 250nt was used for sequencing, I expected to get two files (R1 and R2) per sample. To my surprise, there was only one sequence file per sample available. I asked the authors for help and got the following answer:

We receive our reads from the sequencing company in form of one forward file and one reverse file containing all reads of the whole pool, which, in this case, contained other samples as well. Data are demultiplexed after joining reads and removing barcodes. Those sequences are than provided in ENA. Thus, I cannot sent you the raw files from this pool. I´m sorry for that. source: personal communication with Birgit Wassermann

The raw MiSeq reads are not available. The authors provide merged reads but give little detail about the demultiplexing or merging process.

Raw sequence data preparation and data analysis was performed using QIIME 1.9.1 (Caporaso et al., 2010). After paired reads were joined and quality filtered (phred q20), chimeric sequences were identified using usearch7 (Edgar, 2010) and removed. source: Wassermann et al. 2019

We can discuss the meaning of “raw data”, but I am missing clarity here.

What are raw data?

Raw data from an Illumina MiSeq paired-end run would comprise a forward (R1) and a reverse (R2) read file per sample if standard Illumina adaptors were used and the samples were demultiplexed by the system. Customisation with barcoded primers might not allow automated demultiplexing. Uploading demultiplexed reads is therefore justified, but this needs to be clearly indicated and the process clearly described.

Why does this matter? How the forward (R1) and reverse (R2) reads are merged influences which data are removed. It is not uncommon that removing the low-quality ends of reads increases the merging efficiency. Here is an example from a 16S amplicon sequencing data set produced at the GDC:

# R1:  0nt; R2:  0nt => 50.68% (merging efficiency)
# R1:  5nt; R2: 10nt => 95.62%
# R1: 10nt; R2: 20nt => 98.55%
# R1: 15nt; R2: 30nt => 90.75%

The read-merging step is unclear, and the adjusted quality scores of the merged data are therefore not reproducible.
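To put end trimming into perspective, the overlap that remains between R1 and R2 can be estimated with simple arithmetic. The sketch below assumes 250 nt reads (as stated for the apple study) and a 292 nt amplicon (the in-silico expectation reported further down this page). The trim values are borrowed from the GDC example purely for illustration; that data set may have a different read and amplicon geometry, and the function name is my own.

```python
def expected_overlap(read_len: int, amplicon_len: int,
                     trim_r1: int = 0, trim_r2: int = 0) -> int:
    """Overlap in nt between R1 and R2 after trimming the read ends.

    A negative value means the trimmed reads no longer reach each other.
    """
    return (read_len - trim_r1) + (read_len - trim_r2) - amplicon_len


# Illustrative assumption: 2 x 250 nt reads over a 292 nt amplicon.
for trim_r1, trim_r2 in [(0, 0), (5, 10), (10, 20), (15, 30)]:
    overlap = expected_overlap(250, 292, trim_r1, trim_r2)
    print(f"R1: {trim_r1:2d}nt; R2: {trim_r2:2d}nt => overlap {overlap} nt")
```

Under these assumptions the overlap stays long for every trim setting shown, so moderate trimming costs little overlap while removing the error-prone read ends. Which settings were actually used for the apple data remains unknown.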

Data Preparation

I believe that we should keep as much data as possible during data processing. If we remove data during processing steps (e.g., filtering, merging), we should document it. I have no problem removing data as long as I have good reasons for doing so.

Negative and Positive Controls

The study does not include any negative controls. Therefore, we cannot estimate contamination. This is especially important because the samples were most likely processed by different experimenters (i.e., students) with different levels of lab experience. There are also no positive controls included in the study.

Data Preparation

I downloaded 48 fastq.gz files from the European Nucleotide Archive (ENA) using study number PRJEB32455. In total, there are 8,248,796 amplicon sequences (merged reads). I compared the amplicon length distribution with the expected distribution obtained from in-silico PCR results based on the SILVA 16S database (version 128) and NCBI bacteria genomes (version July 2019).

While the mean amplicon length is similar for the expected (292 nt) and the observed (289 nt) data, the distributions are not. There are many more short fragments in the apple data.
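A length distribution like the observed one can be tallied directly from the downloaded files. This is a minimal sketch, assuming standard four-line FASTQ records; the function names are hypothetical and this is not the script used for the numbers above.

```python
import gzip
from collections import Counter


def length_distribution(lines):
    """Count sequence lengths from an iterable of FASTQ lines (4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each record holds the sequence
            counts[len(line.strip())] += 1
    return counts


def length_distribution_from_file(path):
    """Same as above, reading a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return length_distribution(handle)


def mean_length(counts):
    """Mean sequence length from a length -> count mapping."""
    total = sum(counts.values())
    return sum(length * n for length, n in counts.items()) / total if total else 0.0


# Tiny in-memory example with two records of 4 nt and 6 nt.
example = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "ACGTAC\n", "+\n", "IIIIII\n"]
counts = length_distribution(example)
print(sorted(counts.items()), mean_length(counts))
```

Comparing the full histogram, rather than only the mean, is what reveals the excess of short fragments.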

Primer Trimming

The downloaded sequences still contained the primer region. I did not find any information about primer removal in the paper. I therefore assume that the primer region was not trimmed prior to clustering.

For culture-independent Illumina MiSeq v2 (250 bp paired end) amplicon sequencing, the primers 515f – 806r (Caporaso et al., 2010) were used to amplify the 16S rRNA gene using three technical replicates per sample. source: Wassermann et al. 2019

I assume the original primers 515F and 806R were used to amplify the V4 region of the 16S SSU rRNA gene (Caporaso et al. 2011).

>515f
GTGCCAGCMGCCGCGGTAA
>806r
GGACTACHVGGGTWTCTAAT

I used usearch to remove the primer region, allowing a maximum of one mismatch, but none at the primer end.

usearch v11.0.667_i86linux64
Amplicon range: 100-2000
Number of mis-matches: 1 / not at the end
Coverage: full-length
Wildcards enabled: IUPAC codes

I allowed one mismatch to reduce the data loss at this step.

# Mismatches: 0 =>  7,713,726
# Mismatches: 1 =>    335,973
# Mismatches: 2 =>     16,135
# Mismatches: 3 =>      7,082
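The matching rule (IUPAC wildcards, at most one mismatch, none at the primer end) can be illustrated in a few lines. This is a simplified sketch and not how usearch works internally: it assumes both primers sit exactly at the ends of the merged read, and it interprets "primer end" as the 3' end of the primer. All function names are my own.

```python
# Bases accepted by each IUPAC code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement, including IUPAC ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_matches(primer, site, max_mismatches=1):
    """True if site matches primer with few mismatches and none at the primer 3' end."""
    if len(site) != len(primer):
        return False
    mismatches = [i for i, (p, b) in enumerate(zip(primer, site)) if b not in IUPAC[p]]
    if len(primer) - 1 in mismatches:  # assumption: "primer end" means the 3' end
        return False
    return len(mismatches) <= max_mismatches


def trim_primers(read, fwd, rev, max_mismatches=1):
    """Return the read without primers, or None if a primer is not found at the read ends."""
    if len(read) < len(fwd) + len(rev):
        return None
    head = read[:len(fwd)]
    tail = revcomp(read[-len(rev):])  # reverse primer site in primer orientation
    if primer_matches(fwd, head, max_mismatches) and primer_matches(rev, tail, max_mismatches):
        return read[len(fwd):len(read) - len(rev)]
    return None


FWD = "GTGCCAGCMGCCGCGGTAA"   # 515f
REV = "GGACTACHVGGGTWTCTAAT"  # 806r
insert = "ACGTACGTAC"
read = "GTGCCAGCAGCCGCGGTAA" + insert + revcomp("GGACTACAAGGGTATCTAAT")
print(trim_primers(read, FWD, REV))
```

Setting `max_mismatches=0` in this sketch corresponds to the strictest row of the counts above, while the default tolerates a single internal mismatch per primer.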

Quality Filtering

In the next step, I quality filtered the data using PRINSEQ.

PRINSEQ-lite 0.20.4
Size Range: 100-600
GC Range: 30-70
Min Q Mean: 20
Number of Ns: 0
Low Complexity: dust / 30
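The filter criteria listed above translate into a simple per-read check. The sketch below is only an illustration of the thresholds, not PRINSEQ itself: it assumes Phred+33 encoded quality strings and leaves out the DUST low-complexity filter. The function names are my own.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def gc_percent(seq: str) -> float:
    """GC content of a sequence in percent."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_filter(seq, quality, min_len=100, max_len=600,
                  min_gc=30.0, max_gc=70.0, min_mean_q=20.0, max_n=0):
    """Apply the length, N, GC and mean-quality thresholds (DUST filter not included)."""
    if not seq or not (min_len <= len(seq) <= max_len):
        return False
    if seq.upper().count("N") > max_n:
        return False
    if not (min_gc <= gc_percent(seq) <= max_gc):
        return False
    return mean_phred(quality) >= min_mean_q


good_seq = "ACGT" * 30  # 120 nt, 50% GC
print(passes_filter(good_seq, "I" * 120))  # high-quality read
print(passes_filter(good_seq, "+" * 120))  # same read with low quality scores
```

Counting how many reads fail each criterion separately would also document why data were removed at this step.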

OTU Clustering / Amplicon Sequence Variants

I used two different approaches to get operational taxonomic units (OTUs): UPARSE and UNOISE (Edgar 2016a). I also clustered the zero-radius OTUs (ZOTUs) to account for possible early PCR errors.

After removing chimeric, mitochondrial and chloroplast sequences, the overall bacterial community of all apple samples, assessed by 16S rRNA gene amplicon sequencing, contained 6,711,159 sequences that were assigned to 92,365 operational taxonomic units (OTUs). source: Wassermann et al. 2019

Number of OTUs     : 3,124 (308 Mitochondria / 52 Chloroplast)
Number of ZOTUs    : 3,182 (210 Mitochondria / 70 Chloroplast)

Although the amplicon sequence variant approach (ZOTUs) tends to overestimate the number of OTUs, I still found about 30 times fewer OTUs than the authors. I cannot explain the difference, given that we both used USEARCH.
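The size of the discrepancy follows directly from the counts quoted above. This is a small worked calculation; the second ratio assumes that the mitochondrial and chloroplast counts in parentheses are included in my totals, which is how I read the summary lines.

```python
def fold_difference(reported: int, observed: int) -> float:
    """How many times larger the reported count is than the observed one."""
    return reported / observed


REPORTED_OTUS = 92_365  # Wassermann et al. 2019, after removing organelle sequences

# (total, mitochondria, chloroplast) from the summary lines above
reanalysis = {"OTUs": (3_124, 308, 52), "ZOTUs": (3_182, 210, 70)}

for label, (total, mito, chloro) in reanalysis.items():
    bacterial = total - mito - chloro  # assumption: organelle counts are part of the total
    print(
        f"{label}: {fold_difference(REPORTED_OTUS, total):.1f}x (all), "
        f"{fold_difference(REPORTED_OTUS, bacterial):.1f}x (without organelles)"
    )
```

Either way the ratio is on the order of 30, far more than differences between clustering algorithms usually produce.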

Representative sequences were aligned, open reference database SILVA (ver128_97_01.12.17) was used to pick operational taxonomic units (OTUs) and de novo clustering of OTUs was performed using usearch. source: Wassermann et al. 2019

The data processing of the provided merged reads removed on average 2.5% of the data. The data loss and the average amplicon size were steady among the samples. Sequence depth ranged from 18k to 428k, with fruit pulp samples at the lower end of the range and calyx samples at the top.

Data-Prep Statistics

Sample	Merged	Primer	Clean	MeanLength	Retained%
H-Stem-a	205945	201311	201311	251.7	97.7
H-Stem-b	231409	223661	223661	251.7	96.7
H-Stem-c	217307	211969	211969	250.1	97.5
H-Stem-d	218388	214743	214743	251.5	98.3
C-Stem-a	181279	174667	174667	252.1	96.4
C-Stem-b	144427	141113	141113	252.4	97.7
C-Stem-c	372408	363099	363099	249.9	97.5
C-Stem-d	385316	375813	375813	252.5	97.5
H-StemEnd-a	221403	206834	206834	253.0	93.4
H-StemEnd-b	320614	310926	310926	251.9	97.0
H-StemEnd-c	86899	84075	84075	252.3	96.8
H-StemEnd-d	197981	194176	194176	251.9	98.1
C-StemEnd-a	197662	193093	193093	251.6	97.7
C-StemEnd-b	311501	303738	303738	252.3	97.5
C-StemEnd-c	251718	245561	245561	251.0	97.6
C-StemEnd-d	180369	176591	176591	251.4	97.9
H-Seeds-a	137752	135032	135032	252.3	98.0
H-Seeds-b	91782	89809	89808	252.3	97.8
H-Seeds-c	132468	130086	130086	252.2	98.2
H-Seeds-d	95599	93835	93834	252.2	98.2
C-Seeds-a	54411	52680	52680	252.6	96.8
C-Seeds-b	163095	159090	159090	252.1	97.5
C-Seeds-c	81263	79258	79258	251.8	97.5
C-Seeds-d	95157	92992	92992	252.0	97.7
H-Peel-a	157919	155094	155094	238.6	98.2
H-Peel-b	242554	238273	238273	239.4	98.2
H-Peel-c	111213	108852	108852	245.5	97.9
H-Peel-d	76360	74889	74889	238.7	98.1
C-Peel-a	18693	18266	18266	252.3	97.7
C-Peel-b	39117	38176	38176	252.4	97.6
C-Peel-c	68716	67016	67016	252.3	97.5
C-Peel-d	75548	73492	73492	249.1	97.3
H-FruitPulp-a	39123	38461	38461	250.8	98.3
H-FruitPulp-b	33287	31395	31395	247.9	94.3
H-FruitPulp-c	51723	50754	50754	250.6	98.1
H-FruitPulp-d	30431	29913	29913	248.7	98.3
C-FruitPulp-a	50188	48984	48984	252.1	97.6
C-FruitPulp-b	33734	32922	32922	251.7	97.6
C-FruitPulp-c	84835	82503	82503	252.1	97.3
C-FruitPulp-d	44167	42822	42822	252.3	97.0
H-CalyxEnd-a	225921	222124	222124	242.7	98.3
H-CalyxEnd-b	273104	268632	268632	239.9	98.4
H-CalyxEnd-c	349714	342698	342698	246.0	98.0
H-CalyxEnd-d	158404	155816	155816	250.5	98.4
C-CalyxEnd-a	289295	283139	283139	251.7	97.9
C-CalyxEnd-b	427153	414207	414207	252.6	97.0
C-CalyxEnd-c	350489	340059	340059	248.3	97.0
C-CalyxEnd-d	440955	428514	428514	251.7	97.2
Total	8248796	8041153	8041151
Mean	171850	167524	167524	250.1	97.5
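The last column of the table is consistent with the share of merged reads that survive processing (Clean divided by Merged). The sketch below recomputes it for three rows copied from the table; it is a cross-check under that assumption, not the script that produced the table.

```python
import csv
import io

# Three rows copied from the table above (tab-separated).
TABLE = (
    "Sample\tMerged\tPrimer\tClean\n"
    "H-StemEnd-a\t221403\t206834\t206834\n"
    "C-Peel-a\t18693\t18266\t18266\n"
    "C-CalyxEnd-d\t440955\t428514\t428514\n"
)


def retained_percent(merged: int, clean: int) -> float:
    """Percentage of merged reads remaining after primer trimming and quality filtering."""
    return round(100.0 * clean / merged, 1)


rows = list(csv.DictReader(io.StringIO(TABLE), delimiter="\t"))
retained = {r["Sample"]: retained_percent(int(r["Merged"]), int(r["Clean"])) for r in rows}
depths = [int(r["Clean"]) for r in rows]

print(retained)
print(min(depths), max(depths))  # lowest and highest sequence depth among these rows
```

The three rows include the smallest and the largest sample in the table, which is where the depth range of roughly 18k to 428k reads comes from.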

Literature

Multinu et al. (2018) Systematic Bias Introduced by Genomic DNA Template Dilution in 16S rRNA Gene-Targeted Microbiota Profiling in Human Stool Homogenates. mSphere, 3(2).

Buchholz et al. (2018) The potential of plant microbiota in reducing postharvest food loss. Microbial Biotechnology, 11(6).

Caporaso et al. (2011) Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci USA 108, 4516–4522.

Krzywinski & Altman (2014) Visualizing samples with box plots. Nature Methods. Vol.11 No.2.

McDonald (2014) Handbook of Biological Statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland.

Edgar (2016a) UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing, https://doi.org/10.1101/081257

Edgar (2016b) UCHIME2: improved chimera prediction for amplicon sequencing, https://doi.org/10.1101/074252

Teng et al. (2018) Impact of DNA extraction method and targeted 16S-rRNA hypervariable region on oral microbiota profiling. Scientific Reports 8, 16321.