Data Prep Output - GDC First Aid Kit

Overview

Regardless of the data type we started with, once your data is prepared you should have received three zipped files (the three Rs):

Raw data
Reports
Results

<ProjectNumber>_<RunDate>_<Gene>_[RawData/Results/Reports].zip

Raw data

This is a copy of your sequencing data. Submit it to a read archive and keep a copy for yourself. For Illumina data you should have two fastq files (R1 and R2) per sample, and for PacBio one (CCS) BAM file. There are exceptions, so please ask if you are unsure. The original filenames can be long, redundant and meaningless, so we usually simplify them to include only relevant information (e.g. the sample ID). Here is an example:

MSQ184049_Sample_RH01_S1_L001_R1_001.fastq.gz > RH01_R1.fq.gz
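The renaming shown above can be sketched in shell. This is only an illustration, not the script used during data preparation: the function name simplify_name is hypothetical, and the pattern assumes Illumina-style names of the form <RunID>_Sample_<SampleID>_S<n>_L<lane>_R<1|2>_001.fastq.gz with no underscore inside the sample ID.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the short name from an Illumina-style filename.
# Assumption: <RunID>_Sample_<SampleID>_S<n>_L<lane>_R<1|2>_001.fastq.gz
# and the sample ID itself contains no underscore.
simplify_name() {
  printf '%s\n' "$1" |
    sed -E 's/^.*_Sample_([^_]+)_S[0-9]+_L[0-9]+_(R[12])_001\.fastq\.gz$/\1_\2.fq.gz/'
}

simplify_name "MSQ184049_Sample_RH01_S1_L001_R1_001.fastq.gz"   # prints RH01_R1.fq.gz
```

Names that do not match the assumed pattern are printed unchanged, so it is worth previewing the result (for example with echo) before renaming any file.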

Secure Storage

The raw data is very important and must be kept safe. It is important to understand that you and only you are responsible for your data. So, we leave it up to you to decide what safe means in this case. We will be happy to advise you. A copy should also be deposited at the European Nucleotide Archive (ENA) or the NCBI Sequence Read Archive (SRA) as early as possible.

Reports

We have divided the data preparation process into several steps and a report file is generated for each step. These files are important because they list the program, including version, the parameters applied, any warnings or errors, and contain statistics. We rely on these files to evaluate the data processing. You will need the files for documentation purposes. It is therefore very important that the report files are not altered.

Have a closer look at the files:

# all together
ls -l y_help/[ABCDEFG]*.report
# or choose a specific step (e.g. B)
less y_help/B_*.report

There is also a detailed read (loss) report. For convenience, the text file (txt) can be imported into a spreadsheet editor. It documents the loss of data for each of the processing steps for each sample.

cat y_help/*ReadStats.report # a delimited (;) text file

Pos	Header	Meaning
$1	Sample	Sample ID
$2	Raw	Raw reads
$3	Clean	PhiX and low complexity removed
$4	Merged	R1 and R2 merged
$5	Primer	Primer site removed
$6	Filter	Q-filter, size and CG filter
$7	MeanLength	Mean amplicon length

Data Loss

The overall loss should be below 15%. Some samples, especially negative controls or samples with low counts, might have much higher losses, which can also raise the mean loss rate. Be careful with pooled samples (different amplicons pooled under an identical index): they also show high apparent data losses, because each amplicon is processed separately. The counts for samples with the same index have to be combined.
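To check the loss per sample, the read statistics report can be summarised with awk. The sketch below is not part of the pipeline: the helper name read_loss is hypothetical, the demo numbers are made up, and it assumes that the file has one header line, uses the column order from the table above, and that "loss" means the share of raw reads ($2) that did not reach the Filter column ($6).

```shell
#!/usr/bin/env bash
# Demo input, made up for illustration only (columns as in the table above).
tmp=$(mktemp -d)
cat > "$tmp/demo_ReadStats.report" <<'EOF'
Sample;Raw;Clean;Merged;Primer;Filter;MeanLength
S1;1000;990;950;940;900;253
NC;100;90;60;50;40;250
EOF

# Hypothetical helper: print sample, raw reads, filtered reads and loss in percent.
# Assumption: loss = 1 - Filter/Raw; the header line and samples without raw reads are skipped.
read_loss() {
  awk -F';' 'NR > 1 && $2 > 0 {
    printf "%s\t%d\t%d\t%.1f\n", $1, $2, $6, 100 * (1 - $6 / $2)
  }' "$1"
}

read_loss "$tmp/demo_ReadStats.report"
# S1	1000	900	10.0
# NC	100	40	60.0
```

For the real data, the same function could be pointed at y_help/*ReadStats.report. For pooled samples, the counts of all amplicons sharing an index would need to be summed before the loss is calculated.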

Results

During data processing, various results are stored that can be used later for analysis or to better understand the data and processing. The number of files may seem overwhelming at first, but there is a reason and a logic behind it. There are similar files for different methods (e.g. count tables for UPARSE and UNOISE) and there are support files to help you understand and evaluate the results.

Data Analysis

For data analysis in R, you do not need all the files. The important ones are the annotated count/OTU tables (*_Count_Sintax.txt), the tree files (*.tre) and possibly the zOTU fasta files (*.fa.gz).

e_OTU/<ProjectNumber>_<RunDate>_<Gene>

..._ZOTU_chimera.txt            -> additional chimera evaluation (see report F)
..._ZOTU_CLU.tre                -> cluster based tree (see report E3)
..._ZOTU_Count.map.gz           -> read mapping reports
..._ZOTU_Count_Sintax.txt       -> count table with annotation
..._ZOTU_Count.summary          -> count table summary
..._ZOTU_Count.txt              -> count table
..._ZOTU_CrossTalk.txt.gz       -> cross-talk corrected counts (see cross talk html reports)
..._ZOTU.fa.gz                  -> cluster sequences (fasta)
..._ZOTU_MSA.tre                -> MUSCLE-based tree (see report E3)
..._ZOTU.mx.gz                  -> cluster assignments reports
..._ZOTU.tax                    -> sintax reports

Count Tables (with and without annotation)

For each sequence cluster method we have a number of different files, including the (annotated) count tables.

OTU      : the classic 97% identity (3%-radius) clustering method (UPARSE)
ZOTU     : the newer zero-radius (zOTU) or amplicon sequence variant (ASV) method (UNOISE3)
ZOTU_cXX : the zOTU method with additional clustering

The classic 97% clustering approach is still widely used and many of the criticised shortcomings (e.g. random centroid) have been addressed in UPARSE. However, many prefer the more recent UNOISE approach, which is similar to the amplicon sequence variant (ASV) approach. Depending on the abundance threshold, this method yields many poorly supported (rare) zOTUs and may therefore overestimate diversity. We also apply additional clustering at 99%, 98% and 97% identity to the zOTUs to better understand the clustering results. Typically, the number of OTUs is similar to the number of zOTUs clustered at 98% or 97%. In fact, the number of zOTUs clustered at 97% is usually slightly lower. The UNOISE approach has a built-in error correction step, which may explain the difference.
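To compare the number of clusters between methods, it is enough to count the header lines in the gzipped fasta files. The following is a minimal sketch: the function name count_fasta_records and the demo file names are hypothetical, and the demo sequences are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each gzipped fasta file and its number of records.
# A record is counted as one line starting with ">".
count_fasta_records() {
  for f in "$@"; do
    printf '%s\t%d\n' "$f" "$(gzip -dc "$f" | grep -c '^>')"
  done
}

# Demo with made-up sequences; the file names are placeholders.
tmp=$(mktemp -d)
printf '>Otu1\nACGT\n>Otu2\nACGA\n' | gzip -c > "$tmp/demo_OTU.fa.gz"
printf '>Zotu1\nACGT\n>Zotu2\nACGA\n>Zotu3\nACGC\n' | gzip -c > "$tmp/demo_ZOTU.fa.gz"

count_fasta_records "$tmp"/demo_*.fa.gz
```

With the delivered results, the same call could be made on e_OTU/*.fa.gz to see the OTU, ZOTU and ZOTU_cXX cluster numbers side by side.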

Files for Data Import

The following three files, together with the map (metadata) file, are recommended for data import into Phyloseq. However, only the (annotated) count/OTU table and the map file are required. The tree and sequence files are optional. I usually include the tree file because some data analysis methods (e.g. UniFrac) require it. If you have many (Z)OTUs and longer amplicons, the sequence file can be quite large and slow down data analysis.

Count table    : e_OTU/*_*OTU*_Count_*.txt
Tree file      : e_OTU/*_*OTU*.tre
Fasta sequences: e_OTU/*_*OTU*.fa*

I usually provide a working map file (*_MapFileTemplate.txt). This file might be neither complete nor correct. Please adjust it to your needs, but do not change the first three columns or the last column. Everything in between is yours to change. A word of advice, if I may: use short column IDs (names) without white space (e.g. "sample_tissue" is better than "sample tissue", but "tissue" alone might be even better).

Krona Pie Charts

For a first impression of the diversity of the data, there are interactive Krona pie charts in html format.

ls -lh y_help/*_Krona_Counts.html                            # per sample
ls -lh y_help/*_OTU_Krona_TaxPrevalence_TotalAbundance.html  # all samples together
ls -lh y_help/*_ZOTU_Krona_TaxPrevalence_TotalAbundance.html # all samples together

Taxonomic prevalence: How often a particular taxon occurs (frequency).
Total abundance: The sum of all counts for a given taxon.
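The two measures can be illustrated on a count table. This is a simplified sketch under stated assumptions, not the way the Krona charts are produced: it works per table row rather than per taxon, it takes prevalence to mean the number of samples with a count above zero, and it assumes a tab-delimited table with one header line, the ID in the first column and only numeric sample columns after it. The helper name prevalence_abundance and the demo values are hypothetical.

```shell
#!/usr/bin/env bash
# Demo count table, made up for illustration (tab-delimited, one header line).
tmp=$(mktemp -d)
printf 'ID\tS1\tS2\tS3\nZotu1\t10\t0\t5\nZotu2\t0\t0\t7\n' > "$tmp/demo_Count.txt"

# Hypothetical helper: print ID, prevalence and total abundance for each row.
# Assumption: prevalence = number of samples in which the count is above zero.
prevalence_abundance() {
  awk -F'\t' 'NR > 1 {
    prev = 0; total = 0
    for (i = 2; i <= NF; i++) { if ($i > 0) prev++; total += $i }
    printf "%s\t%d\t%d\n", $1, prev, total
  }' "$1"
}

prevalence_abundance "$tmp/demo_Count.txt"
# Zotu1	2	15
# Zotu2	1	7
```

The annotated tables (*_Count_Sintax.txt) contain an additional non-numeric annotation column, so this sketch is only meant for a plain count table.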

Taxonomic Assignments

Please do not blindly trust any of the taxonomic assignments! They are at best suggestions, and the error rate increases with taxonomic depth. The assignments added to the (z)OTUs in the count tables are filtered but still subject to uncertainty. The confidence threshold used can be found in report file F and should be between 70% and 90%. In addition, the unfiltered assignments are available as a text file (*.tax), so that the confidence values can be examined at each taxonomic level.

grep "Tax Filter" y_help/F_*_TAX.report
head e_OTU/*_ZOTU.tax
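The confidence values in the unfiltered file can be screened with awk. The sketch below assumes the usual tab-delimited SINTAX output, in which the second column holds the ranks with their confidence in parentheses (e.g. d:Bacteria(1.00),g:Bacillus(0.95)); please verify this against your own *.tax file first. The helper name low_confidence_genus, the default cutoff of 0.8 and the demo lines are hypothetical.

```shell
#!/usr/bin/env bash
# Demo input, made up for illustration; assumes the tabbed SINTAX layout
# (ID, ranks with confidence, strand, filtered ranks).
tmp=$(mktemp -d)
printf 'Zotu1\td:Bacteria(1.00),p:Firmicutes(0.99),g:Bacillus(0.95)\t+\td:Bacteria,p:Firmicutes,g:Bacillus\n' > "$tmp/demo.tax"
printf 'Zotu2\td:Bacteria(1.00),p:Proteobacteria(0.85),g:Vibrio(0.42)\t+\td:Bacteria,p:Proteobacteria\n' >> "$tmp/demo.tax"

# Hypothetical helper: list entries whose genus confidence is below a cutoff.
# Usage: low_confidence_genus FILE [CUTOFF]   (cutoff defaults to 0.8)
low_confidence_genus() {
  awk -F'\t' -v cutoff="${2:-0.8}" '{
    n = split($2, ranks, ",")
    for (i = 1; i <= n; i++) {
      if (ranks[i] ~ /^g:/) {
        conf = ranks[i]
        sub(/^.*\(/, "", conf)   # drop everything up to the opening parenthesis
        sub(/\)$/, "", conf)     # drop the closing parenthesis
        if (conf + 0 < cutoff + 0) printf "%s\t%s\n", $1, ranks[i]
      }
    }
  }' "$1"
}

low_confidence_genus "$tmp/demo.tax" 0.8
# Zotu2	g:Vibrio(0.42)
```

Changing the /^g:/ pattern to another rank prefix (for example /^f:/) would screen a different taxonomic level in the same way.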

It is important to know the reference used. This information can also be found in the report file F. Factors such as diversity, accuracy and sequence coverage of the reference will influence the result. More is not always better. The species composition of the reference should cover the expected diversity and include outgroups. References with incorrect, missing or inadequate designations (e.g. environmental sample) are of little use. In addition, the reference sequences should cover the entire amplicon.

grep "Application" -A 2 y_help/F_*_TAX.report

Count Table Summary

There are summary reports for the (Z)OTU tables giving statistics such as number of reads (amplicons), number of samples, number of (Z)OTUs, and a detailed count statistic.

cat e_OTU/*_Count.summary

Tree Files

There are two different tree files for each cluster method: a multiple sequence alignment (MSA) tree and a cluster-based (CLU) tree. These files can help to understand and refine the OTU clustering and to identify outliers.

ls -1 e_OTU/*.tre
# *_CLU.tre
# *_MSA.tre

Note: The tree will be very approximate in both cases, but the accuracy of the tree doesn't matter much for most analyses.

Uncross

The UNCROSS algorithm detects and filters cross-talk (sample mis-assignment) in an OTU table. In a typical run, about 2% of reads are assigned to the wrong sample ID. If some samples contain large numbers of reads for a given OTU, these often "bleed" into other samples that may not in fact contain that OTU. This can cause many spurious counts that should be zero, giving inflated estimates of richness, alpha diversity and beta diversity. Although the method is still experimental at best, it is useful for getting a better understanding of possible cross-talk.