Data Preparation

Handout

Video Material by Robert Edgar

Q&A with Robert Edgar (drive5)

Workflow Worries

Although the workshop focuses on data analysis and not on project planning, sample design or data preparation, some of the following questions may be interesting to think about or discuss. There is no single right answer to any of the questions, and all suggestions are worth a thought or two. The answers were provided by the participants of last year's MDA and are neither exhaustive nor free of errors. Suggestions, extensions and constructive contributions are welcome.

Feel free to participate in the discussion.

Planning

What are the challenges of analysing shotgun metagenomic versus amplicon based data?

  • Host DNA contamination can take up a large fraction of the reads → is a depletion step (e.g. rRNA depletion) mandatory, or do you need to sequence deeper?
  • Data analysis for metagenomics is more challenging → e.g. (meta-)assemblies are more complex
  • For metagenomics, more DNA template is needed.
  • Filtering reads is more complex for metagenomic sequences, requiring complexity filters
  • Database completeness

What are the advantages of analysing shotgun metagenomic over amplicon based data?

  • Primer-independent estimate of species richness.
  • Annotation based on multiple regions.
  • Allows for multi-locus phylogenetic analysis rather than a single metric of relatedness.
  • Sequences taxonomy and functions at the same time, and from more organisms (i.e. not only bacteria). It may even be possible to couple taxonomy and function.

What are the disadvantages of analysing shotgun metagenomic over amplicon based data?

  • Primer-independent estimate of species richness.
  • A large part of the data may have too low a level of complexity/diversity.
  • More difficult to assign taxonomy(?)
  • Sample depth/sensitivity to rare species is lower for metagenomes
  • Costly and labour-intensive(?)

Should I use Illumina MiSeq, Illumina Nextseq, PacBio or ONT (Oxford Nanopore Technology) to sequence my samples?

  • It depends on the length of your amplicon and on the taxonomic information you want to obtain.
  • It depends on your research question.
  • HiSeq is these days also very much an option, as long as your amplicon fits into < 285 nt reads (e.g. the V4 region of 16S).
  • MinION allows sequencing samples in less well-equipped places (e.g. research in developing countries), but the high error rate can still be a problem.
  • MiSeq = good to cover complex community
  • PacBio = good to get taxonomic assignments at (or below) species level

What is the purpose and what are the limits of a power analysis?

  • Determine the sample size and the number of replicates (see the sketch below).
  • MUST BE RUN BEFORE CONDUCTING THE EXPERIMENT.
  • It is inappropriate to run it post hoc to justify the number of samples sequenced.
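
A minimal sketch of such a calculation, assuming a two-sample t-test design and made-up values for effect size, alpha and power (statsmodels is used here; your own pilot data should drive the numbers):

```python
# A minimal power-analysis sketch (assumed values, not a recommendation):
# estimate how many replicates per group are needed to detect a given
# effect size with a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.8,   # assumed (Cohen's d); take from pilot data if possible
    alpha=0.05,        # significance level
    power=0.8,         # desired power
    alternative="two-sided",
)
print(f"Replicates needed per group: {n_per_group:.1f}")
```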

What is the difference between a non-coding amplicon (e.g. 16S) in comparison to a coding amplicon (e.g. COI). Would you treat the data differently?

  • Coding amplicons translate into proteins (amino acids), while non-coding ones do not.
  • I would not treat the data differently.
  • Many functional genes can be transferred horizontally between bacterial taxa, so the assignment should be taken with care.

What is the minimum respective maximum amplicon size?

  • Minimum size: the amplicon should be long (and diverse/variable) enough to cover species diversity.
  • Maximum size: paired-end reads (length depends on the sequencing platform) should be mergeable to cover the whole amplicon; large 16S amplicons with conserved regions in the middle are more prone to form chimeras during PCR.
  • Depends on the choice of sequencing platform? Illumina differs from the others?

How can you determine the amplicon size (range)?

  • Run them on a gel first, then check with e.g. a TapeStation and calculate (add up target, primers and adapters…?)
  • Run an in-silico PCR and determine the mean length of the amplicons (see the sketch below). Careful though, the reference database might not be complete.
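
A very rough in-silico PCR sketch along those lines; it only finds exact primer matches (no degenerate bases, no mismatches allowed), and the primer and reference sequences are placeholders:

```python
# A rough in-silico PCR sketch: find exact primer sites in a reference
# sequence and report the resulting amplicon lengths. Degenerate bases and
# mismatches are ignored here; primer and reference sequences are made up.
def revcomp(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def amplicon_lengths(reference, fwd_primer, rev_primer):
    """Return lengths of amplicons delimited by fwd_primer and the
    reverse complement of rev_primer on the forward strand."""
    rev_site = revcomp(rev_primer)
    lengths = []
    start = reference.find(fwd_primer)
    while start != -1:
        end = reference.find(rev_site, start + len(fwd_primer))
        if end != -1:
            lengths.append(end + len(rev_site) - start)  # primer-to-primer length
        start = reference.find(fwd_primer, start + 1)
    return lengths

# toy example with made-up sequences
ref = "GGGG" + "ACGTACGT" + "A" * 250 + "TTAACCGG" + "CCCC"
print(amplicon_lengths(ref, "ACGTACGT", revcomp("TTAACCGG")))  # -> [266]
```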

How do you determine/find your locus/gene specific primers?

  • Reference genome assemblies of the target species. If not available, BLAST published sequences and look for conserved regions around variable ones.
  • In the literature or from project-specific websites (e.g. The Earth Microbiome Project).
  • Search specific primer databases (e.g. probeBase)
  • Download related sequences from e.g. NCBI, align the sequences, find conserved regions (see the sketch below) and design primers (e.g. with Primer3)
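
A minimal sketch of the "find conserved regions" step, assuming you already have an alignment of equal-length sequences; the toy alignment below is made up:

```python
# A minimal sketch for spotting conserved stretches in an existing multiple
# sequence alignment (e.g. sequences downloaded from NCBI and aligned
# beforehand). The alignment below is a made-up toy example.
from collections import Counter

def column_conservation(alignment):
    """Fraction of the most common character per alignment column."""
    scores = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores.append(counts.most_common(1)[0][1] / len(column))
    return scores

aln = [
    "ACGTACGTTT",
    "ACGTTCGTTT",
    "ACGTGCGATT",
]
for pos, score in enumerate(column_conservation(aln)):
    flag = "conserved" if score == 1.0 else "variable"
    print(f"position {pos}: {score:.2f} ({flag})")
```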

What are the differences (good or bad) between using Illumina-specific indexes (barcodes) versus using barcoded primer sequences?

  • Using barcoded primer sequences avoids one PCR step, saving time and potential bias. However, this PCR might not amplify as much diversity as one without barcodes, and maybe there is a bias depending on the barcode used (??).
  • Ordering barcoded primer sequences might be more expensive, as the sequence must include: [Illumina adaptor]-[shift]-[barcode]-[universal primer]

Can you combine data from two different runs?

  • Technical replicates (in contrast to biological replicates) are not needed, and in general identical runs can (potentially) be merged.
  • It is better to combine all samples in one run and repeat the run if the coverage is too low. Having the same samples in both runs would allow testing for possible batch effects.
  • It is possible. Re-run the pipeline (data prep) using all the data from the two runs.
  • But it might lead to different results.
  • If the samples and the runs were treated the same way, yes.
  • Yes, and including a few of the same samples in both runs might help “align” the two datasets (see the sketch below).
  • The easiest combination is for a library re-run, but substantial statistical checks are required to merge the data. After checking for linear correlation and finding no significant difference in ADONIS, a merge is potentially advisable. Notably, merging re-runs (rather than treating them as separate samples) is recommended to avoid accidentally inflating the power of downstream statistics.
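
A minimal sketch of such a check on samples shared between two runs, using a rank correlation per shared sample (toy numbers; a formal ADONIS/PERMANOVA on distance matrices would be done separately):

```python
# A minimal sketch for checking agreement between two runs using samples
# sequenced in both (sample names and counts are made up).
import pandas as pd
from scipy.stats import spearmanr

# toy OTU tables: rows = OTUs, columns = samples present in both runs
run1 = pd.DataFrame({"sampleA": [120, 30, 0, 5], "sampleB": [80, 10, 3, 0]})
run2 = pd.DataFrame({"sampleA": [110, 25, 1, 7], "sampleB": [95, 12, 2, 1]})

for sample in run1.columns:
    rho, pval = spearmanr(run1[sample], run2[sample])
    print(f"{sample}: Spearman rho = {rho:.2f} (p = {pval:.3f})")
```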

Design

Why would you use wobble bases in a primer and what are the possible consequences?

  • Wobbles can improve taxonomic coverage in cases where the sequences at the primer sites differ between species.
  • To minimize chimera formation.
  • Degenerate variants are not necessarily present in equal amounts, which can distort community analyses. The purification method used by the manufacturer can have a large influence (HPLC vs. RPC). The “best” method (if affordable) may be to order each variant separately and mix them at equal amounts (see the sketch below). Eventually, the variant known to be predominant in the community may be ordered and mixed in separately to ensure it is present at sufficient amounts (??).
  • Primers with wobble bases can also lead to more unspecific amplification.
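
A small sketch of expanding a degenerate primer into its explicit variants, e.g. to order and mix them separately; the 515F-style primer shown is just an example:

```python
# Expand a primer containing IUPAC wobble bases into all explicit variants.
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_degenerate(primer):
    """Return every explicit sequence encoded by a degenerate primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer.upper()))]

variants = expand_degenerate("GTGYCAGCMGCCGCGGTAA")  # 515F-style primer, as an example
print(len(variants), "variants")  # Y x M -> 2 x 2 = 4
for v in variants:
    print(v)
```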

What polymerase would you use for the targeted PCR and why?

  • One with a good balance between high fidelity and yield.
  • Hot start polymerase if an environmental sample is used.

What are possible disadvantages of proofreading polymerases?

  • Too taxon specific, might miss other species/genera.
  • Can lead to the creation of chimeras.
  • The exonuclease activity might cleave off the last nucleotide of a degenerate primer and thereby remove its taxon specificity.

How many PCR cycles would you use to amplify the locus/gene of interest?

  • PCR1: 25 cycles to amplify template with target-specific primers.
  • PCR2: 10 cycles to add barcodes
  • Use a qPCR to work out the efficiency of your primers; typically 30–35 cycles are used for COI, but for eDNA studies we have used 35–45 in PCR1 followed by 8–10 in PCR2
  • A number of cycles close to saturation.
  • As low as possible.
  • The more cycles are run, the more the different biases may distort the community composition.
  • The optimal number can be determined with qPCR. In case of large differences (> 5 cycles), grouping samples into “low” and “high” concentration groups may be a good idea.

Low-diversity samples cause difficulties with the MiSeq. What can you do to abate the problem?

  • Add more PhiX (but you lose data), mix amplicons of different types (but of similar length), or add spacers of random nucleotides and random length to the primers to bring the clusters on the flow cell out of phase.

Do you need biological and/or technical replicates?

  • Technical replicates are usually not needed if everything is kept constant.
  • Biological replicates are important.
  • Always consider: what is the targeted scientific question? Certain questions require more biological replicates than others.

Do you need negative and/or positive samples and why?

  • Both negative and positive controls are "easy" and cheap and should therefore be included. The benefit outweighs the cost!
  • Recommended to have both. Negative to check for possible contamination, positive to compare runs and for error correction.
  • Using mock communities of known composition and abundance can help correct for primer biases towards particular taxa.
  • Positive samples are recommended first as a gold standard, because they also incorporate the negative concept: contaminations (non-mock-community sequences) can be detected in them.
  • Using mock communities with highly diluted species (e.g. a geometric distribution) can be very useful for determining the retrieval of rare species.
  • With low input DNA, the amplification of rare species can be very poor and subject to randomness.

What would be a good negative sample?

  • For plant roots: sterile root material
  • In general: water control (to see any contaminants during the entire sample preparation process).
  • The same negative control throughout the whole library preparation.

What could you use as a positive sample?

  • Mock community.
  • Standards (qPCR).
  • DNA from a single known organism may also work.
  • DNA sample of known identity/composition.
  • A tissue DNA extract from a species that is not present in your samples but amplifies with your primers.
  • Best would be a mock/established community in a matrix reflecting any potential contaminants from your initial sample.

What are chimeras and how can you deal with them?

  • A sequence that results from the accidental merging of two separate sequences (e.g. 16S fragments from two different species merged into the same 16S).
  • Chimera detection is difficult, but we can assume that chimeras, although similar, are not necessarily identical to each other. Abundance sorting can help (e.g. remove singletons), and BLAST searches might also help: identical or nearly identical hits, even to unknown records in the database, indicate that the query sequence is probably not a chimera, while partial hits might indicate chimera formation.

Can we avoid chimeras?

  • Not fully, but we can minimise it, for example with the amplicon design or by optimising the PCR conditions.

Quality Control

The average phred quality score of your data is low (<20). What is a phred quality score and what do you do?

  • A bunch of low-quality sequences remain within the fastq files. Filter the data for quality and determine the number of reads remaining. Additionally, check what the phred profile looks like across the read; a potential trail-off at the end of the read could indicate that trimming would also improve the average score.
  • Check whether you have phred 33 or 64 :) (see the sketch below)
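
A minimal sketch of both checks on an uncompressed FASTQ file (the file name and the offset heuristic are assumptions; dedicated tools such as FastQC report this as well):

```python
# Guess the Phred offset (33 vs 64) from the ASCII range of the quality
# strings and report the mean quality score per read; "reads.fastq" is a
# placeholder file name.
def read_fastq_quality_lines(path):
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 3:          # every 4th line holds the quality string
                yield line.rstrip("\n")

def guess_offset(quality_strings):
    """Very rough heuristic: ASCII codes below 59 only occur with offset 33."""
    min_char = min(min(q) for q in quality_strings)
    return 33 if ord(min_char) < 59 else 64

quals = list(read_fastq_quality_lines("reads.fastq"))
offset = guess_offset(quals)
means = [sum(ord(c) - offset for c in q) / len(q) for q in quals]
print(f"Guessed offset: {offset}")
print(f"Mean Phred score across reads: {sum(means) / len(means):.1f}")
```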

Around 20% of your reads have an ambiguous nucleotide (N) at position 244. What can you do about it?

  • Potentially employ a trimming program (such as Trimmomatic) to remove a few bases from the end of the reads

The phred quality score of your R2 reads drops below 20 (99% accuracy) after position 200. What do you do?

  • Trim the last bases (~10 nt)

The total amount of overrepresented sequences found in each library is around 50%. What are you doing?

  • If this is an amplicon dataset, it may indicate a contamination (in samples with high richness) or simply a sample with low richness → or just some dominant OTUs!?
  • If this is a metatranscriptomic dataset, it is likely the presence of rRNA within the data. Apply an in-silico rRNA filtering program.
  • Do complexity filtering (remove low-complexity reads; see the sketch below)
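
A small sketch of one possible complexity filter, scoring each read by the Shannon entropy of its dinucleotide composition; the threshold and reads are illustrative only:

```python
# Drop reads whose dinucleotide composition has low Shannon entropy.
import math
from collections import Counter

def dinucleotide_entropy(seq):
    """Shannon entropy (bits) of the dinucleotide composition of a read."""
    kmers = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

reads = ["AAAAAAAAAAAAAAAA", "ACGTGGCTAGCTTACG", "ATATATATATATATAT"]
threshold = 1.5  # assumed cut-off in bits; tune on your own data
for read in reads:
    ent = dinucleotide_entropy(read)
    status = "keep" if ent >= threshold else "discard (low complexity)"
    print(f"{read}: entropy = {ent:.2f} -> {status}")
```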

About 25% of the reads contain adaptor sequences. How can you explain these high values?

  • The amplicon region is both variable and rather short (around 300-400 bp); therefore, read through can occur.

Your negative controls have similar read counts compared to your samples. What do you do?

  • Look at the sequence richness and possibly the taxonomic level, i.e. is it just one OTU/strain that is dominant in your NC (= a contamination you can adjust for)?
  • The question is what to do when the sequence richness is high in your NC.
  • Check if the OTUs of the NC are also in your samples. If so, there is overall contamination: re-start.
  • Check the NC of the 1st PCR.
  • If the read counts in the NC are very low and randomly distributed: the sequencing process can create artefacts that look like sequences from your samples; consider cleaning your OTU table from this noise by subtracting the read counts of the NC from all OTUs in each sample (see the sketch below).
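
A minimal sketch of that subtraction idea on a toy pandas OTU table (all counts are made up; whether subtraction is appropriate at all depends on your data):

```python
# Subtract negative-control read counts from every sample in a toy OTU table
# (rows = OTUs, columns = samples) and clip negative values at zero.
import pandas as pd

otu = pd.DataFrame(
    {"sample1": [500, 40, 3], "sample2": [300, 60, 1], "NC": [2, 0, 3]},
    index=["OTU_1", "OTU_2", "OTU_3"],
)

cleaned = otu.drop(columns="NC").sub(otu["NC"], axis=0).clip(lower=0)
print(cleaned)
```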

Some of your samples have almost no reads. What do/can you do?

  • Discard the samples / re-sequence if possible
  • Check the barcoding
  • Check the primers; in case you used universal primers, consider using some more targeted ones (e.g. Archaea- or Bacteria-only)?
  • Check the qPCR results; did the extraction work in the first place?
  • Check whether the sample was pooled into the final library (if not, just re-pool and sequence again).

You have multiple GC-content peaks. What do you do?

  • If the run is combined (different samples from different people), it's not really a problem.

Read Merging

Less than 50% of the R1 and R2 can be merged. What can/should you do?

  • Consider a rerun?
  • Additionally, you can treat the reads as single reads rather than as pairs. You can also annotate R1 and R2 independently and compare the annotations to determine whether they agree.

About 5% of the reads cannot be merged. What can you do to improve this?

  • The required overlap length can be shortened and the number of allowed errors can be increased, but the trade-off is a potentially higher error rate.

What parameters could influence the merging rate?

  • Similarity threshold
  • Overlap size
  • Quality filtering
  • Varying fragment lengths of your community (e.g. due to introns)
  • We could either trim the reads before we merge, or we could increase the maximum error rate in the overlap to improve the merging rate. What do you think is better?
  • Trimming reads will probably make the matching step go a little bit quicker, so trim very low-quality bases (uninformative) right away and then increase the maximum error rate until you get a good result (see the sketch below).
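
A naive sketch of what a merger does with these parameters: slide the reverse-complemented R2 along R1 and accept the longest overlap with at most a given number of mismatches (real mergers also use quality scores; the reads below are toys):

```python
# Naive read-pair merging: find the longest overlap between R1 and the
# reverse complement of R2 with at most `max_diffs` mismatches.
def revcomp(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def merge_pair(r1, r2, min_overlap=10, max_diffs=1):
    r2rc = revcomp(r2)
    for overlap in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        tail, head = r1[-overlap:], r2rc[:overlap]
        diffs = sum(a != b for a, b in zip(tail, head))
        if diffs <= max_diffs:
            return r1 + r2rc[overlap:]   # keep R1 bases in the overlap
    return None                          # no acceptable overlap found

r1 = "ACGTACGTACGTAAACCCGGGTTT"
r2 = revcomp("CCCGGGTTTTTTGGGCCCAAATTT")   # toy R2 overlapping the end of R1
print(merge_pair(r1, r2, min_overlap=8))
```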

What could go wrong in the merging step that might have an influence for the data interpretation?

  • No overlap because the amplicon is longer than ~550 nt (the maximum that can be covered by 2 × 300 nt paired-end reads while keeping enough overlap).

Primer Trimming

How many mismatches can be tolerated in the primer region?

  • 1 to 2 [why?]
  • Will mismatches in the primer sequence affect amplification rate?
  • Mismatches can drastically affect the PCR amplification efficiency.
  • It really depends on the position within the primer; generally the influence is most detrimental in the last 3–4 bases of the primer (3'). Otherwise it is really difficult to predict, and some mismatches may have almost no influence at all (see the sketch below).
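
A small sketch of primer trimming with a mismatch tolerance, honouring IUPAC wobble codes; the primer and reads are example sequences only:

```python
# Compare a (possibly degenerate) primer against the start of each read and
# trim it when at most `max_mismatches` positions disagree.
IUPAC = {
    "A": set("A"), "C": set("C"), "G": set("G"), "T": set("T"),
    "R": set("AG"), "Y": set("CT"), "S": set("CG"), "W": set("AT"),
    "K": set("GT"), "M": set("AC"), "B": set("CGT"), "D": set("AGT"),
    "H": set("ACT"), "V": set("ACG"), "N": set("ACGT"),
}

def trim_primer(read, primer, max_mismatches=1):
    """Return the read without its primer, or None if the primer does not match."""
    if len(read) < len(primer):
        return None
    mismatches = sum(read[i] not in IUPAC[primer[i]] for i in range(len(primer)))
    return read[len(primer):] if mismatches <= max_mismatches else None

primer = "GTGYCAGCMGCCGCGGTAA"   # 515F-style primer, as an example
reads = [
    "GTGCCAGCAGCCGCGGTAATACGTAGGG",   # matches (wobbles resolved)
    "GTGCCAGCAGCCGCGGTCATACGTAGGG",   # one mismatch near the 3' end
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAA",   # no primer at all
]
for r in reads:
    print(trim_primer(r, primer, max_mismatches=1))
```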

Searching for the full-length primer sequences results in a very low trimming rate (<20%). Removing 3 nucleotides at the 3-prime end increases trimming rate substantially. What do you do?

  • You should not remove the 3-prime end if the primer you used is specific.
  • Run some tests and try to figure out why it happened.

Do you need to trim the primer region of the merged reads and why?

  • Yes, to get rid of the adapters.
  • Primers may also contain ambiguous bases, which adds artifactual nucleotide diversity to the reads [why is diversity artifactual?]

Quality Filtering

The quality filtering step is removing 30% of your data. What can/should you do?

  • It's not good; be aware of that and consider a rerun?
  • If the quality filtering only applies to the OTU-clustering pipeline, whereas the OTU table itself is constructed by mapping ALL (unfiltered) reads against the OTUs obtained from the filtered data, it may not be such a problem, and more filtering may reduce artifacts.
  • With high length variability (e.g. ITS), it may be important to make sure that there is no bias in sequence removal (long sequences tend to have more errors → may be filtered out); it may be better to use an error-rate limit than an absolute number of expected sequence errors (see the sketch below).
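
A minimal sketch contrasting an absolute expected-error (maxEE) cut-off with an expected-error-rate cut-off; the quality strings and thresholds are made up:

```python
# Convert Phred scores to error probabilities, sum them per read, and filter
# either by an absolute maxEE or by an expected error *rate* (errors per base).
def expected_errors(quality_string, offset=33):
    """Sum of per-base error probabilities, i.e. the expected number of errors."""
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quality_string)

quals = {
    "short_good":     "I" * 250,                 # Q40 throughout
    "long_with_tail": "I" * 350 + "5" * 120,     # long read with a Q20 tail
}
max_ee, max_ee_rate = 1.0, 0.005
for name, q in quals.items():
    ee = expected_errors(q)
    # the long read fails the absolute cut-off but passes the rate cut-off,
    # illustrating why a rate can be fairer for reads of variable length
    print(f"{name}: EE = {ee:.2f}, rate = {ee / len(q):.4f}, "
          f"pass maxEE={max_ee}: {ee <= max_ee}, "
          f"pass rate<={max_ee_rate}: {ee / len(q) <= max_ee_rate}")
```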

Size selection is removing 45% of your data. What should you do?

  • Think about whether you can also continue your analysis with shorter reads; if so, change the ‘size’ parameter.

Clustering

The clustering produces over 10’000 OTUs. What could be the explanation?

  • The number may be overestimated, e.g. due to short, undesired sequences or errors introduced during PCR. A lot of singleton OTUs, a lot of rare OTUs, spurious OTUs (biases).
  • If your source sample is known to have a high community complexity (e.g. soil), is this even a problem?

What are chimeric DNA sequences and how would you remove them?

  • One sequence containing sub-sequences from different organisms.
  • I use DECIPHER or UCHIME to remove chimeras in my data; usearch/vsearch also have functions to remove them (see the toy sketch below). Chimeras are mixed products of different amplicons; proof-reading Taq polymerases and conserved regions within the amplicon increase the risk.
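
A toy illustration of the chimera idea and a naive parent-based check (the left and right halves of a chimeric read best match different parents); real tools are far more sophisticated, and the sequences below are simulated:

```python
# Build a chimera from two simulated parents and flag it because its two
# halves best match different parents.
import random

random.seed(1)
parent_a = "".join(random.choice("ACGT") for _ in range(200))
parent_b = "".join(random.choice("ACGT") for _ in range(200))
chimera = parent_a[:100] + parent_b[100:]   # "PCR jump" halfway through

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_parent(fragment, start, parents):
    """Name of the parent matching this fragment best at the same coordinates."""
    return max(parents, key=lambda n: identity(fragment, parents[n][start:start + len(fragment)]))

parents = {"parent_a": parent_a, "parent_b": parent_b}
left, right = chimera[:100], chimera[100:]
left_hit, right_hit = best_parent(left, 0, parents), best_parent(right, 100, parents)
print(f"left half -> {left_hit}, right half -> {right_hit}")
print("looks chimeric" if left_hit != right_hit else "looks clean")
```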

What is the difference between phylotype-based and OTU-based methods?

  • Phylotype-based methods take the evolutionary process into account, i.e. how closely two genotypes are related, whereas OTU-based methods simply go by sequence similarity. For simple evolution models this might amount to the same thing, but you can also use fancier, more realistic evolution models.

An OTU is conventionally defined as containing sequences that are no more than 3% different from each other.
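
A naive sketch of that greedy 97% clustering idea, using simple positional identity instead of a real alignment; the sequences are toys:

```python
# Greedy clustering: take sequences in order of decreasing abundance and join
# each one to the first centroid it is >= 97% identical to, else open a new OTU.
def identity(a, b):
    """Fraction of identical positions (assumes equal-length, aligned reads)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs_by_abundance, threshold=0.97):
    centroids = []          # list of (centroid_sequence, member_sequences)
    for seq in seqs_by_abundance:
        for centroid, members in centroids:
            if identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            centroids.append((seq, [seq]))   # new OTU centroid
    return centroids

seqs = [
    "ACGT" * 25,              # most abundant, becomes the centroid of OTU 1
    "ACGT" * 24 + "ACGA",     # 1 difference in 100 nt -> joins OTU 1
    "TTTT" * 25,              # very different -> new OTU
]
for n, (centroid, members) in enumerate(greedy_cluster(seqs), start=1):
    print(f"OTU_{n}: {len(members)} member sequence(s)")
```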

Why are we using OTU as analysis unit?

  • We don’t have reference databases with the resolution to identify sequences to species level, so we group sequences into OTUs for comparison. It is a technical solution to sequencing-based output, as close as possible to “species” or “genus”.

OTUs are obtained from clustering and not from classification. Why?

  • It’s faster… ?
  • Our taxonomic databases are not good enough for this yet?
  • OTUs are based on the sequencing result and can therefore capture new or uncultivated taxa.

Explain why amplicon-based sequencing usually overestimates the community diversity or species abundance.

  • Errors at the PCR steps. Abundance can change due to amplification bias.
  • ASVs can be more sensitive, so new “species” are created even when there are just 1 or 2 base-pair differences.
  • Spurious sequences might increase diversity, as can homologous sequences (e.g. multiple gene copies) that are similar but not identical within one organism.

Annotation

Most of your OTUs have missing or bad annotation. What can you do?

  • You could try publishing the data as new taxa never found before. However, you should first trace back possible problems, like poor quality filtering (too short reads, poor quality, etc.).
  • 1) Try another, more suitable database.
  • 2) Build a phylogenetic tree.
  • 3) Lower the similarity percentage.
  • 4) Try different classifiers.
  • 5) Size exclusion / quality filtering of the OTUs.

What are the criteria of a good reference database?

  • High-quality sequences + fairly good diversity.
  • A good reference database should have high-quality sequences with good annotation; if you are looking at species diversity, it should also include haplotype diversity.

What is the difference between a classifier and a simple (best) BLAST hit approach?

  • I think it's a problem with the database: if you BLAST and there are many gaps, the result you get is likely wrong, whereas with a classifier you might not get a hit, but at least you get close to the “truth”?
  • BLAST is based on local alignment and may therefore in some cases not take the full sequence into account, just a subset. Classifiers take the whole sequence into account and also try to estimate probabilities for a sequence belonging to a certain species and to higher levels of the hierarchy (but the quality of these estimates depends very much on database quality).