Zero-Inflated Data
This page covers special considerations for count data with many zeros, which is common in microbiome (OTU/ASV tables), RNA-seq, and ecological species abundance matrices. Standard correlation methods can produce misleading results with such data and should be applied with care.
Why Standard Correlation Falls Short
Three issues arise with zero-inflated count data:
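Distributional assumptions. Pearson correlation measures linear association, and its standard significance test assumes approximately bivariate normal data. Count data with many zeros is strongly right-skewed, so it violates this assumption and lets a few large values dominate the coefficient.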
Compositionality. When data are relative abundances (proportions summing to 1), any increase in one taxon mathematically forces others to decrease. This creates spurious negative correlations between otherwise independent features.
Spurious correlations from rare features. Two rare OTUs that happen to co-occur in just a few samples can appear highly correlated even if the association is meaningless.
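The last two issues are easy to reproduce on simulated data. The sketch below uses only base R; all object names (abs_counts, rel_abund, otu_x, otu_y) and the simulated values are hypothetical and chosen purely for illustration.

```r
set.seed(42)
n_samples <- 200

# (1) Compositional closure: three taxa with independent absolute abundances
abs_counts <- cbind(
  taxon_a = rlnorm(n_samples, meanlog = 5, sdlog = 0.5),
  taxon_b = rlnorm(n_samples, meanlog = 5, sdlog = 0.5),
  taxon_c = rlnorm(n_samples, meanlog = 5, sdlog = 0.5)
)
# Convert to relative abundances so that each row sums to 1
rel_abund <- abs_counts / rowSums(abs_counts)

cor_abs <- cor(abs_counts)["taxon_a", "taxon_b"]  # near zero by construction
cor_rel <- cor(rel_abund)["taxon_a", "taxon_b"]   # pushed negative by closure

# (2) Rare features: two OTUs that are zero everywhere except two shared samples
otu_x <- numeric(n_samples)
otu_y <- numeric(n_samples)
shared <- c(10, 150)
otu_x[shared] <- c(40, 55)
otu_y[shared] <- c(30, 62)
cor_rare <- cor(otu_x, otu_y)  # Pearson, driven entirely by two samples

c(absolute = cor_abs, relative = cor_rel, rare = cor_rare)
```

The taxa are independent before closure, yet their proportions are clearly negatively correlated. The two rare OTUs have a Pearson correlation close to 1 even though the only evidence is two co-occurrences.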
Transformations
Transforming the data before computing correlations is often sufficient for moderately sparse datasets.
Log transformation with a pseudocount handles skewness:
otu_log <- log(otu_table + 1)
cor(otu_log, method = "pearson")
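Centered log-ratio (CLR) is the standard approach for compositional data. It expresses each feature relative to the geometric mean of its sample, which removes the dependence on the arbitrary sample total. The transformed values in each sample sum to zero, so the features are still not fully independent:
library(compositions)
# clr() treats each row as one composition, so samples must be in rows.
# The pseudocount of 1 avoids log(0).
otu_clr <- clr(otu_table + 1)
cor(otu_clr, method = "pearson")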
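To make the CLR step less of a black box, here is a minimal base R sketch of the same calculation. The helper clr_manual and the toy table otu_toy are hypothetical names introduced for illustration; the helper is assumed to match what clr() computes for strictly positive rows, not to replace it.

```r
# Toy count table (hypothetical): rows = samples, columns = taxa
otu_toy <- matrix(c(10, 0, 5,
                     3, 7, 0,
                     0, 2, 8,
                     6, 1, 4),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("s", 1:4),
                                  c("otu1", "otu2", "otu3")))

clr_manual <- function(counts, pseudocount = 1) {
  log_counts <- log(counts + pseudocount)
  # Subtract each sample's mean log abundance,
  # i.e. the log of its geometric mean
  sweep(log_counts, 1, rowMeans(log_counts), "-")
}

otu_clr_toy <- clr_manual(otu_toy)

# Every row of CLR values sums to zero (up to floating-point error)
rowSums(otu_clr_toy)
```

Because each row sums to zero, the CLR covariance matrix is singular. This is one reason correlations on CLR values are usually treated as exploratory.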
Variance stabilising transformation (VST) is recommended for RNA-seq count data:
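library(DESeq2)
# DESeq2 expects genes in rows and samples in columns,
# hence t() on a samples-by-genes matrix
dds <- DESeqDataSetFromMatrix(countData = t(count_matrix),
                              colData = sample_data,
                              design = ~ 1)
# vst() fits on a subset of 1000 genes by default; for smaller matrices
# use varianceStabilizingTransformation() instead
vsd <- vst(dds, blind = TRUE)
# Transpose back so that cor() returns gene-by-gene correlations
cor(t(assay(vsd)), method = "pearson")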
Robust Correlation Methods
When transformation is not appropriate or sufficient, rank-based methods are more robust:
# Spearman: generally the safest default for count data
cor(otu_table, method = "spearman")
# Kendall: better for small samples or many tied ranks
cor(otu_table, method = "kendall")
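It can help to see what the rank-based methods actually compute. The sketch below, on simulated Poisson counts (hypothetical vectors x and y), checks that Spearman correlation is Pearson correlation applied to average ranks, and computes Kendall's tau on the same data for comparison.

```r
set.seed(1)

# Low-mean Poisson counts: many zeros, hence many tied values
x <- rpois(30, lambda = 0.8)
y <- x + rpois(30, lambda = 0.8)

spearman <- cor(x, y, method = "spearman")

# rank() assigns average ranks to ties by default
pearson_on_ranks <- cor(rank(x), rank(y), method = "pearson")

# The two values agree up to floating-point error
all.equal(spearman, pearson_on_ranks)

kendall <- cor(x, y, method = "kendall")
c(spearman = spearman, kendall = kendall)
```

Since only ranks enter the calculation, any monotone transformation of the counts (such as a log with a pseudocount) leaves both coefficients unchanged.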
Specialised Methods for Microbiome Data
For network analysis or correlation inference on highly sparse compositional data, dedicated methods are designed for this setting and are generally a better choice than general-purpose transformations:
- SparCC: sparse correlations for compositional data (SpiecEasi package)
- SPIEC-EASI: sparse inverse covariance estimation for ecological association inference
- propr: proportionality analysis as an alternative to correlation for compositional data
These are beyond the scope of this course but are worth knowing about for microbiome-specific analyses.
Recommended Workflow by Data Type
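Microbiome (OTU/ASV): Filter rare features (present in fewer than 10% of samples), apply the CLR transformation, and compute Spearman correlations on the CLR values. Alternatively, run SparCC on the filtered counts, since it performs its own log-ratio analysis.
RNA-seq: Filter low-count genes, apply VST or rlog (DESeq2), and use Pearson on the transformed values for correlation. For differential expression, use DESeq2 or edgeR on the raw counts rather than the transformed values.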
Ecological species abundance: Filter rare species, use Bray-Curtis dissimilarity for ordination, apply Hellinger transformation before PCA, use PERMANOVA for group comparisons.
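As an illustration, the microbiome workflow (prevalence filter, CLR, Spearman) can be sketched in base R as follows. The data are simulated and every object name (otu_sim, otu_kept, otu_clr_sim, cor_taxa) is hypothetical. The 10% threshold and the pseudocount of 1 follow the text above and may need tuning for real data.

```r
set.seed(7)
n_samples <- 40
n_taxa <- 12

# Simulated sparse counts (hypothetical data): rows = samples, columns = taxa
otu_sim <- matrix(rnbinom(n_samples * n_taxa, mu = 5, size = 0.3),
                  nrow = n_samples,
                  dimnames = list(NULL, paste0("otu", seq_len(n_taxa))))

# Make two taxa deliberately rare so the filter has something to remove
otu_sim[, "otu11"] <- 0
otu_sim[1:2, "otu11"] <- c(4, 9)
otu_sim[, "otu12"] <- 0
otu_sim[3, "otu12"] <- 6

# 1. Prevalence filter: keep taxa present in at least 10% of samples
prevalence <- colMeans(otu_sim > 0)
otu_kept <- otu_sim[, prevalence >= 0.10, drop = FALSE]

# 2. CLR with a pseudocount of 1 (row-wise centring of log counts)
log_counts <- log(otu_kept + 1)
otu_clr_sim <- sweep(log_counts, 1, rowMeans(log_counts), "-")

# 3. Spearman correlation between the remaining taxa
cor_taxa <- cor(otu_clr_sim, method = "spearman")
dim(cor_taxa)
```

Filtering before the CLR step matters because the geometric mean of each sample is computed over the retained taxa only, so the two steps do not commute.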
Further Reading
- Friedman & Alm (2012). Inferring correlation networks from genomic survey data. PLoS Computational Biology.
- Kurtz et al. (2015). Sparse and compositionally robust inference of microbial ecological networks. PLoS Computational Biology.
- Love et al. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology.