Zero-Inflated Data

This page covers special considerations for count data with many zeros, which is common in microbiome (OTU/ASV tables), RNA-seq, and ecological species abundance matrices. Standard correlation methods can produce misleading results with such data and should be applied with care.

Why Standard Correlation Falls Short

Three issues arise with zero-inflated count data:

Distributional assumptions. Pearson correlation measures linear association, and its significance tests assume approximate bivariate normality. Count data with many zeros are strongly right-skewed, so a handful of large values can dominate the coefficient.

Compositionality. When data are relative abundances (proportions summing to 1), any increase in one taxon mathematically forces others to decrease. This creates spurious negative correlations between otherwise independent features.
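The effect of closure can be seen in a small simulation. This is a minimal sketch using made-up data: the sample size, number of taxa, and Poisson mean are arbitrary assumptions. The taxa are generated independently, yet converting each sample to proportions pushes the pairwise correlations below zero (with five equally abundant taxa, towards roughly -0.25).

```r
set.seed(42)
n_samples <- 200
n_taxa <- 5

# Independent counts: by construction no taxon depends on any other
counts <- matrix(rpois(n_samples * n_taxa, lambda = 50),
                 nrow = n_samples, ncol = n_taxa)

# Closure: convert each sample (row) to proportions summing to 1
props <- counts / rowSums(counts)

# Mean pairwise correlation between taxa, before and after closure
off_diag <- function(m) m[upper.tri(m)]
mean_raw  <- mean(off_diag(cor(counts)))
mean_prop <- mean(off_diag(cor(props)))

print(c(raw = mean_raw, proportions = mean_prop))
```

The bias shrinks as the number of taxa grows, but it does not disappear when a few taxa dominate the community.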

Spurious correlations from rare features. Two rare OTUs that happen to co-occur in just a few samples can appear highly correlated even if the association is meaningless.
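A toy example makes this concrete. The two OTUs below are hypothetical: both are absent from 48 of 50 samples and detected only in the same two. Pearson and Spearman both report a near-perfect association, although the evidence rests on two samples. Checking prevalence before correlating exposes the problem.

```r
n_samples <- 50
otu_a <- rep(0, n_samples)
otu_b <- rep(0, n_samples)

# Both OTUs are detected only in samples 1 and 2
otu_a[1:2] <- c(12, 30)
otu_b[1:2] <- c(8, 25)

# The shared zeros dominate both coefficients
r_pearson  <- cor(otu_a, otu_b, method = "pearson")
r_spearman <- cor(otu_a, otu_b, method = "spearman")

# Prevalence: fraction of samples in which each OTU is present
prevalence <- c(a = mean(otu_a > 0), b = mean(otu_b > 0))

print(c(pearson = r_pearson, spearman = r_spearman))
print(prevalence)
```

Note that the rank-based coefficient is no protection here, which is why rare features are usually removed before any correlation is computed.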

Transformations

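Transforming the data before computing correlations is often sufficient for moderately sparse datasets. The examples below assume that otu_table and count_matrix are numeric matrices with samples in rows and features in columns, so that cor() returns feature-by-feature correlations.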

Log transformation with a pseudocount handles skewness:

otu_log <- log(otu_table + 1)
cor(otu_log, method = "pearson")

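Centered log-ratio (CLR) is a standard transformation for compositional data. It expresses each feature relative to the geometric mean of its sample, which moves the data out of the simplex. CLR values still sum to zero within each sample, so some negative bias remains when there are only a few features:

library(compositions)
# clr() treats each row (sample) as one composition; the pseudocount avoids log(0)
otu_clr <- clr(otu_table + 1)
cor(otu_clr, method = "pearson")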

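Variance stabilising transformation (VST) is recommended for RNA-seq count data. Note that vst() fits the dispersion trend on a subset of genes (1000 by default), so for smaller matrices varianceStabilizingTransformation() is the usual alternative:

library(DESeq2)
# DESeq2 expects genes in rows and samples in columns, hence t();
# the rows of sample_data must describe the samples in the same order
dds <- DESeqDataSetFromMatrix(countData = t(count_matrix),
                               colData = sample_data,
                               design = ~ 1)
# blind = TRUE ignores the design when estimating dispersions
vsd <- vst(dds, blind = TRUE)
# Transpose back so that cor() compares genes (columns)
cor(t(assay(vsd)), method = "pearson")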

Robust Correlation Methods

When transformation is not appropriate or sufficient, rank-based methods are more robust to skewness and outliers:

# Spearman: generally the safest default for count data
cor(otu_table, method = "spearman")

# Kendall: better for small samples or many tied ranks
cor(otu_table, method = "kendall")
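One practical consequence of working with ranks is that a monotone transformation such as log(x + 1) does not change the result, so there is no need to log-transform before a rank-based correlation. The sketch below checks this on simulated zero-inflated counts; the data-generating choices are arbitrary assumptions made for illustration.

```r
set.seed(1)
# Hypothetical zero-inflated counts for two features across 100 samples
x <- rpois(100, lambda = 3) * rbinom(100, size = 1, prob = 0.4)
y <- x + rpois(100, lambda = 1)

# Spearman depends only on ranks, which log1p() leaves unchanged
rho_raw <- cor(x, y, method = "spearman")
rho_log <- cor(log1p(x), log1p(y), method = "spearman")

# Pearson works on the values themselves, so it is affected by the transform
r_raw <- cor(x, y, method = "pearson")
r_log <- cor(log1p(x), log1p(y), method = "pearson")

print(c(spearman_raw = rho_raw, spearman_log = rho_log))
print(c(pearson_raw = r_raw, pearson_log = r_log))
```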

Specialised Methods for Microbiome Data

For network analysis or correlation inference on highly sparse compositional data, dedicated methods are generally preferred over general-purpose transformations:

SparCC: sparse correlations for compositional data (available as sparcc() in the SpiecEasi package)
SPIEC-EASI: sparse inverse covariance estimation for ecological association inference (SpiecEasi package)
propr: proportionality analysis as an alternative to correlation for compositional data (propr package)

These are beyond the scope of this course but are worth knowing about for microbiome-specific analyses.

Recommended Workflow by Data Type

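Microbiome (OTU/ASV): Remove rare features (for example, those present in fewer than 10% of samples). Then either apply the CLR transformation and compute Spearman correlations, or run SparCC on the filtered counts, since SparCC applies its own log-ratio transformation.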
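The filter, CLR, and Spearman steps of the microbiome workflow can be sketched in base R as follows. The OTU table is simulated, and the 10% prevalence cutoff and the pseudocount of 1 are illustrative choices rather than fixed rules. CLR is computed by hand here (log counts minus the per-sample mean of the log counts) instead of through the compositions package.

```r
set.seed(7)
# Hypothetical OTU table: 30 samples (rows) x 20 OTUs (columns);
# later columns are present in progressively fewer samples
present_prob <- seq(0.95, 0.02, length.out = 20)
otu_table <- sapply(present_prob, function(p) {
  rnbinom(30, mu = 50, size = 1) * rbinom(30, size = 1, prob = p)
})
colnames(otu_table) <- paste0("OTU", seq_len(ncol(otu_table)))

# 1. Prevalence filter: keep OTUs present in at least 10% of samples
prevalence <- colMeans(otu_table > 0)
otu_filt <- otu_table[, prevalence >= 0.10, drop = FALSE]

# 2. CLR with a pseudocount: subtract each sample's mean log count
log_counts <- log(otu_filt + 1)
otu_clr <- log_counts - rowMeans(log_counts)

# 3. Spearman correlation between the remaining OTUs
cor_mat <- cor(otu_clr, method = "spearman")

print(dim(otu_table))
print(dim(cor_mat))
```

Filtering before the CLR step matters, because the geometric mean of each sample is computed over whichever features remain.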

RNA-seq: Remove low-count genes, apply VST or rlog (DESeq2), and compute Pearson correlations on the transformed values. For differential expression, run DESeq2 or edgeR on the raw counts rather than on the transformed values.
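The low-count filter in the RNA-seq workflow can be as simple as requiring a minimum count in a minimum number of samples. The sketch below uses a simulated matrix with samples in rows; the thresholds (at least 10 reads in at least 3 samples) are assumptions for illustration and should be adapted to the sequencing depth and group sizes of the experiment.

```r
set.seed(3)
# Hypothetical count matrix: 12 samples (rows) x 200 genes (columns);
# the first 100 genes are barely expressed, the last 100 are well expressed
gene_means <- rep(c(1, 100), each = 100)
count_matrix <- sapply(gene_means, function(m) rnbinom(12, mu = m, size = 2))
colnames(count_matrix) <- paste0("gene", seq_len(ncol(count_matrix)))

# Keep genes with at least min_count reads in at least min_samples samples
min_count <- 10
min_samples <- 3
keep <- colSums(count_matrix >= min_count) >= min_samples
count_filt <- count_matrix[, keep, drop = FALSE]

print(c(before = ncol(count_matrix), after = ncol(count_filt)))
```

The filtered matrix can then be transposed and passed to DESeq2 for the variance stabilising transformation.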

Ecological species abundance: Remove rare species. For ordination, either use Bray-Curtis dissimilarity or apply the Hellinger transformation before PCA. Use PERMANOVA for group comparisons.
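The Hellinger step is easy to write out by hand: it is the square root of each species' relative abundance within a site. The sketch below applies it to a simulated community matrix (sites in rows, an assumption) and runs a PCA with base R. In practice the vegan package offers decostand(), vegdist(), and adonis2() for the Hellinger transformation, Bray-Curtis dissimilarity, and PERMANOVA respectively.

```r
set.seed(11)
# Hypothetical community matrix: 20 sites (rows) x 8 species (columns),
# with species ordered from common to rare
comm <- matrix(rpois(20 * 8, lambda = rep(c(20, 10, 5, 2), each = 40)),
               nrow = 20)
colnames(comm) <- paste0("sp", seq_len(ncol(comm)))

# Hellinger transformation: square root of within-site relative abundances
comm_hel <- sqrt(comm / rowSums(comm))

# PCA on the transformed data; no scaling, so the transformation
# alone determines how much weight each species receives
pca <- prcomp(comm_hel, center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(round(var_explained[1:2], 3))
```

Each transformed site vector has unit length, so Euclidean distances between sites equal Hellinger distances, which is what makes PCA appropriate after this step.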