Method Selection
With many multivariate methods available, the right choice depends on three questions: what type of data do you have, what is your goal, and do you have group labels?
Master Decision Tree
graph TD
A[Multivariate data] --> B{Group labels?}
B -->|No| C{Goal?}
B -->|Yes| D{Outcome type?}
C -->|Reduce dimensions| E{Data type?}
C -->|Find groups| F[Clustering]
C -->|Visualise communities| G{Environmental data?}
E -->|Continuous| H[PCA]
E -->|Counts or compositional| I[PCoA or NMDS]
G -->|No| I
G -->|Yes| J[RDA or CCA]
D -->|Binary| K{Interpretability needed?}
D -->|Multiple groups| L{Assumptions met?}
K -->|Probabilities and coefficients| M[Logistic Regression]
K -->|Classification only| N{Normal, equal covariance?}
N -->|Yes| O[LDA]
N -->|No| M
L -->|Normal, equal covariance| O
L -->|Ecological or microbiome data| P[Ordination + PERMANOVA]
style H fill:#4CAF50
style I fill:#4CAF50
style F fill:#FF9800
style M fill:#2196F3
style O fill:#2196F3
style J fill:#9C27B0
style P fill:#9C27B0
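The master tree can also be read as a chain of nested conditions. The sketch below encodes it as a small R function. The function name `choose_method` and its argument names are hypothetical, introduced only for illustration, and the fallback for multi-group outcomes is a simplification of the "Ecological or microbiome data" branch.

```r
# Hypothetical helper mirroring the master decision tree.
# Argument names and allowed values are illustrative assumptions.
choose_method <- function(has_labels,
                          goal = c("reduce", "cluster", "visualise"),
                          data_type = c("continuous", "counts"),
                          has_env = FALSE,
                          outcome = c("binary", "multiple"),
                          need_probabilities = TRUE,
                          lda_assumptions_met = FALSE) {
  goal <- match.arg(goal)
  data_type <- match.arg(data_type)
  outcome <- match.arg(outcome)

  if (!has_labels) {
    if (goal == "cluster") return("Clustering")
    if (goal == "reduce") {
      return(if (data_type == "continuous") "PCA" else "PCoA or NMDS")
    }
    # goal == "visualise": depends on whether environmental data exist
    return(if (has_env) "RDA or CCA" else "PCoA or NMDS")
  }

  if (outcome == "binary") {
    if (need_probabilities) return("Logistic regression")
    return(if (lda_assumptions_met) "LDA" else "Logistic regression")
  }

  # Multiple groups: LDA when its assumptions hold; otherwise this sketch
  # falls back to the ecological / microbiome branch of the tree.
  if (lda_assumptions_met) "LDA" else "Ordination + PERMANOVA"
}

choose_method(has_labels = FALSE, goal = "reduce", data_type = "counts")
choose_method(has_labels = TRUE, outcome = "binary", need_probabilities = FALSE,
              lda_assumptions_met = TRUE)
```

A function like this is only a mnemonic for the diagram; real choices usually also depend on sample size and on how badly the assumptions are violated.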
Unsupervised Methods
graph TD
A[No group labels] --> B{Goal?}
B -->|Dimension reduction| C{Data type?}
B -->|Find clusters| D{Number of clusters known?}
B -->|Visualise communities| E{Environmental variables?}
C -->|Continuous, normal| F[PCA]
C -->|Counts or compositional| G[PCoA or NMDS]
D -->|Yes| H[K-means]
D -->|No| I[Hierarchical clustering]
E -->|No| G
E -->|Yes| J[RDA or CCA]
style F fill:#4CAF50
style G fill:#4CAF50
style H fill:#FF9800
style I fill:#FF9800
style J fill:#9C27B0
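As a rough illustration of the clustering and dimension-reduction branches, the sketch below uses only base R and the built-in USArrests data. The choice of dataset, of k = 4, and of average linkage are assumptions made for the example, not recommendations.

```r
X <- scale(USArrests)  # continuous variables measured on different scales

# Number of clusters known (k = 4 is an assumption for illustration): k-means.
set.seed(1)
km <- kmeans(X, centers = 4, nstart = 25)
table(km$cluster)

# Number of clusters unknown: build the full tree, inspect it, then cut.
hc <- hclust(dist(X), method = "average")
# plot(hc) draws the dendrogram
groups <- cutree(hc, k = 4)

# Dimension reduction for continuous data: PCA.
pca <- prcomp(X)

# PCoA accepts any dissimilarity via cmdscale(); with Euclidean distances
# its axes match the PCA scores up to sign.
pcoa <- cmdscale(dist(X), k = 2, eig = TRUE)
head(cbind(PCoA1 = pcoa$points[, 1], PC1 = pca$x[, 1]))
```

For count or compositional data the same `cmdscale()` call would be given a more suitable dissimilarity instead of `dist(X)`.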
Supervised Methods
graph TD
A[Group labels available] --> B{Outcome type?}
B -->|Binary| C{Primary goal?}
B -->|Multiple groups| D{Data type?}
C -->|Probability estimates| E[Logistic Regression]
C -->|Classification| F{Normal, equal covariance?}
F -->|Yes| G[LDA]
F -->|No| E
D -->|Normal, equal covariance| G
D -->|Many predictors p >> n| H[LASSO Logistic Regression]
D -->|Ecological or microbiome| I[Ordination + PERMANOVA]
style E fill:#2196F3
style G fill:#2196F3
style H fill:#FF5722
style I fill:#9C27B0
By Data Type
| Data type | Transformation | Recommended method | Distance |
|---|---|---|---|
| Continuous environmental variables | scale() | PCA | Euclidean |
| Species abundance | Hellinger | NMDS or RDA | Euclidean on transformed data |
| Microbiome OTU / ASV | CLR for Aitchison PCA; none or relative abundance for NMDS | NMDS or Aitchison PCA | Bray-Curtis on untransformed data (NMDS) or Euclidean on CLR data (Aitchison PCA) |
| RNA-seq counts | VST (DESeq2) | PCA then clustering or LDA | Euclidean |
| Single-cell RNA-seq | PCA scores | Graph-based clustering | Nearest-neighbour graph |
| Presence / absence | None | NMDS | Jaccard |
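The transformations in this table are short enough to write out by hand. The following base R sketch applies them to a made-up count matrix; the values and the pseudocount of 1 used before taking logs are assumptions for illustration.

```r
# Toy count matrix: 4 samples (rows) x 3 taxa (columns); values are made up.
counts <- matrix(c(10, 0, 5,
                    3, 7, 0,
                    0, 2, 8,
                    6, 6, 6),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("s", 1:4), paste0("taxon", 1:3)))

# Hellinger: square root of row-wise relative abundances.
hellinger <- sqrt(counts / rowSums(counts))

# CLR: log of each value minus the mean log within its sample.
# The pseudocount of 1 is an assumption to avoid log(0).
log_counts <- log(counts + 1)
clr <- log_counts - rowMeans(log_counts)

# Euclidean distance on CLR values is the Aitchison distance.
aitchison <- dist(clr, method = "euclidean")

# Jaccard dissimilarity on presence / absence: base R calls this "binary".
jaccard <- dist((counts > 0) + 0, method = "binary")

rowSums(hellinger^2)  # each row sums to 1
rowSums(clr)          # each row sums to (numerically) 0
```

Bray-Curtis is not available in base R's `dist()`; the vegan package provides it through `vegdist()`.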
By Research Question
| Question | Method | Key function |
|---|---|---|
| What structure is in my data? | PCA or NMDS | prcomp(), metaMDS() |
| Are my samples naturally grouped? | Hierarchical clustering | hclust() |
| Are my groups significantly different? | PERMANOVA | adonis2() |
| Which variables drive the separation? | PCA loadings or LDA scaling | $rotation, $scaling |
| Which environmental variables explain community composition? | RDA or CCA | rda(), cca() |
| Can I predict group membership for new samples? | LDA or logistic regression | lda(), glm() |
| I have more predictors than samples | LASSO or PCA then LDA | cv.glmnet() |
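Several of these questions can be tried on a built-in dataset. The sketch below uses iris with base R plus MASS (which ships with standard R installations); the dataset, Ward linkage, and k = 3 are assumptions chosen for the example.

```r
library(MASS)  # provides lda()

X <- scale(iris[, 1:4])  # centre and scale the four measurements

# What structure is in my data? PCA via prcomp().
pca <- prcomp(X)
summary(pca)$importance["Proportion of Variance", 1:2]

# Which variables drive the separation? PCA loadings.
pca$rotation[, 1:2]

# Are my samples naturally grouped? Hierarchical clustering.
hc <- hclust(dist(X), method = "ward.D2")
clusters <- cutree(hc, k = 3)  # k = 3 is an assumption for this dataset
table(clusters, iris$Species)

# Can I predict group membership? LDA, with discriminant coefficients.
fit <- lda(Species ~ ., data = iris)
fit$scaling
pred <- predict(fit, iris)$class
mean(pred == iris$Species)  # resubstitution accuracy, so optimistic
```

The accuracy in the last line is computed on the training data; an honest estimate would need held-out samples or cross-validation.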
By Scientific Field
Microbiome
| Question | Method |
|---|---|
| Visualise community structure | NMDS with Bray-Curtis |
| Test treatment or group effects | PERMANOVA |
| Relate community to environment | db-RDA |
| Predict disease from microbiome | LASSO logistic regression on CLR data |
Genomics and Transcriptomics
| Question | Method |
|---|---|
| Detect outliers and batch effects | PCA on VST counts |
| Find co-expression modules | Hierarchical clustering with correlation distance |
| Classify subtypes | PCA then LDA, or LASSO |
| Differential expression | DESeq2 or edgeR (not covered here) |
Ecology
| Question | Method |
|---|---|
| Ordinate community data | NMDS |
| Test habitat or treatment differences | PERMANOVA |
| Constrained ordination | RDA or CCA |
| Partition environmental variance | Variation partitioning with varpart() |
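A minimal sketch of this ecology workflow, assuming the vegan package is installed and using its bundled dune and dune.env datasets. The choice of explanatory variables (A1, Moisture, Management) is an assumption made for illustration, not a recommended model.

```r
library(vegan)  # assumed installed; dune and dune.env ship with vegan

data(dune)      # 20 sites x 30 species cover values
data(dune.env)  # site descriptors, including A1, Moisture, Management

set.seed(1)
# Ordinate community data: NMDS on Bray-Curtis dissimilarities.
nmds <- metaMDS(dune, distance = "bray", k = 2, trace = FALSE)
nmds$stress

# Test management differences: PERMANOVA.
perm <- adonis2(dune ~ Management, data = dune.env,
                method = "bray", permutations = 999)
perm

# Constrained ordination: RDA on Hellinger-transformed abundances.
dune_hel <- decostand(dune, method = "hellinger")
rda_fit <- rda(dune_hel ~ A1 + Moisture, data = dune.env)
RsquareAdj(rda_fit)

# Partition variance between two sets of explanatory variables.
vp <- varpart(dune_hel, ~ A1 + Moisture, ~ Management, data = dune.env)
vp
```

Swapping `rda()` for `cca()` gives the unimodal counterpart, in which case the untransformed abundances would normally be used.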
Clinical and Medical
| Question | Method |
|---|---|
| Predict binary disease status | Logistic regression |
| Classify patient subtypes | LDA |
| Feature selection from many biomarkers | LASSO |
| Evaluate and compare predictive models | ROC / AUC with cross-validation |
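To make "ROC / AUC with cross-validation" concrete, here is a base R sketch that fits a logistic regression in 5 folds and computes the AUC from held-out predictions. The two-class iris subset stands in for clinical data, and the predictors, the fold count, and the helper name `auc` are assumptions for illustration.

```r
# Two-class subset of iris as a stand-in for a binary disease outcome.
dat <- droplevels(subset(iris, Species != "setosa"))
dat$y <- as.integer(dat$Species == "virginica")

# AUC from the rank-sum (Mann-Whitney) identity.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(42)
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))
cv_prob <- numeric(nrow(dat))

for (i in seq_len(k)) {
  train    <- dat[fold != i, ]
  held_out <- dat[fold == i, ]
  fit <- glm(y ~ Sepal.Length + Sepal.Width, data = train, family = binomial)
  # Predicted probabilities for samples the model has not seen.
  cv_prob[fold == i] <- predict(fit, newdata = held_out, type = "response")
}

cv_auc <- auc(cv_prob, dat$y)
cv_auc

# Odds ratios from the model refitted on all data.
full_fit <- glm(y ~ Sepal.Length + Sepal.Width, data = dat, family = binomial)
exp(coef(full_fit))
```

Computing the AUC on pooled held-out predictions is one common convention; averaging per-fold AUCs is another and can give slightly different numbers.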
Method Comparison
| Method | Supervised | Uses group labels | Main output | Key assumption |
|---|---|---|---|---|
| PCA | No | No | Components and loadings | Linear relationships, Euclidean distance |
| Hierarchical clustering | No | No | Dendrogram | Distance metric appropriate for the data |
| K-means | No | No | Cluster assignments | Spherical clusters, Euclidean distance |
| NMDS | No | No | Low-dimensional ordination (usually 2D) and stress | Rank order of distances preserved |
| PCoA | No | No | Axes and % variance | Distance metric valid |
| RDA | No | Optional | Constrained axes | Linear responses, Euclidean distance (Hellinger transformation recommended) |
| CCA | No | Optional | Constrained axes | Unimodal species responses |
| LDA | Yes | Yes | Discriminant functions | Normality, equal covariance |
| Logistic regression | Yes | Yes | Probabilities and odds ratios | Linearity of logit |
| PERMANOVA | Yes | Yes | R2, p-value | Exchangeable observations; similar dispersion among groups |
Assumption Violations
| Violation | Affected methods | Solution |
|---|---|---|
| Non-normal distributions | PCA, LDA | Transform data or use NMDS / logistic regression |
| Unequal covariances between groups | LDA | Use QDA or logistic regression |
| Compositional data (relative abundances) | PCA, Euclidean distances | CLR transformation before PCA; Bray-Curtis for NMDS |
| Many zeros | Euclidean-based methods | Filter rare features; use Bray-Curtis or Jaccard |
| More predictors than samples | LDA, standard logistic regression | PCA first, then LDA; LASSO logistic regression |
| Spatial or temporal autocorrelation | PERMANOVA | Use blocked permutation design |
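A blocked permutation design can be sketched as follows, assuming vegan is installed (it attaches the permute package, which provides `how()`). The blocking factor here is hypothetical: the real dune data has no such spatial design, so the output only demonstrates the mechanics.

```r
library(vegan)  # assumed installed; attaches permute, which provides how()

data(dune)
data(dune.env)

# Hypothetical blocking factor for illustration only: pretend the 20 sites
# were sampled in 4 spatial blocks of 5 sites each.
block <- gl(4, 5)

# Restrict permutations so that samples are shuffled only within blocks.
ctrl <- how(nperm = 999, blocks = block)

set.seed(1)
blocked <- adonis2(dune ~ Management, data = dune.env,
                   method = "bray", permutations = ctrl)
blocked
```

Restricting permutations changes the null distribution, and therefore the p-value, but not the observed R2 or F statistic.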
Sample Size Guidelines
| Situation | Recommendation |
|---|---|
| n < 20 | Avoid LDA; prefer logistic regression or NMDS with permutation tests |
| p > n | Regularisation (LASSO, ridge) or PCA dimensionality reduction before LDA |
| p >> n (p > 10n) | LASSO or elastic net essential; standard regression will overfit |
| Fewer than 5 samples per group | PERMANOVA p-values unreliable; treat as exploratory |
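For the p > n rows, a minimal LASSO sketch with the glmnet package (assumed installed) on simulated data. The sample size, number of predictors, and the three signal-carrying predictors are all assumptions of the simulation.

```r
library(glmnet)  # assumed installed

set.seed(1)
n <- 40; p <- 200  # more predictors than samples
x <- matrix(rnorm(n * p), nrow = n)

# Simulated outcome: only the first 3 predictors carry signal (an assumption).
eta <- 2 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3]
y <- rbinom(n, size = 1, prob = plogis(eta))

# alpha = 1 gives the LASSO penalty; lambda is chosen by 5-fold CV.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)

# Predictors with non-zero coefficients at the CV-selected lambda.
b <- as.matrix(coef(cvfit, s = "lambda.min"))[, 1]
selected <- setdiff(names(b)[b != 0], "(Intercept)")
selected
```

Setting `alpha` between 0 and 1 gives the elastic net, and `alpha = 0` gives ridge regression, which shrinks coefficients without setting any to zero.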