Method Selection

With many multivariate methods available, the right choice depends on three questions: what type of data do you have, what is your goal, and do you have group labels?

Master Decision Tree

graph TD
    A[Multivariate data] --> B{Group labels?}

    B -->|No| C{Goal?}
    B -->|Yes| D{Outcome type?}

    C -->|Reduce dimensions| E{Data type?}
    C -->|Find groups| F[Clustering]
    C -->|Visualise communities| G{Environmental data?}

    E -->|Continuous| H[PCA]
    E -->|Counts or compositional| I[PCoA or NMDS]

    G -->|No| I
    G -->|Yes| J[RDA or CCA]

    D -->|Binary| K{Interpretability needed?}
    D -->|Multiple groups| L{Data type?}

    K -->|Probabilities and coefficients| M[Logistic Regression]
    K -->|Classification only| N{Normal, equal covariance?}
    N -->|Yes| O[LDA]
    N -->|No| M

    L -->|Normal, equal covariance| O
    L -->|Ecological or microbiome data| P[Ordination + PERMANOVA]

    style H fill:#4CAF50
    style I fill:#4CAF50
    style F fill:#FF9800
    style M fill:#2196F3
    style O fill:#2196F3
    style J fill:#9C27B0
    style P fill:#9C27B0
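
The branches of the master tree can also be written as a small lookup function, which makes the order of the questions explicit. This is only a sketch: `choose_method()` and all of its argument names are hypothetical, invented for illustration, and it returns `NA` where the tree has no branch.

```r
# Hypothetical helper mirroring the master decision tree above.
# Function name, argument names and allowed values are illustrative assumptions.
choose_method <- function(has_labels,
                          goal = c("reduce", "cluster", "visualise"),
                          data_type = c("continuous", "counts"),
                          has_env = FALSE,
                          outcome = c("binary", "multiple"),
                          need_probabilities = FALSE,
                          lda_assumptions_ok = FALSE,
                          ecological = FALSE) {
  goal <- match.arg(goal)
  data_type <- match.arg(data_type)
  outcome <- match.arg(outcome)

  if (!has_labels) {
    # Unsupervised side of the tree
    if (goal == "cluster") return("Clustering")
    if (goal == "visualise") {
      return(if (has_env) "RDA or CCA" else "PCoA or NMDS")
    }
    return(if (data_type == "continuous") "PCA" else "PCoA or NMDS")
  }

  # Supervised side of the tree
  if (outcome == "binary") {
    if (need_probabilities) return("Logistic Regression")
    return(if (lda_assumptions_ok) "LDA" else "Logistic Regression")
  }
  if (lda_assumptions_ok) return("LDA")
  if (ecological) return("Ordination + PERMANOVA")
  NA_character_  # the tree gives no branch for this combination
}

choose_method(has_labels = FALSE, goal = "visualise", has_env = TRUE)
choose_method(has_labels = TRUE, outcome = "binary", lda_assumptions_ok = FALSE)
```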

Unsupervised Methods

graph TD
    A[No group labels] --> B{Goal?}

    B -->|Dimension reduction| C{Data type?}
    B -->|Find clusters| D{Number of clusters known?}
    B -->|Visualise communities| E{Environmental variables?}

    C -->|Continuous, normal| F[PCA]
    C -->|Counts or compositional| G[PCoA or NMDS]

    D -->|Yes| H[K-means]
    D -->|No| I[Hierarchical clustering]

    E -->|No| G
    E -->|Yes| J[RDA or CCA]

    style F fill:#4CAF50
    style G fill:#4CAF50
    style H fill:#FF9800
    style I fill:#FF9800
    style J fill:#9C27B0
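
The clustering branch depends on whether the number of clusters is fixed in advance. The sketch below contrasts the two options on R's built-in `USArrests` data; the dataset, the choice of `k = 4` and Ward linkage are assumptions made for illustration, not recommendations from this guide.

```r
# Standardise so every variable contributes equally to Euclidean distance
x <- scale(USArrests)

# K-means: the number of clusters must be supplied up front (k = 4 is arbitrary)
set.seed(42)
km <- kmeans(x, centers = 4, nstart = 25)

# Hierarchical clustering: build the full tree first, choose where to cut later
hc <- hclust(dist(x), method = "ward.D2")
hc_groups <- cutree(hc, k = 4)

# Cross-tabulate the two partitions to see how far they agree
table(kmeans = km$cluster, hclust = hc_groups)
```

Because the dendrogram is built once, `cutree()` can be called again with a different `k` without refitting, which is the practical advantage when the number of clusters is unknown.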

Supervised Methods

graph TD
    A[Group labels available] --> B{Outcome type?}

    B -->|Binary| C{Primary goal?}
    B -->|Multiple groups| D{Data type?}

    C -->|Probability estimates| E[Logistic Regression]
    C -->|Classification| F{Normal, equal covariance?}

    F -->|Yes| G[LDA]
    F -->|No| E

    D -->|Normal, equal covariance| G
    D -->|Many predictors p >> n| H[LASSO Logistic Regression]
    D -->|Ecological or microbiome| I[Ordination + PERMANOVA]

    style E fill:#2196F3
    style G fill:#2196F3
    style H fill:#FF5722
    style I fill:#9C27B0
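
For the binary branch, the difference between the two methods is easiest to see side by side. The following is a minimal sketch on R's built-in `mtcars` data, with transmission type standing in for a two-class outcome; the dataset and the predictors `wt` and `hp` are illustrative assumptions.

```r
library(MASS)  # provides lda(); MASS ships with standard R installations

d <- mtcars
d$am <- factor(d$am, labels = c("automatic", "manual"))

# Logistic regression: coefficients are log-odds, predictions are probabilities
logit_fit <- glm(am ~ wt + hp, data = d, family = binomial)
p_manual <- predict(logit_fit, newdata = d, type = "response")
exp(coef(logit_fit))  # odds ratios

# LDA: assumes normally distributed predictors with a shared covariance matrix
lda_fit <- lda(am ~ wt + hp, data = d)
lda_class <- as.character(predict(lda_fit, newdata = d)$class)

# In-sample agreement between the two classifiers (optimistic, not a test-set estimate)
logit_class <- ifelse(p_manual > 0.5, "manual", "automatic")
mean(logit_class == lda_class)
```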

By Data Type

Data type	Transformation	Recommended method	Distance
Continuous environmental variables	`scale()`	PCA	Euclidean
Species abundance	Hellinger	NMDS or RDA	Euclidean on transformed
Microbiome OTU / ASV	CLR for PCA; none (raw or relative abundance) for NMDS	Aitchison PCA or NMDS	Euclidean on CLR (Aitchison) or Bray-Curtis on untransformed abundances
RNA-seq counts	VST (DESeq2)	PCA then clustering or LDA	Euclidean
Single-cell RNA-seq	PCA scores	Graph-based clustering	Graph
Presence / absence	None	NMDS	Jaccard
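
Several of the transformations in this table are one-liners in base R. The sketch below applies them to a made-up count matrix; the data values and the CLR pseudocount of 0.5 are assumptions chosen only for illustration.

```r
# Toy count matrix: 4 samples (rows) x 3 taxa (columns); values are made up
counts <- matrix(c(10, 0, 5,
                    3, 7, 0,
                    0, 2, 8,
                    6, 6, 6),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("s", 1:4), paste0("taxon", 1:3)))

# Hellinger: square root of row-wise relative abundances
hellinger <- sqrt(counts / rowSums(counts))

# CLR: log of each value minus the row mean of the logs.
# Zeros need a pseudocount first; 0.5 is an arbitrary assumption.
log_counts <- log(counts + 0.5)
clr <- log_counts - rowMeans(log_counts)

# Presence / absence with Jaccard distance (dist() calls this "binary")
pa <- (counts > 0) * 1
jaccard <- dist(pa, method = "binary")

# Euclidean distance on Hellinger-transformed data, ready for ordination
hellinger_dist <- dist(hellinger)
```

After the Hellinger transformation each row has unit squared length, and after CLR each row sums to zero, which gives two quick sanity checks on real data.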

By Research Question

Question	Method	Key function
What structure is in my data?	PCA or NMDS	`prcomp()`, `metaMDS()`
Are my samples naturally grouped?	Hierarchical clustering	`hclust()`
Are my groups significantly different?	PERMANOVA	`adonis2()`
Which variables drive the separation?	PCA loadings or LDA scaling	`$rotation`, `$scaling`
Which environmental variables explain community composition?	RDA or CCA	`rda()`, `cca()`
Can I predict group membership for new samples?	LDA or logistic regression	`lda()`, `glm()`
What if I have more predictors than samples?	LASSO, or PCA then LDA	`cv.glmnet()`, `prcomp()` then `lda()`
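
To show where the `$rotation` and `$scaling` components named in the table come from, here is a short sketch using R's built-in `iris` data. The dataset is an assumption chosen for convenience; `lda()` is taken from the MASS package.

```r
library(MASS)  # provides lda()

X <- iris[, 1:4]

# PCA on standardised variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)   # proportion of variance per component
pca$rotation[, 1:2]       # loadings: contribution of each variable to PC1 and PC2

# LDA: coefficients of the discriminant functions separating the species
lda_fit <- lda(Species ~ ., data = iris)
lda_fit$scaling
```

PCA loadings describe the directions of greatest overall variance, whereas LDA scaling describes the directions that best separate the labelled groups, so the two need not rank the variables in the same order.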

By Scientific Field

Microbiome

Question	Method
Visualise community structure	NMDS with Bray-Curtis
Test treatment or group effects	PERMANOVA
Relate community to environment	db-RDA
Predict disease from microbiome	LASSO logistic regression on CLR data

Genomics and Transcriptomics

Question	Method
Detect outliers and batch effects	PCA on VST counts
Find co-expression modules	Hierarchical clustering with correlation distance
Classify subtypes	PCA then LDA, or LASSO
Differential expression	DESeq2 or edgeR (not covered here)

Ecology

Question	Method
Ordinate community data	NMDS
Test habitat or treatment differences	PERMANOVA
Constrained ordination	RDA or CCA
Partition environmental variance	`varpart()`
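
To make the logic of PERMANOVA concrete, the sketch below computes a one-way pseudo-F statistic and a permutation p-value from a distance matrix using base R only. It is a from-scratch illustration, not the implementation behind `adonis2()`; the function name `permanova_oneway()` is hypothetical, and Euclidean distance on `iris` is used only because base R has no Bray-Curtis distance.

```r
# Hypothetical from-scratch one-way PERMANOVA (not vegan's adonis2 implementation)
permanova_oneway <- function(d, groups, n_perm = 999) {
  d2 <- as.matrix(d)^2
  groups <- factor(groups)
  n <- nrow(d2)
  a <- nlevels(groups)

  # Sum of squared distances over unique pairs, divided by the number of points
  ss <- function(idx) {
    sub <- d2[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]) / sum(idx)
  }
  ss_total <- ss(rep(TRUE, n))

  within_ss <- function(g) {
    sum(vapply(levels(g), function(lv) ss(g == lv), numeric(1)))
  }
  f_stat <- function(ss_w) {
    ((ss_total - ss_w) / (a - 1)) / (ss_w / (n - a))
  }

  ss_w_obs <- within_ss(groups)
  f_obs <- f_stat(ss_w_obs)

  # Null distribution: shuffle group labels, recompute pseudo-F
  f_perm <- replicate(n_perm, f_stat(within_ss(sample(groups))))

  list(pseudo_F = f_obs,
       R2 = (ss_total - ss_w_obs) / ss_total,
       p = (sum(f_perm >= f_obs) + 1) / (n_perm + 1))
}

set.seed(1)
d <- dist(scale(iris[, 1:4]))
res <- permanova_oneway(d, iris$Species, n_perm = 199)
res
```

The smallest attainable p-value is `1 / (n_perm + 1)`, which is one reason very small designs give little resolution.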

Clinical and Medical

Question	Method
Predict binary disease status	Logistic regression
Classify patient subtypes	LDA
Feature selection from many biomarkers	LASSO
Evaluate and compare predictive models	ROC / AUC with cross-validation
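
Cross-validated AUC can be computed without extra packages. The sketch below uses 5-fold cross-validation on the built-in `mtcars` data, with `am` standing in for a binary disease status; the dataset, the predictors and the hypothetical `auc()` helper are all assumptions for illustration.

```r
# AUC from the rank-sum (Mann-Whitney) identity; y is 0/1, score is numeric
auc <- function(y, score) {
  r <- rank(score)  # average ranks handle ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
d <- mtcars[, c("am", "wt", "hp")]
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(d)))

# Collect out-of-fold predicted probabilities so every sample is scored
# by a model that never saw it. With folds this small, glm() may warn
# about fitted probabilities of 0 or 1.
oof <- numeric(nrow(d))
for (i in seq_len(k)) {
  test <- fold == i
  fit <- glm(am ~ wt + hp, data = d[!test, ], family = binomial)
  oof[test] <- predict(fit, newdata = d[test, ], type = "response")
}

cv_auc <- auc(d$am, oof)
cv_auc
```

Pooling out-of-fold predictions before computing a single AUC is one common convention; averaging per-fold AUCs is another, and the two can differ on small samples.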

Method Comparison

Method	Supervised	Group labels	Main output	Key assumption
PCA	No	No	Components and loadings	Linear relationships, Euclidean distance
Hierarchical clustering	No	No	Dendrogram	Distance metric appropriate
K-means	No	No	Cluster assignments	Spherical clusters, Euclidean distance
NMDS	No	No	2D ordination	Rank order of distances preserved
PCoA	No	No	Axes and % variance	Distance metric valid
RDA	No	Optional	Constrained axes	Linear responses, Euclidean distance (Hellinger transformation recommended for abundances)
CCA	No	Optional	Constrained axes	Unimodal species responses
LDA	Yes	Yes	Discriminant functions	Normality, equal covariance
Logistic regression	Yes	Yes	Probabilities and odds ratios	Linearity of logit
PERMANOVA	Yes	Yes	R², p-value	Exchangeable observations, similar within-group dispersion

Assumption Violations

Violation	Affected methods	Solution
Non-normal distributions	PCA, LDA	Transform data or use NMDS / logistic regression
Unequal covariances between groups	LDA	Use QDA or logistic regression
Compositional data (relative abundances)	PCA, Euclidean distances	CLR transformation before PCA; Bray-Curtis for NMDS
Many zeros	Euclidean-based methods	Filter rare features; use Bray-Curtis or Jaccard
More predictors than samples	LDA, standard logistic regression	PCA first, then LDA; LASSO logistic regression
Spatial or temporal autocorrelation	PERMANOVA	Use a blocked (restricted) permutation design

Sample Size Guidelines

Situation	Recommendation
n < 20	Avoid LDA; prefer logistic regression or NMDS with permutation tests
p > n	Regularisation (LASSO, ridge) or PCA dimensionality reduction before LDA
p >> n (p > 10n)	LASSO or elastic net is essential; unpenalised regression will overfit
Fewer than 5 samples per group	PERMANOVA p-values unreliable; treat as exploratory
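
The "PCA before LDA" recommendation for p > n can be sketched on simulated data. Everything below (sample size, number of predictors, effect size, and keeping 5 components) is an assumption chosen for illustration rather than a rule from this guide.

```r
library(MASS)  # provides lda()

set.seed(123)
n <- 40
p <- 200  # more predictors than samples
group <- factor(rep(c("A", "B"), each = n / 2))

# Pure noise, plus a simulated shift in the first 10 features for group B
X <- matrix(rnorm(n * p), nrow = n)
X[group == "B", 1:10] <- X[group == "B", 1:10] + 1.5

# LDA on all 200 columns would be ill-posed; compress to a few PCs first
pca <- prcomp(X, center = TRUE, scale. = TRUE)
n_pcs <- 5  # arbitrary choice for this sketch
scores <- as.data.frame(pca$x[, 1:n_pcs])
scores$group <- group

fit <- lda(group ~ ., data = scores)
accuracy <- mean(predict(fit, newdata = scores)$class == group)
accuracy  # in-sample accuracy, which is optimistic
```

For an honest error estimate, both the PCA and the LDA would need to be refitted inside each cross-validation fold, and the number of components treated as a tuning parameter.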