Method Selection

With many multivariate methods available, the right choice depends on three questions: what type of data do you have, what is your goal, and do you have group labels?


Master Decision Tree

graph TD
    A[Multivariate data] --> B{Group labels?}

    B -->|No| C{Goal?}
    B -->|Yes| D{Outcome type?}

    C -->|Reduce dimensions| E{Data type?}
    C -->|Find groups| F[Clustering]
    C -->|Visualise communities| G{Environmental data?}

    E -->|Continuous| H[PCA]
    E -->|Counts or compositional| I[PCoA or NMDS]

    G -->|No| I
    G -->|Yes| J[RDA or CCA]

    D -->|Binary| K{Interpretability needed?}
    D -->|Multiple groups| L{Data type?}

    K -->|Probabilities and coefficients| M[Logistic Regression]
    K -->|Classification only| N{Normal, equal covariance?}
    N -->|Yes| O[LDA]
    N -->|No| M

    L -->|Normal, equal covariance| O
    L -->|Ecological or microbiome data| P[Ordination + PERMANOVA]

    style H fill:#4CAF50
    style I fill:#4CAF50
    style F fill:#FF9800
    style M fill:#2196F3
    style O fill:#2196F3
    style J fill:#9C27B0
    style P fill:#9C27B0

Unsupervised Methods

graph TD
    A[No group labels] --> B{Goal?}

    B -->|Dimension reduction| C{Data type?}
    B -->|Find clusters| D{Number of clusters known?}
    B -->|Visualise communities| E{Environmental variables?}

    C -->|Continuous, normal| F[PCA]
    C -->|Counts or compositional| G[PCoA or NMDS]

    D -->|Yes| H[K-means]
    D -->|No| I[Hierarchical clustering]

    E -->|No| G
    E -->|Yes| J[RDA or CCA]

    style F fill:#4CAF50
    style G fill:#4CAF50
    style H fill:#FF9800
    style I fill:#FF9800
    style J fill:#9C27B0
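
The clustering branch of the tree can be illustrated with a minimal base R sketch. It assumes the built-in `iris` measurements as stand-in data and assumes three clusters purely for illustration; neither choice comes from this guide.

```r
# Standardise the four numeric iris measurements so no variable dominates.
x <- scale(iris[, 1:4])

# K-means: the number of clusters must be chosen up front (k = 3 assumed here).
set.seed(42)                       # k-means starts from random centres
km <- kmeans(x, centers = 3, nstart = 25)

# Hierarchical clustering: build the full tree first, cut it afterwards.
hc <- hclust(dist(x), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate the two partitions to see how far they agree.
table(kmeans = km$cluster, hclust = hc_groups)
```

When the number of clusters is unknown, the dendrogram from `plot(hc)` can be inspected before choosing where to cut, which is why the tree sends that case to hierarchical clustering.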

Supervised Methods

graph TD
    A[Group labels available] --> B{Outcome type?}

    B -->|Binary| C{Primary goal?}
    B -->|Multiple groups| D{Data type?}

    C -->|Probability estimates| E[Logistic Regression]
    C -->|Classification| F{Normal, equal covariance?}

    F -->|Yes| G[LDA]
    F -->|No| E

    D -->|Normal, equal covariance| G
    D -->|Many predictors p >> n| H[LASSO Logistic Regression]
    D -->|Ecological or microbiome| I[Ordination + PERMANOVA]

    style E fill:#2196F3
    style G fill:#2196F3
    style H fill:#FF5722
    style I fill:#9C27B0
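
For the binary branch, the following sketch contrasts logistic regression and LDA on the same outcome. The `mtcars` data set and the predictors `wt` and `hp` are assumptions chosen only because they ship with R; `lda()` comes from the MASS package.

```r
library(MASS)  # provides lda(); MASS is a recommended package bundled with R

# Binary outcome: transmission type in mtcars (am: 0 = automatic, 1 = manual).
# Logistic regression gives coefficients on the log-odds scale and
# fitted probabilities.
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)
prob_manual <- predict(fit_glm, type = "response")
exp(coef(fit_glm))                 # odds ratios

# LDA gives class assignments and assumes normal predictors with a
# covariance matrix shared across groups.
fit_lda <- lda(factor(am) ~ wt + hp, data = mtcars)
pred_lda <- predict(fit_lda)$class

# Compare the two classifiers on the training data (optimistic: no hold-out).
table(glm = as.integer(prob_manual > 0.5), lda = pred_lda)
```

Agreement on training data says little about performance on new samples, so a comparison like this is a starting point rather than an evaluation.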

By Data Type

| Data type | Transformation | Recommended method | Distance |
|---|---|---|---|
| Continuous environmental variables | `scale()` | PCA | Euclidean |
| Species abundance | Hellinger | NMDS or RDA | Euclidean on transformed data |
| Microbiome OTU / ASV | CLR (for PCA); relative abundance (for NMDS) | Aitchison PCA or NMDS | Euclidean on CLR data (Aitchison), or Bray-Curtis on relative abundances |
| RNA-seq counts | VST (DESeq2) | PCA, then clustering or LDA | Euclidean |
| Single-cell RNA-seq | PCA scores | Graph-based clustering | Graph |
| Presence / absence | None | NMDS | Jaccard |
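
The transformations in the table are short enough to write out in base R. The sketch below uses an invented toy count matrix, and the pseudocount of 1 used to handle zeros before CLR is an assumption; other zero-handling strategies exist.

```r
# Toy count matrix: 3 samples (rows) x 4 taxa (columns). Values are invented.
counts <- matrix(c(10,  0,  5, 85,
                   20, 30,  0, 50,
                    1,  4, 15, 80),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("s", 1:3), paste0("taxon", 1:4)))

# Hellinger: square root of the row-wise relative abundances.
hellinger <- sqrt(counts / rowSums(counts))

# CLR: log of each value minus the row mean of the logs.
# The pseudocount of 1 is an assumption made to avoid log(0).
log_counts <- log(counts + 1)
clr <- log_counts - rowMeans(log_counts)

# Euclidean distance on CLR data is the Aitchison distance.
aitchison <- dist(clr)

# Jaccard on presence/absence: method = "binary" treats non-zero as present.
jaccard <- dist(counts, method = "binary")
```

Bray-Curtis is not available in base `dist()`; it is usually computed with a dedicated ecology package on untransformed or relative abundances, not on CLR values, which can be negative.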

By Research Question

| Question | Method | Key function |
|---|---|---|
| What structure is in my data? | PCA or NMDS | `prcomp()`, `metaMDS()` |
| Are my samples naturally grouped? | Hierarchical clustering | `hclust()` |
| Are my groups significantly different? | PERMANOVA | `adonis2()` |
| Which variables drive the separation? | PCA loadings or LDA scaling | `$rotation`, `$scaling` |
| Which environmental variables explain community composition? | RDA or CCA | `rda()`, `cca()` |
| Can I predict group membership for new samples? | LDA or logistic regression | `lda()`, `glm()` |
| What if I have more predictors than samples? | LASSO, or PCA then LDA | `cv.glmnet()` |
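
As a small illustration of the first and fourth rows, the sketch below runs `prcomp()` and reads the loadings from `$rotation`. The built-in `USArrests` data set is an assumption used only because it needs no extra packages.

```r
# PCA on standardised variables (scale. = TRUE divides each by its SD).
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- summary(pca)$importance["Proportion of Variance", ]
var_explained

# Loadings: large absolute values show which variables drive each axis.
pca$rotation[, 1:2]

# Sample scores on the first two components, ready for plotting.
head(pca$x[, 1:2])
```

The vegan and glmnet functions in the table (`metaMDS()`, `adonis2()`, `rda()`, `cca()`, `cv.glmnet()`) require their packages to be installed and are not shown here.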

By Scientific Field

Microbiome

| Question | Method |
|---|---|
| Visualise community structure | NMDS with Bray-Curtis |
| Test treatment or group effects | PERMANOVA |
| Relate community to environment | db-RDA |
| Predict disease from microbiome | LASSO logistic regression on CLR data |

Genomics and Transcriptomics

| Question | Method |
|---|---|
| Detect outliers and batch effects | PCA on VST counts |
| Find co-expression modules | Hierarchical clustering with correlation distance |
| Classify subtypes | PCA then LDA, or LASSO |
| Differential expression | DESeq2 or edgeR (not covered here) |

Ecology

| Question | Method |
|---|---|
| Ordinate community data | NMDS |
| Test habitat or treatment differences | PERMANOVA |
| Constrained ordination | RDA or CCA |
| Partition environmental variance | Variation partitioning (`varpart()`) |

Clinical and Medical

| Question | Method |
|---|---|
| Predict binary disease status | Logistic regression |
| Classify patient subtypes | LDA |
| Feature selection from many biomarkers | LASSO |
| Evaluate and compare predictive models | ROC / AUC with cross-validation |
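
The last row, ROC / AUC with cross-validation, can be sketched in base R without extra packages. Everything below is an assumption for illustration: the data are simulated, the biomarker names `b1` and `b2` are invented, `auc()` is a hypothetical helper based on the rank-sum identity, and five folds is an arbitrary choice.

```r
set.seed(1)

# Simulated data: two biomarkers and a binary disease status.
n <- 200
dat <- data.frame(b1 = rnorm(n), b2 = rnorm(n))
dat$status <- rbinom(n, 1, plogis(1.5 * dat$b1 - 1.0 * dat$b2))

# AUC via the rank-sum (Mann-Whitney) identity.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# 5-fold cross-validation: each sample is scored by a model that never saw it.
folds <- sample(rep(1:5, length.out = n))
cv_prob <- numeric(n)
for (k in 1:5) {
  test_idx <- folds == k
  fit <- glm(status ~ b1 + b2, data = dat[!test_idx, ], family = binomial)
  cv_prob[test_idx] <- predict(fit, newdata = dat[test_idx, ], type = "response")
}

cv_auc <- auc(cv_prob, dat$status)   # cross-validated AUC
cv_auc
```

Scoring held-out samples avoids the optimism of an AUC computed on the same data used to fit the model, which matters most when comparing models of different complexity.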

Method Comparison

| Method | Supervised | Group labels required | Main output | Key assumption |
|---|---|---|---|---|
| PCA | No | No | Components and loadings | Linear relationships, Euclidean distance |
| Hierarchical clustering | No | No | Dendrogram | Appropriate distance metric |
| K-means | No | No | Cluster assignments | Spherical clusters, Euclidean distance |
| NMDS | No | No | 2D ordination | Rank order of distances preserved |
| PCoA | No | No | Axes and % variance | Valid distance metric |
| RDA | No | Optional | Constrained axes | Linear, Euclidean (Hellinger recommended) |
| CCA | No | Optional | Constrained axes | Unimodal species responses |
| LDA | Yes | Yes | Discriminant functions | Normality, equal covariance |
| Logistic regression | Yes | Yes | Probabilities and odds ratios | Linearity of logit |
| PERMANOVA | Yes | Yes | R², p-value | Exchangeable observations |

Assumption Violations

| Violation | Affected methods | Solution |
|---|---|---|
| Non-normal distributions | PCA, LDA | Transform data or use NMDS / logistic regression |
| Unequal covariances between groups | LDA | Use QDA or logistic regression |
| Compositional data (relative abundances) | PCA, Euclidean distances | CLR transformation before PCA; Bray-Curtis for NMDS |
| Many zeros | Euclidean-based methods | Filter rare features; use Bray-Curtis or Jaccard |
| More predictors than samples | LDA, standard logistic regression | PCA first, then LDA; LASSO logistic regression |
| Spatial or temporal autocorrelation | PERMANOVA | Use blocked permutation design |

Sample Size Guidelines

| Situation | Recommendation |
|---|---|
| n < 20 | Avoid LDA; prefer logistic regression or NMDS with permutation tests |
| p > n | Regularisation (LASSO, ridge) or PCA dimensionality reduction before LDA |
| p >> n (p > 10n) | LASSO or elastic net essential; standard regression will overfit |
| Fewer than 5 samples per group | PERMANOVA p-values unreliable; treat as exploratory |
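
These rules of thumb are easy to encode as a pre-analysis check. The function below is a hypothetical helper, not part of any package; its name, its arguments, and the choice to report only the stricter of the two p versus n rules are assumptions made for this sketch.

```r
# Hypothetical helper that turns the guidelines above into notes.
# n = number of samples, p = number of predictors,
# min_group = size of the smallest group (NA if there are no groups).
sample_size_notes <- function(n, p, min_group = NA) {
  notes <- character(0)
  if (n < 20) {
    notes <- c(notes, "n < 20: avoid LDA; prefer logistic regression or NMDS with permutation tests")
  }
  if (p > 10 * n) {
    notes <- c(notes, "p >> n: use LASSO or elastic net")
  } else if (p > n) {
    notes <- c(notes, "p > n: regularise (LASSO, ridge) or reduce dimensions with PCA before LDA")
  }
  if (!is.na(min_group) && min_group < 5) {
    notes <- c(notes, "fewer than 5 samples per group: treat PERMANOVA p-values as exploratory")
  }
  notes
}

# Example: 15 samples, 200 predictors, smallest group has 4 samples.
sample_size_notes(n = 15, p = 200, min_group = 4)
```

A design with ample samples and few predictors, such as `sample_size_notes(n = 100, p = 5)`, returns an empty character vector.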