Choosing the Right Method

The most important question before choosing a method is: what is your goal?

If you want to explain relationships, test hypotheses, or estimate effect sizes, multivariate statistical models are the right tool. If you want to predict outcomes from complex or high-dimensional data and interpretability is secondary, machine learning methods are more appropriate. Many analyses combine both.

The n/p ratio matters more than the raw number of variables

The ratio of observations to predictors (n/p) is a better guide than the absolute number of variables. A low n/p ratio increases the risk of overfitting regardless of whether you use classical statistics or machine learning, because a model with nearly as many free parameters as observations can fit noise as easily as signal.
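
A minimal sketch of this effect, assuming NumPy is available. The data are pure noise, and the sizes (n = 20 with p = 2, 10, and 19) are arbitrary choices for illustration, so any apparent fit is overfitting. The helper name in_sample_r2 is hypothetical.

```python
import numpy as np

def in_sample_r2(X, y):
    """R^2 of an ordinary least-squares fit (with intercept), scored on its own training data."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 20
X_noise = rng.normal(size=(n, 19))  # predictors unrelated to the response
y_noise = rng.normal(size=n)        # response is pure noise

# Use the first p columns so the models are nested and R^2 can only grow with p.
r2_by_p = {p: in_sample_r2(X_noise[:, :p], y_noise) for p in (2, 10, 19)}
for p, r2 in r2_by_p.items():
    print(f"p={p:2d}  n/p={n / p:5.2f}  in-sample R^2={r2:.3f}")
```

With 19 predictors plus an intercept there are as many parameters as observations, so the in-sample R^2 reaches 1 even though the response is unrelated to the predictors. That fit would not carry over to new data.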

Multivariate Statistical Models

Use these when the predictor count is modest relative to sample size (roughly n/p > 10), you need interpretable coefficients, p-values, or effect sizes, you are testing a specific hypothesis, or the model structure is guided by theory.

Methods covered in this course include PCA, clustering, LDA, ordination (NMDS, PCoA), constrained ordination (RDA, CCA, db-RDA), PERMANOVA, logistic regression, and regularised regression (LASSO, ridge, elastic net).

Other common methods in this family include MANOVA, canonical correlation analysis, and structural equation modelling.

Example: you are studying how five soil nutrients affect plant community composition across 80 sites. The n/p ratio is 80/5 = 16, above the rough threshold of 10, you have a clear hypothesis, and you want interpretable results. Use multivariate regression or RDA.

Machine Learning Models

Use these when the goal is prediction accuracy rather than parameter interpretation, relationships may be non-linear or involve high-order interactions, data are high-dimensional relative to sample size, or data are unstructured (images, sequences, text).

Common methods include random forests, gradient boosting (XGBoost), support vector machines, and neural networks. These models can achieve better predictive performance than statistical models when relationships are non-linear or involve interactions, but their internals are hard to read directly, so interpretation relies on post-hoc tools such as SHAP values.

Example: you have 10,000 gene expression features and want to predict disease status in a held-out cohort. Prediction accuracy is the priority. Use a machine learning model, tune it with cross-validation, and report its performance on a held-out test set.
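
The data partition behind this advice can be sketched in plain Python. This is only an illustration under stated assumptions: the function name split_holdout_and_folds, the 20% test fraction, the five folds, and the sample size of 200 are all hypothetical choices, not values from the course.

```python
import random

def split_holdout_and_folds(n_samples, test_fraction=0.2, n_folds=5, seed=0):
    """Return (test_indices, folds): a held-out test set plus k cross-validation folds over the rest."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    test_idx = indices[:n_test]
    dev_idx = indices[n_test:]
    # Deal development indices round-robin so fold sizes differ by at most one.
    folds = [dev_idx[i::n_folds] for i in range(n_folds)]
    return test_idx, folds

test_idx, folds = split_holdout_and_folds(n_samples=200)
for k, val_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # Fit on train_idx and score on val_idx here; test_idx is never touched in this loop.
    print(f"fold {k}: train={len(train_idx)} validate={len(val_idx)}")
print(f"held-out test set: {len(test_idx)} samples, scored once at the end")
```

Any step that looks at the outcome, such as selecting a subset of the 10,000 features, would need to be refit inside each training fold. Otherwise information from the validation fold leaks into the model and the cross-validated score is optimistic.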

Hybrid Approaches

Many analyses combine statistical modelling and machine learning. PCA or PLS can reduce dimensionality before feeding data into a machine learning model. LASSO and elastic net sit at the boundary: they are statistical regression models that perform variable selection and handle high-dimensional data similarly to ML methods. SHAP values and permutation importance can be used to interpret otherwise opaque ML models. Comparing a statistical and a machine learning model on the same data is a useful robustness check.
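
Permutation importance, mentioned above, is simple enough to sketch without any library. The version below is a simplified illustration, not a reference implementation: the function names, the toy model, and the synthetic data are all hypothetical, and accuracy is assumed as the performance metric.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy stand-in for a fitted black-box model: it only looks at feature 0.
def toy_model(row):
    return int(row[0] > 0.5)

data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, labels)
for j, value in enumerate(importances):
    print(f"feature {j}: mean accuracy drop = {value:.3f}")
```

Shuffling the feature the model relies on produces a large drop in accuracy, while shuffling the ignored feature produces none. In practice the importance would be computed on held-out data rather than on the training set.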

Decision Guide

The n/p cut-offs below are rough rules of thumb rather than hard boundaries.

Data situation	Typical goal	Recommended approach
n/p > 20, theory-driven	Explanation and inference	Multivariate statistical models
5 < n/p < 20, correlated predictors	Explanation and prediction	Regularised regression, PLS
n/p < 5, complex nonlinear structure	Prediction	Machine learning (RF, boosting, NN)
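
The table can be read as a simple lookup on n/p. The sketch below is a deliberate simplification: it uses only the ratio and ignores the qualitative conditions in the first column (theory-driven, correlated predictors, nonlinear structure). The table does not say where n/p values of exactly 5 or 20 belong, so assigning them to the lower band is an assumption, and the function name recommend_approach is hypothetical.

```python
def recommend_approach(n_obs, n_predictors):
    """Map an n/p ratio onto the rows of the decision guide (rule of thumb only)."""
    if n_obs <= 0 or n_predictors <= 0:
        raise ValueError("n_obs and n_predictors must be positive")
    ratio = n_obs / n_predictors
    if ratio > 20:
        return ratio, "Multivariate statistical models"
    if ratio > 5:
        # Assumption: a ratio of exactly 20 falls in this band.
        return ratio, "Regularised regression, PLS"
    # Assumption: a ratio of exactly 5 falls in this band.
    return ratio, "Machine learning (RF, boosting, NN)"

# Hypothetical study sizes, chosen only to hit each row of the table.
for n_obs, n_predictors in [(500, 10), (120, 12), (60, 10_000)]:
    ratio, approach = recommend_approach(n_obs, n_predictors)
    print(f"n={n_obs}, p={n_predictors}, n/p={ratio:.3f}: {approach}")
```

The ratio is only a starting point: the research goal and the need for interpretable results should still decide between candidate methods.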

Key Points

Cross-validation is essential regardless of method: overfitting is a risk whenever predictors are numerous relative to sample size.

Regularised statistical models handle high-dimensional data effectively and should be considered before reaching for a black-box ML model.

Method choice should be driven by research purpose: hypothesis testing favours statistical models, prediction tasks may favour ML, and transparency requirements favour interpretable models.

ML models are not inherently better: they optimise different objectives.