Why Multivariate Statistics in the Age of AI?
A question that comes up often: "Why learn classical statistics when everyone is doing machine learning?"
The short answer is that they are not competing approaches. Multivariate statistics and modern AI methods share the same mathematical foundations, and you need one to use the other well.
The Methods Are the Same
The techniques covered in this course are not alternatives to machine learning; they are its building blocks:
- k-means and hierarchical clustering are core unsupervised learning algorithms
- PCA underpins dimensionality reduction across all of modern ML, including autoencoders
- LDA is a classic supervised classifier still widely used in practice
- t-SNE and UMAP build directly on classical ordination and manifold-learning ideas such as multidimensional scaling
Understanding these foundations means you know what is actually happening inside the models you run, and you can recognise when they fail.
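The connection is concrete. PCA, for instance, is just the singular value decomposition of the centred data matrix, the same decomposition that appears throughout modern ML. A minimal check in R, using the built-in iris measurements as stand-in data:

```r
# PCA via prcomp() and via SVD of the centred data give the same scores
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)

pca <- prcomp(X, center = FALSE)   # data already centred above
sv  <- svd(X)

scores_svd <- sv$u %*% diag(sv$d)  # principal component scores from the SVD

# Identical up to arbitrary sign flips of the components
max(abs(abs(pca$x) - abs(scores_svd)))  # effectively 0 (machine precision)
```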
Interpretation Matters in Research
Prediction accuracy is rarely the primary goal in biological and ecological research. More often, the question is why: which variables drive the pattern, which species characterise the groups, which environmental gradients explain community composition.
Multivariate statistics answers these questions directly. PCA loadings, LDA coefficients, and ordination biplots tell you something biologically meaningful. A neural network, by contrast, optimises prediction but offers little insight into mechanism.
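Those answers can be read straight off the fitted objects. A quick sketch with the built-in iris data standing in for a real dataset:

```r
# Which variables drive the first two components?
pca <- prcomp(iris[, 1:4], scale. = TRUE)
round(pca$rotation[, 1:2], 2)
# The petal measurements load heavily on PC1:
# a single "size" gradient dominates the data
```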
Interpretable results are also easier to publish and easier to defend in peer review. "Why not use a simpler method?" remains one of the most common reviewer comments in ecology and biology.
Small Data Is the Norm
Deep learning requires large datasets. Most biological studies do not have them. The methods in this course were developed for the sample sizes typical in ecology, microbiology, and experimental biology: tens to low hundreds of observations. They provide proper statistical inference (p-values, confidence intervals, effect sizes), which deep learning approaches generally do not.
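As a sketch of what that inference looks like, a one-line MANOVA in base R returns an actual p-value for a joint group difference across several response variables (iris again standing in for real data):

```r
# Do the four measurements differ jointly between species?
fit <- manova(cbind(Sepal.Length, Sepal.Width,
                    Petal.Length, Petal.Width) ~ Species, data = iris)
summary(fit, test = "Pillai")  # Pillai's trace, F statistic, and p-value
```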
Exploration Before Modelling
Regardless of which analytical approach you ultimately use, multivariate exploration comes first:
# Check structure and quality before modelling
library(factoextra)  # provides fviz_pca_biplot()

pca <- prcomp(data, scale. = TRUE)
fviz_pca_biplot(pca)  # Batch effects? Outliers? Dominant gradients?

set.seed(42)  # k-means depends on random starting centres
km <- kmeans(scale(data), centers = 3, nstart = 25)
table(km$cluster, metadata$treatment)  # Do clusters match expectations?
Feeding unchecked data into a complex model amplifies problems rather than solving them. The statistical workflow is the quality control layer.
A Practical Workflow
Multivariate statistics and machine learning are most powerful in combination:
graph TD
A[Raw Data] --> B[Quality Control]
B --> C[PCA: check structure and batch effects]
C --> D[Clustering: identify groups]
D --> E[Statistical tests: are groups significant?]
E --> F{Prediction needed?}
F -->|No| G[Interpret and publish]
F -->|Yes| H[Build ML model]
H --> I[Validate with statistics]
I --> G
style C fill:#16A085
style D fill:#16A085
style E fill:#16A085
style H fill:#E67E22
The statistical steps are not optional even when the endpoint is a machine learning model. They are how you ensure the model is built on solid ground.
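In code, the statistical half of that diagram might look like the following minimal sketch (iris as placeholder data; the ML branch is left to whichever framework you use):

```r
dat <- scale(iris[, 1:4])

# Quality control and structure: PCA first
pca <- prcomp(dat)
summary(pca)  # how many components carry real signal?

# Identify groups
set.seed(1)
km <- kmeans(dat, centers = 3, nstart = 25)

# Are the groups meaningful, i.e. do they align with known labels?
tab <- table(cluster = km$cluster, species = iris$Species)
chisq.test(tab)  # association between clusters and species
```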
"All models are wrong, but some are useful." (George Box)