Datasets and R Packages

This page lists the datasets and R packages used across the course. Packages are from CRAN unless noted otherwise. Installation instructions are on the R Setup page.

Datasets
iris: Sepal and petal measurements for 150 flowers from three Iris species. Used throughout for PCA, LDA, and clustering examples.
dune / dune.env: Plant species abundance and environmental data from 20 dune meadow sites in the Netherlands (vegan package). Used for ordination, PERMANOVA, and constrained ordination.
mtcars: Performance and design measurements for 32 car models. Used for PCA and logistic regression exercises.
USArrests: Arrest rates per 100,000 residents for murder, assault, and rape, plus percent urban population, for the 50 US states. Used for the hierarchical clustering exercise.
airway: RNA-seq read counts from an airway smooth muscle cell experiment (Bioconductor). Used for the gene expression PCA example.
GlobalPatterns: 16S amplicon data from 26 samples across nine environment types (phyloseq, Bioconductor). Used for microbiome ordination, LDA, and logistic regression examples.
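
As a quick sanity check, the datasets above can be loaded and their sizes confirmed. The following is a minimal sketch, not part of the course material: it assumes a standard R installation, and it treats vegan and phyloseq as optional, skipping them if they are not installed. The `dims` list is a name introduced here for illustration only.

```r
# Sketch: load the course datasets and record their sizes.
# iris, mtcars, and USArrests ship with base R (the 'datasets' package).
data(iris)
data(mtcars)
data(USArrests)

dims <- list(
  iris      = dim(iris),       # flowers x variables
  mtcars    = dim(mtcars),     # car models x variables
  USArrests = dim(USArrests)   # states x variables
)

# dune and dune.env live in vegan; skip them if vegan is not installed.
if (requireNamespace("vegan", quietly = TRUE)) {
  data(dune, package = "vegan")
  data(dune.env, package = "vegan")
  dims$dune     <- dim(dune)      # sites x species
  dims$dune.env <- dim(dune.env)  # sites x environmental variables
}

# GlobalPatterns lives in phyloseq (Bioconductor); skip if not installed.
if (requireNamespace("phyloseq", quietly = TRUE)) {
  data(GlobalPatterns, package = "phyloseq")
  dims$GlobalPatterns_samples <- phyloseq::nsamples(GlobalPatterns)
}

str(dims)
```

The airway dataset can be loaded in the same guarded way with `data(airway, package = "airway")` once the Bioconductor package of that name is installed.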

CRAN Packages
ggplot2: Plotting based on the grammar of graphics. Used throughout for ordination plots, score plots, and other figures.
GGally: Extension of ggplot2. The ggpairs() function is used for pairwise correlation plots.
corrplot: Visualisation of correlation matrices. Used in the correlation and covariance section.
MASS: Functions from Modern Applied Statistics with S. Provides lda() for linear discriminant analysis and polr() for ordinal logistic regression.
factoextra: Visualisation of PCA, clustering, and other multivariate results. Used throughout for scree plots, score plots, and cluster visualisation.
FactoMineR: Extended PCA and multivariate methods. Its results can be visualised with factoextra, so the two packages are used together.
cluster: Cluster analysis methods, including pam() (partitioning around medoids), fanny() (fuzzy clustering), and silhouette(). Used in clustering examples and validation.
mclust: Model-based clustering and the Adjusted Rand Index (adjustedRandIndex()). Used for cluster validation.
dbscan: Density-based clustering. Used in the advanced clustering section.
vegan: Community ecology tools including vegdist(), metaMDS(), adonis2(), envfit(), rda(), cca(), and dbrda(). Central to ordination and constrained ordination.
ape: Provides pcoa() for principal coordinates analysis.
pairwiseAdonis: Pairwise PERMANOVA comparisons with p-value adjustment for multiple testing. Installed from GitHub rather than CRAN.
dendextend: Tools for manipulating and colouring dendrograms. Used in the microbiome clustering example.
pheatmap: Heatmaps with integrated clustering. Used for the RNA-seq gene expression example.
caret: Unified interface for model training and cross-validation. Used for LDA and logistic regression.
car: Regression diagnostics including vif() for variance inflation factors and boxTidwell() for logit linearity checks.
pROC: ROC curves and AUC. Used in logistic regression model evaluation.
glmnet: Regularised regression including LASSO, ridge, and elastic net. Used in the microbiome classification example.
nnet: Multinomial logistic regression via multinom().
logistf: Firth's penalised logistic regression for datasets with complete separation.
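
Before starting the exercises, it can help to check which of the packages above are already installed. The sketch below is one way to do that with base R only. The helper name `is_installed()` and the variables `cran_pkgs`, `status`, and `missing_pkgs` are hypothetical names introduced here. pairwiseAdonis is left out of the vector on the assumption that it is installed from GitHub rather than CRAN.

```r
# Sketch: report which CRAN packages from this page are not yet installed.
cran_pkgs <- c(
  "ggplot2", "GGally", "corrplot", "MASS", "factoextra", "FactoMineR",
  "cluster", "mclust", "dbscan", "vegan", "ape", "dendextend",
  "pheatmap", "caret", "car", "pROC", "glmnet", "nnet", "logistf"
)

# system.file() returns "" when a package cannot be found, so nzchar()
# gives TRUE for installed packages without loading them.
is_installed <- function(pkgs) {
  vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
}

status <- is_installed(cran_pkgs)
missing_pkgs <- names(status)[!status]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed CRAN packages are installed.")
}
```

Checking with system.file() avoids attaching or loading any package, so the check has no side effects on the session. Any packages reported as missing can then be installed by following the R Setup page.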

Bioconductor Packages
phyloseq: Data structure and analysis tools for microbiome amplicon data. Provides wrappers for ordination, filtering, and transformation. Used throughout the biological application pages.
microbiome: Utilities for microbiome data, including the transform() function for CLR and other transformations.
DESeq2: Differential expression analysis for RNA-seq. The vst() function is used for variance stabilisation before PCA and clustering.
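Seurat: Single-cell RNA-seq analysis. Distributed on CRAN rather than Bioconductor; listed here with the other biological application packages. Used for the scRNA-seq clustering example including PCA, graph-based clustering, and UMAP.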