This page gives a brief overview of the datasets and R packages used throughout the course. We use a combination of traditional datasets, tidy data tools and specialised packages for multivariate analysis. Most packages come from CRAN; others come from Bioconductor, which is particularly useful for high-dimensional or biological data. These resources support a variety of techniques, including principal component analysis (PCA), clustering, ordination and data visualisation.

R Data

  • iris: Measurements of sepal and petal length and width for 150 iris flowers from three species.

  • dune: Plant species abundance data from 20 dune meadow sites in the Netherlands.
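
As a minimal sketch of how these two datasets can be loaded: iris ships with base R, while dune is bundled with the vegan package, so the second step assumes vegan is already installed.

```r
# iris is part of the 'datasets' package that comes with base R.
data(iris)
str(iris)  # 150 obs. of 5 variables

# dune is bundled with vegan; skip this step if vegan is not installed.
if (requireNamespace("vegan", quietly = TRUE)) {
  data(dune, package = "vegan")
  print(dim(dune))  # sites in rows, species in columns
}
```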

R Packages

  • tidyverse: Collection of R packages for data manipulation, exploration, and visualization using a consistent syntax.
  • ggplot2: Create elegant and customizable data visualizations using the grammar of graphics.
  • GGally: Extension of ggplot2 that adds functions for exploring relationships between variables, including ggpairs() for matrices of pairwise plots.
  • MASS: Functions and datasets from Modern Applied Statistics with S, including tools for linear and multivariate analysis.
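  • cluster: Methods for cluster analysis, including hierarchical clustering (agnes(), diana()), partitioning around medoids (pam()), and fuzzy clustering (fanny()). Note that k-means itself is provided by kmeans() in base R's stats package.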
  • vegan: Tools for community ecology, including ordination, diversity analysis, and dissimilarity measures.
  • factoextra: Extract and visualize the results of multivariate data analyses.
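
To give a feel for how these packages fit together, here is a small sketch (not a prescribed course workflow) that runs a PCA on iris with base R's prcomp() and then clusters the same measurements with pam() from cluster. The choice of k = 3 is an assumption made only to match the three species.

```r
# Sketch only: prcomp() is base R; pam() comes from the cluster package.
data(iris)
measurements <- iris[, 1:4]

# PCA on standardised variables (scale. = TRUE divides each column by its sd).
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)
print(summary(pca))  # proportion of variance explained per component

# Partitioning around medoids; k = 3 is chosen to match the three species.
if (requireNamespace("cluster", quietly = TRUE)) {
  fit <- cluster::pam(scale(measurements), k = 3)
  # Cross-tabulate cluster membership against the known species labels.
  print(table(cluster = fit$clustering, species = iris$Species))
}
```

If factoextra is installed, functions such as fviz_pca_ind(pca) can be used to plot the same PCA object.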

Bioconductor Packages for General Multivariate Statistics

  • mixOmics: General multivariate analysis and omics data integration, with support for PCA, PLS(-DA), clustering, and network visualization. It is flexible enough for non-omics data too.
  • FactoMineR: Comprehensive tools for PCA, MCA, CA, and clustering; easy to use and well suited to teaching. (Distributed on CRAN rather than Bioconductor, but it combines well with the Bioconductor packages listed here.)
  • pcaMethods: PCA with robust methods for missing data — includes standard, probabilistic, and Bayesian PCA.
  • made4: Classic multivariate methods (e.g. CA, PCA, clustering) with visualization tools, originally for microarrays but generalizable.

These packages are more general-purpose and user-friendly than highly specialized ones like MOFA2 or DESeq2. You can combine them with CRAN packages like factoextra, cluster, or vegan for visualization and additional methods.
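
One possible way to install these packages is through BiocManager, which is itself on CRAN and can resolve both Bioconductor and CRAN packages in a single call. The helper name install_course_packages() below is hypothetical, and the call is left commented out because it needs network access.

```r
# Packages named on this page, grouped by where they are distributed.
bioc_pkgs <- c("mixOmics", "pcaMethods", "made4")
cran_pkgs <- c("FactoMineR", "factoextra", "cluster", "vegan")

# Hypothetical helper: installs BiocManager if needed, then the packages.
install_course_packages <- function(pkgs) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # update = FALSE and ask = FALSE avoid interactive prompts during install.
  BiocManager::install(pkgs, update = FALSE, ask = FALSE)
}

# Uncomment to install; this needs network access and a configured mirror.
# install_course_packages(c(bioc_pkgs, cran_pkgs))
```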

More Multivariate Statistics Packages on Bioconductor

  • MOFA2: Multi-Omics Factor Analysis – unsupervised learning across multiple omics layers.
  • DESeq2: Primarily for differential expression, but its rlog() and vst() transformations, together with plotPCA(), support multivariate exploration of count data.
  • edgeR: Like DESeq2, built for differential expression of count data; its plotMDS() method draws multidimensional scaling plots for visualizing high-dimensional count data.
  • limma: Linear models for microarray and RNA-seq data, with plotMDS() for multidimensional scaling plots and duplicateCorrelation() for estimating within-block correlation.
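
As an illustration of the exploratory side of these packages, the sketch below applies a variance-stabilizing transformation and draws a PCA plot with DESeq2. It assumes DESeq2 is installed and uses the simulated counts from makeExampleDESeqDataSet(), not real data.

```r
# Sketch only: runs the DESeq2 steps when the package is available.
has_deseq2 <- requireNamespace("DESeq2", quietly = TRUE)

if (has_deseq2) {
  # Simulated count data: 1000 genes, 6 samples, two conditions.
  dds <- DESeq2::makeExampleDESeqDataSet(n = 1000, m = 6)

  # Variance-stabilizing transformation, blind to the experimental design.
  vsd <- DESeq2::varianceStabilizingTransformation(dds, blind = TRUE)

  # PCA plot of the samples, coloured by the simulated 'condition' factor.
  print(DESeq2::plotPCA(vsd, intgroup = "condition"))
}
```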