Clustering: Biological Applications

This page provides worked examples of clustering applied to common data types in biology and ecology. The methods are the same as those covered in the main clustering page; what changes is the preprocessing and the choice of distance metric.

Microbiome Sample Clustering

For OTU or ASV tables, Bray-Curtis dissimilarity is the standard choice. Data should be filtered to remove rare taxa and, if compositional effects are a concern, CLR-transformed before computing distances.

library(vegan)
library(phyloseq)
library(microbiome)
library(dendextend)

# Load example data
data(GlobalPatterns)
gp <- GlobalPatterns

# 1. Filter rare taxa (present in at least 10% of samples)
gp_filtered <- filter_taxa(gp,
  function(x) sum(x > 3) > (0.1 * length(x)), TRUE)

# 2. CLR transformation (handles compositionality)
gp_clr <- microbiome::transform(gp_filtered, "clr")
otu_clr <- as.matrix(otu_table(gp_clr))

# 3. Bray-Curtis distance on CLR-transformed data
dist_bray <- vegdist(t(otu_clr), method = "bray")

# 4. Hierarchical clustering
hc <- hclust(dist_bray, method = "ward.D2")

# 5. Colour dendrogram by sample type
sample_meta <- as(sample_data(gp), "data.frame")
dend <- as.dendrogram(hc)
sample_types <- sample_meta[labels(dend), "SampleType"]
colors <- rainbow(length(unique(sample_types)))
names(colors) <- unique(sample_types)
labels_colors(dend) <- colors[as.character(sample_types)]

plot(dend, main = "Microbiome Sample Clustering by Sample Type")

Gene Expression Clustering

RNA-seq count data requires variance stabilisation before clustering. The DESeq2 VST transformation is the standard approach. Clustering is typically applied both to samples (to check for batch effects or treatment groups) and to genes (to find co-expression patterns).

library(DESeq2)
library(pheatmap)

data("airway")
dds <- DESeqDataSet(airway, design = ~ cell + dex)

# 1. Variance stabilising transformation
vsd <- vst(dds, blind = FALSE)
mat <- assay(vsd)

# 2. Select top 50 most variable genes
rv <- rowVars(mat)
top_genes <- order(rv, decreasing = TRUE)[1:50]
mat_subset <- mat[top_genes, ]

# 3. Scale across samples (row-wise)
mat_scaled <- t(scale(t(mat_subset)))

# 4. Cluster genes and samples simultaneously
pheatmap(mat_scaled,
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean",
         clustering_method = "ward.D2",
         show_rownames = FALSE,
         annotation_col = as.data.frame(colData(dds)[, c("cell", "dex")]),
         main = "Gene Expression Heatmap")

The column annotation helps immediately identify whether samples cluster by treatment (dex) or by cell line (cell), which is the first quality check in most RNA-seq workflows.

Single-Cell RNA-seq

Single-cell data involves tens of thousands of cells and genes, making standard clustering impractical without dimensionality reduction first. The standard workflow in Seurat runs PCA before clustering, then uses UMAP for visualisation.

library(Seurat)

# Assuming a Seurat object 'pbmc' is already loaded

# 1. Normalise and find variable features
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc)

# 2. Scale and run PCA
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc)

# 3. Build neighbour graph and cluster
# (clustering happens in PCA space, not raw gene space)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

# 4. UMAP for visualisation
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap", label = TRUE)

The resolution parameter in FindClusters controls granularity: lower values give fewer, broader clusters; higher values give more, finer clusters. Biological annotation of clusters (cell type assignment) is done after clustering by examining marker gene expression per cluster.

Ecological Species Data

For species abundance matrices in ecology, the workflow mirrors microbiome analysis but typically uses Bray-Curtis directly without CLR transformation, combined with Hellinger transformation for ordination-based approaches.

library(vegan)

data(dune)

# 1. Hellinger transformation (recommended for species data)
dune_hell <- decostand(dune, method = "hellinger")

# 2. Euclidean distance on Hellinger-transformed data
# (equivalent to chord distance)
dist_hell <- dist(dune_hell, method = "euclidean")

# 3. Cluster sites
hc <- hclust(dist_hell, method = "ward.D2")
plot(hc, main = "Dune Meadow Sites")

# 4. Validate with PERMANOVA using environmental data
data(dune.env)
adonis2(dist_hell ~ Management, data = dune.env, permutations = 999)

Quick Reference: Data Type to Workflow

Data type	Transformation	Distance	Notes
Continuous (env. vars)	`scale()`	Euclidean	Always scale
Species counts	Hellinger or none	Bray-Curtis	Filter rare species
OTU/ASV table	CLR	Bray-Curtis	Filter rare taxa first
RNA-seq counts	VST (DESeq2)	Euclidean	Filter low-count genes
Single-cell RNA-seq	PCA scores	Graph-based	Use Seurat/Scanpy