PCA: Biological Applications

This page covers common applications of PCA in biological research. The core method is identical to what is described in the main PCA page; what differs is the preprocessing required for each data type.

Quality Control and Batch Effect Detection

PCA is one of the most effective tools for spotting technical problems before analysis. Running PCA on your samples and colouring points by technical variables (batch, plate, operator, sequencing run) immediately reveals whether samples cluster by biology or by artefact.

library(ggplot2)

# your_data: numeric matrix or data frame with samples in rows and features in columns
pca_qc <- prcomp(your_data, scale. = TRUE)

plot_data <- data.frame(
  PC1 = pca_qc$x[, 1],
  PC2 = pca_qc$x[, 2],
  Batch     = sample_metadata$batch,
  Treatment = sample_metadata$treatment
)

# Check for batch effects
ggplot(plot_data, aes(PC1, PC2, colour = Batch, shape = Treatment)) +
  geom_point(size = 3) +
  labs(title = "QC: Batch vs Treatment") +
  theme_minimal()

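If samples cluster by batch rather than by treatment, batch must be accounted for before proceeding. For statistical testing, include batch as a covariate in the model; for visualisation and clustering, remove it from the data with limma::removeBatchEffect() or ComBat (sva package).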
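The visual check can be backed by a number. The sketch below is a minimal illustration on simulated data; the matrix `x`, the factors `batch` and `treatment`, and the helper `pc_r2()` are all invented for this example. It regresses each of the first three PCs on a covariate and reports R², the share of that PC's variance the covariate explains.

```r
set.seed(1)

# Simulated data: 20 samples x 50 features (hypothetical, for illustration only)
n <- 20
p <- 50
batch     <- factor(rep(c("A", "B"), each = n / 2))
treatment <- factor(rep(c("ctrl", "trt"), times = n / 2))

x <- matrix(rnorm(n * p), nrow = n)
x <- x + 3 * (batch == "B")                      # strong shift in every feature for batch B
x[, 1:5] <- x[, 1:5] + 1 * (treatment == "trt")  # weak treatment effect in 5 features

pca <- prcomp(x, scale. = TRUE)

# R^2 from regressing each PC on a covariate (hypothetical helper)
pc_r2 <- function(scores, covariate) {
  apply(scores, 2, function(pc) summary(lm(pc ~ covariate))$r.squared)
}

r2 <- cbind(batch     = pc_r2(pca$x[, 1:3], batch),
            treatment = pc_r2(pca$x[, 1:3], treatment))
round(r2, 2)
```

With real data, replace `x` with your own matrix and pass columns of `sample_metadata` as covariates. A leading PC with a high R² for batch and a low R² for treatment is driven by artefact rather than biology.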

Gene Expression (RNA-seq)

Raw RNA-seq counts should not be used directly in PCA: the variance of a count grows with its mean, so a handful of highly expressed genes would dominate the components. Apply a variance-stabilising transformation first.
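The mean-variance problem is easy to see in a small simulation. The sketch below uses made-up negative binomial counts, and `log2(count + 1)` serves only as a crude stand-in for a proper variance-stabilising transformation; it is not the DESeq2 algorithm. It compares how widely per-gene variances are spread before and after transformation.

```r
set.seed(42)

# Simulated counts: 200 genes x 50 samples, negative binomial (hypothetical data)
n_genes   <- 200
n_samples <- 50
gene_mean <- exp(seq(log(10), log(10000), length.out = n_genes))

counts <- t(sapply(gene_mean, function(mu) rnbinom(n_samples, mu = mu, size = 10)))

row_var <- function(m) apply(m, 1, var)

raw_var <- row_var(counts)
log_var <- row_var(log2(counts + 1))  # crude stand-in for VST, for illustration only

# Spread of per-gene variances (largest / smallest) before and after transformation
raw_ratio <- max(raw_var) / min(raw_var)
log_ratio <- max(log_var) / min(log_var)
c(raw = raw_ratio, log2 = log_ratio)

# On the raw scale, variance tracks expression level very closely
mean_var_cor <- cor(gene_mean, raw_var, method = "spearman")
mean_var_cor
```

On the raw scale the variances span several orders of magnitude, so PCA would mostly reflect the highest-expressed genes. After transformation the spread is far narrower.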

library(DESeq2)
library(factoextra)

# Variance stabilising transformation
dds <- DESeqDataSetFromMatrix(countData = count_matrix,
                               colData = sample_data,
                               design = ~ treatment)
vsd <- vst(dds, blind = TRUE)  # blind = TRUE for QC/exploration

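# PCA on VST-transformed data
# DESeq2 has a built-in function for this
# (by default it uses the 500 most variable genes and does not scale them;
#  "treatment" and "batch" must both be columns of colData)
plotPCA(vsd, intgroup = c("treatment", "batch"))

# Or manually for more control
mat <- t(assay(vsd))                  # samples in rows, genes in columns
mat <- mat[, apply(mat, 2, var) > 0]  # prcomp(scale. = TRUE) fails on constant genes
pca <- prcomp(mat, scale. = TRUE)

fviz_pca_ind(pca,
             col.ind = factor(sample_data$treatment),  # a factor gives discrete group colours
             addEllipses = TRUE)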

A well-behaved RNA-seq dataset should show samples clustering by biological condition, not by technical variables. If the first two PCs are driven by sequencing depth or batch, address those issues before differential expression analysis.
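Whether a PC is driven by sequencing depth can be checked by correlating its scores with library size. The sketch below is a simulation under stated assumptions: the counts are invented, contain no biological signal, and are deliberately left unnormalised so that depth is the only structure in the data.

```r
set.seed(7)

# Simulated counts: 300 genes x 12 samples with no biological signal,
# only differences in sequencing depth (hypothetical data)
n_genes   <- 300
n_samples <- 12
base_mean <- exp(runif(n_genes, log(50), log(5000)))
depth     <- seq(0.5, 2, length.out = n_samples)  # relative library size

counts <- sapply(depth, function(d) rpois(n_genes, lambda = base_mean * d))
colnames(counts) <- paste0("S", seq_len(n_samples))

# Log-transform WITHOUT normalising for library size, then PCA on samples
pca <- prcomp(t(log2(counts + 1)), scale. = FALSE)

# Correlate PC1 scores with library size (the sign of a PC is arbitrary)
lib_size  <- colSums(counts)
depth_cor <- cor(pca$x[, 1], log10(lib_size))
depth_cor
```

On a real dataset the same check would be `cor(pca$x[, 1], log10(colSums(count_matrix)))`, using the raw count matrix from the DESeq2 example. A correlation close to 1 or -1 suggests the component reflects depth rather than biology.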

Microbiome Data

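Microbiome OTU and ASV tables are compositional: the counts carry only relative information, because each sample's total is set by sequencing depth rather than by the true microbial load. Standard PCA on raw counts or proportions can therefore produce spurious correlations. The centred log-ratio (CLR) transformation expresses each taxon relative to the geometric mean of its sample, which moves the data into a space where Euclidean methods such as PCA are appropriate.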
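For intuition, CLR can be written in a few lines of base R. This is a minimal sketch, not the implementation used by any package: the function `clr()` and its `pseudocount` argument are defined here for illustration, and the toy table has no zeros, so no pseudocount is needed.

```r
# Centred log-ratio: log of each value minus the mean log of its sample (row).
# `pseudocount` is an assumption of this sketch, needed because log(0) is undefined.
clr <- function(x, pseudocount = 0) {
  logx <- log(x + pseudocount)
  sweep(logx, 1, rowMeans(logx), "-")
}

# Toy table: 3 samples (rows) x 4 taxa (columns), no zeros
counts <- matrix(c(10, 20, 30, 40,
                    5, 10, 15, 20,
                   80, 10,  5,  5),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("S", 1:3), paste0("taxon", 1:4)))

clr_counts <- clr(counts)
clr_props  <- clr(counts / rowSums(counts))

rowSums(clr_counts)                          # each row sums to (numerically) zero
all.equal(clr_counts, clr_props)             # TRUE: the sample total cancels out
all.equal(clr_counts[1, ], clr_counts[2, ])  # TRUE: S2 is S1 at half the depth
```

Because the sample total cancels, counts and proportions give the same CLR values, which is why the transformation is insensitive to sequencing depth.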

library(microbiome)
library(phyloseq)
library(factoextra)

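# CLR transformation
ps_clr <- microbiome::transform(physeq, "clr")
# abundances() returns taxa x samples, so t() puts samples in rows
otu_clr <- t(microbiome::abundances(ps_clr))

# PCA on CLR-transformed data
pca <- prcomp(otu_clr, scale. = FALSE)
# scale. = FALSE: CLR values are already on a common log-ratio scale, and
# unit-variance scaling would distort the Aitchison distances between samples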

# Plot coloured by metadata variable
meta <- as(sample_data(physeq), "data.frame")

fviz_pca_ind(pca,
             col.ind = meta$SampleType,
             addEllipses = TRUE,
             legend.title = "Sample Type")

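Note that Aitchison PCA (PCA on CLR-transformed data) is distinct from PCoA on Bray-Curtis distances. Both are valid, but they answer different questions: Aitchison PCA compares samples by the log-ratios between taxa, whereas Bray-Curtis compares abundances directly and is influenced most by the abundant taxa. Aitchison PCA is more appropriate when you want to interpret loadings and identify which taxa drive sample separation, because PCoA does not produce loadings.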
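The difference between the two approaches lies in the distance, not in the ordination algorithm. The sketch below uses a made-up, strictly positive abundance table and base R only. It shows that PCoA on Aitchison distances (Euclidean distances between CLR-transformed samples) reproduces the PCA sample scores, while only PCA also returns loadings.

```r
set.seed(3)

# Hypothetical strictly positive abundance table: 8 samples x 6 taxa
abund <- matrix(rlnorm(8 * 6, meanlog = 3), nrow = 8,
                dimnames = list(paste0("S", 1:8), paste0("taxon", 1:6)))

# CLR: log of each value minus the mean log of its sample
log_abund <- log(abund)
clr_abund <- sweep(log_abund, 1, rowMeans(log_abund), "-")

# Aitchison distance is the Euclidean distance between CLR-transformed samples
aitchison <- dist(clr_abund)

pca  <- prcomp(clr_abund, scale. = FALSE)  # Aitchison PCA
pcoa <- cmdscale(aitchison, k = 2)         # PCoA on the same distances

# Sample scores agree up to the arbitrary sign of each axis
scores_match <- all.equal(abs(unname(pca$x[, 1:2])), abs(unname(pcoa)),
                          tolerance = 1e-6)
scores_match

# Only PCA returns loadings, i.e. the contribution of each taxon to each axis
round(pca$rotation[, 1:2], 2)
```

Swapping `aitchison` for a Bray-Curtis distance matrix would give a different ordination, and there would be no `rotation` matrix to interpret.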

Ecological Community Data

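For species abundance matrices, the Hellinger transformation (the square root of each species' relative abundance within a site) is recommended before PCA. It down-weights rare species and makes the results less sensitive to double zeros (species absent from both sites), which plain Euclidean distance on raw abundances would count as similarity.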
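A three-site toy example shows the effect. The sketch below defines `hellinger()` by hand for illustration; it should correspond to what `decostand(x, method = "hellinger")` computes for a sites-by-species matrix. The abundance values are invented.

```r
# Hellinger transformation: square root of each species' share of its site total
hellinger <- function(x) sqrt(x / rowSums(x))

# Three sites x three species (toy data)
sites <- matrix(c(0, 1, 1,
                  1, 0, 0,
                  0, 4, 8),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("A", "B", "C"), paste0("sp", 1:3)))

d_raw  <- as.matrix(dist(sites))             # Euclidean distance on raw abundances
d_hell <- as.matrix(dist(hellinger(sites)))  # Euclidean distance after Hellinger

round(d_raw, 2)
round(d_hell, 2)
```

On raw abundances, site A looks closer to B (1.73), with which it shares no species, than to C (7.62), which has the same species at higher abundance. After the Hellinger transformation the order is reversed: A and C are close (0.17), while A and B sit at the maximum possible distance of sqrt(2), about 1.41.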

library(vegan)
library(factoextra)

data(dune)
data(dune.env)

# Hellinger transformation
dune_hell <- decostand(dune, method = "hellinger")

# PCA
pca <- prcomp(dune_hell, scale. = FALSE)
# scale. = FALSE: Hellinger values are already comparable across species, and
# unit-variance scaling would undo the weighting the transformation provides

# Plot coloured by management type
fviz_pca_ind(pca,
             col.ind = dune.env$Management,
             addEllipses = TRUE,
             legend.title = "Management")

# Biplot to see which species drive the axes
fviz_pca_biplot(pca,
                col.ind = dune.env$Management,
                col.var = "grey50",
                label = "var")

If you want to explicitly relate community composition to environmental variables rather than just visualise it, use constrained ordination such as redundancy analysis (RDA) instead; see the Ordination page.

Quick Reference: Preprocessing by Data Type

Data type	Transformation	`scale.` in `prcomp()`
Environmental variables (mixed units)	None	`TRUE`
Environmental variables (same unit)	None	`FALSE` or `TRUE`
Gene expression (RNA-seq)	VST or rlog	`TRUE`
Microbiome (OTU/ASV)	CLR	`FALSE`
Species abundance	Hellinger	`FALSE`
Metabolomics	log, then scale manually (e.g. autoscaling or Pareto)	`FALSE`
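The `scale.` column only says whether `prcomp()` should standardise the variables itself. Where a transformation has already put variables on a comparable scale, set it to `FALSE`. The sketch below uses invented metabolite intensities to show that autoscaling by hand and then calling `prcomp()` with `scale. = FALSE` gives the same result as leaving the scaling to `prcomp()`.

```r
set.seed(11)

# Hypothetical metabolite intensities: 15 samples x 8 metabolites, right-skewed
intens  <- matrix(rlnorm(15 * 8, meanlog = 5, sdlog = 1), nrow = 15)
log_int <- log2(intens)

# Option 1: scale by hand (autoscaling), then tell prcomp() not to scale again
pca_manual <- prcomp(scale(log_int), scale. = FALSE)

# Option 2: let prcomp() do the scaling
pca_auto <- prcomp(log_int, scale. = TRUE)

all.equal(pca_manual$sdev, pca_auto$sdev)  # TRUE: the two routes are equivalent
sum(pca_auto$sdev^2)                       # total variance equals the number of variables (8)
```

The equivalence holds for autoscaling only. Other schemes, such as Pareto scaling, have to be applied by hand and always combined with `scale. = FALSE`.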