Principal component analysis (PCA)
Learning Objectives
▫︎ Explain the rationale and assumptions behind PCA as a dimensionality reduction technique.
▫︎ Apply PCA to multivariate data in R, including appropriate preprocessing.
▫︎ Critically interpret PCA results to uncover structure, gradients, or outliers in complex datasets.
What is PCA? A simple introduction
Principal component analysis (PCA) is a statistical method used to simplify complex data. It helps us understand patterns in datasets with many variables, such as measurements of multiple traits of organisms or environmental conditions at different sites.
PCA creates a new coordinate system where each axis (principal component) is a linear combination of the original variables. These components are ordered by the amount of variation they explain, with the first capturing the most.
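To make "linear combination" concrete, here is a minimal sketch using the iris data (introduced properly below): each PC score is simply a weighted sum of the centred, scaled variables, with the weights given by that component's loadings.
# Minimal sketch: a PC score is a weighted sum of the standardised variables
x <- scale(iris[, 1:4])                 # centre and scale the four measurements
pca <- prcomp(x)                        # PCA on the standardised data
sum(x[1, ] * pca$rotation[, "PC1"])     # PC1 score of the first flower, built by hand
pca$x[1, "PC1"]                         # the same score as returned by prcomp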
How much the first principal component explains varies widely with the dataset and context, though it is often a substantial portion. The second principal component explains the next largest amount of variation, and together the first two components often capture a meaningful fraction of the total variation.
In some datasets, like the iris data, these two components can explain over 95% of the variance, making it easy to visualize the data in two dimensions. In many biological datasets with more complex variation, however, the variance is spread more evenly across components, so more dimensions are needed to capture the underlying structure.
It’s important to note that PCA assumes linear relationships and focuses on variance-based structure, so it might not capture complex nonlinear patterns or variables with little meaningful variance.
Using just the first few components often allows us to capture most of the important information in the data while ignoring the less informative variation—sometimes referred to as "noise" (not necessarily measurement error, but minor patterns or random variability). This makes PCA an excellent tool for reducing the number of variables, finding patterns, and creating visualizations of complex datasets.
For instance, rather than examining five environmental variables individually, PCA may demonstrate that the majority of variation between sites can be explained by just two components, enabling us to plot the data in a straightforward 2D graph and identify groups, gradients, or outliers more clearly.
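To sketch this scenario concretely, the short example below simulates five environmental variables for 30 sites (the variable names and values are made up purely for illustration) and checks how much variation the first two components capture.
# Hypothetical example: five environmental variables measured at 30 sites
# (all values simulated here purely for illustration)
set.seed(1)
temperature <- rnorm(30, mean = 15, sd = 3)
humidity    <- 80 - 2 * temperature + rnorm(30, sd = 4)
elevation   <- 900 - 40 * temperature + rnorm(30, sd = 60)
rainfall    <- 800 + 10 * humidity + rnorm(30, sd = 50)
soil_ph     <- rnorm(30, mean = 6.5, sd = 0.5)
env <- data.frame(temperature, humidity, elevation, rainfall, soil_ph)
env_pca <- prcomp(env, scale. = TRUE)
summary(env_pca)   # in this simulated example the first two PCs carry most of the variation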
PCA is often used as a preliminary step to simplify data before applying further analyses like clustering, classification, or ecological gradient interpretation.
Now that we understand the basic idea behind PCA, let’s see how it works in practice. We’ll use the well-known iris dataset to perform a simple PCA, visualize the results, and explore how the method helps us understand patterns in biological data.
Simple PCA Example in R
Let us create a PCA plot to visualise the data in just two dimensions. This type of plot can help us to identify patterns in the data, such as clusters, gradients or overlap between groups, even when the original dataset contains many variables.
# Load the data
data(iris)
# Run PCA (only on the numeric variables)
iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)
# Build a data frame of PC scores and add the species labels for plotting
library(ggplot2)
pca_data <- as.data.frame(iris_pca$x)   # iris_pca$x contains the PC scores
pca_data$Species <- iris$Species
# Plot PC1 against PC2, coloured by species
ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
  geom_point(size = 3) +
  labs(title = "PCA of Iris Dataset", x = "PC1", y = "PC2") +
  theme_minimal()
Scree Plot: Visualizing Variance Explained by Principal Components
A scree plot helps us understand how much variance each principal component captures. It displays the proportion of total variance explained by each component, ordered from the first to the last.
By examining a scree plot, we can decide how many components are needed to summarize most of the variation in the data. Often, there is a point called the "elbow," where the amount of additional variance explained by further components drops off noticeably. This suggests a natural cutoff for how many components to keep.
In the iris dataset, you will likely see that the first two components explain most of the variance, which supports visualizing the data in just two dimensions. In other datasets, variance might be more spread out, indicating that more components are necessary to capture the underlying structure.
Here is an example of how to create a scree plot in R:
# Calculate the proportion of variance explained by each component
var_explained <- iris_pca$sdev^2 / sum(iris_pca$sdev^2)
# Create the scree plot
library(ggplot2)
scree_data <- data.frame(PC = seq_along(var_explained),
                         Variance = var_explained)
ggplot(scree_data, aes(x = PC, y = Variance)) +
  geom_line() +
  geom_point() +
  labs(x = "Principal Component", y = "Proportion of Variance Explained",
       title = "Scree Plot") +
  theme_minimal()
Now that we have visualised the variance explained by each principal component using the scree plot, let’s consider what this means for interpreting PCA results and deciding which components to focus on. Consider the following questions to deepen your understanding of these concepts.
How much of the total variation is explained by PC1 and PC2? (Hint: Look at summary(iris_pca))
Check the proportion of variance explained by PC1 and PC2 (the code below shows how). PC1 always explains the largest share and PC2 the next largest; in the iris dataset PC1 accounts for roughly 73% and PC2 for about 23%, so together they explain close to 96% of the total variation.
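A quick way to check the exact figures on your own run:
# Proportion of variance explained, per component and cumulatively
summary(iris_pca)
# Just the cumulative share explained by PC1 and PC2
summary(iris_pca)$importance["Cumulative Proportion", "PC2"]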
Which original variables contribute most to PC1 and PC2? (Hint: Check iris_pca$rotation)
- PC1 has large loadings of the same sign for Petal.Length, Petal.Width and Sepal.Length, with Sepal.Width loading in the opposite direction; it mainly reflects overall flower size, driven above all by the petal measurements.
- PC2 is dominated by Sepal.Width, with a smaller contribution from Sepal.Length and near-zero petal loadings, so this component mainly separates flowers by their sepal dimensions. (The signs of loadings are arbitrary; only their relative pattern matters. See the code below to inspect the loadings yourself.)
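To inspect the loadings yourself, and to see observations and variable arrows together, one option is:
# Loadings: how strongly each original variable contributes to each component
round(iris_pca$rotation[, 1:2], 2)
# A base-R biplot overlays the observations and the variable arrows on PC1 and PC2
biplot(iris_pca, scale = 0)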
Can you visually separate the three species based on PC1 and PC2? What might that suggest?
- Yes! In the PCA plot, setosa flowers cluster distinctly from versicolor and virginica.
- Versicolor and virginica overlap more but still show some separation.
- This suggests that the measured traits (especially petal size) help differentiate the species, and that PCA captures this separation well (a quick numerical check follows below).
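As a quick numerical check of this separation, using the pca_data frame created earlier:
# Mean PC1 score per species: setosa sits well apart from the other two
aggregate(PC1 ~ Species, data = pca_data, FUN = mean)
# A boxplot of PC1 by species shows the same pattern
boxplot(PC1 ~ Species, data = pca_data)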
What does it mean that the data were 'scaled' before PCA? Why is that important here?
- Scaling means each variable is standardized to have mean zero and standard deviation one before PCA (scale. = TRUE in prcomp).
- This is important because the variables differ greatly in how much they vary: petal length has a much larger variance than sepal width, even though all four measurements are in centimetres.
- Without scaling, PCA works on the covariance matrix, so the high-variance variables would dominate the principal components (the comparison below illustrates this).
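The effect is easy to see by comparing the raw variances and re-running the PCA without scaling; a short sketch:
# The raw variances differ a lot: petal length varies far more than sepal width
apply(iris[, 1:4], 2, var)
# Without scaling, PCA uses the covariance matrix, so high-variance variables dominate PC1
iris_pca_raw <- prcomp(iris[, 1:4], scale. = FALSE)
summary(iris_pca_raw)   # compare with summary(iris_pca) from the scaled analysis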
How many principal components would you consider retaining based on the scree plot? Why?
- Look for the 'elbow' point where the proportion of variance explained starts to level off.
- Consider keeping components up to that point to balance simplicity against the information retained; a cumulative-variance check (sketched below) can help confirm the choice.
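A common complementary rule of thumb, sketched below, is to keep enough components to pass a chosen cumulative-variance threshold (the 90% cutoff here is only an example, not a fixed rule).
# Cumulative variance explained, reusing var_explained from the scree-plot code
cumsum(var_explained)
# Number of components needed to reach, say, 90% of the total variance
which(cumsum(var_explained) >= 0.90)[1]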
What does it mean if the scree plot shows a gradual decline without a clear elbow?
- It suggests the variance is spread across many components.
- More components may be needed to capture important patterns in the data.