Principal Component Analysis (PCA)
PCA is an unsupervised method that creates a new coordinate system for your data. Each axis (principal component) is a linear combination of the original variables, ordered by the amount of variance it explains. The first component captures the most variation, the second captures the most of what remains, and so on.
Because the components are uncorrelated by construction, PCA is also useful for removing multicollinearity before regression or classification, and for compressing many variables into a smaller set of informative axes before clustering.
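The uncorrelatedness claim is easy to check empirically. A minimal sketch using the iris data (the same dataset used later in this tutorial):

```r
# PC scores are uncorrelated by construction: the correlation matrix of
# the scores is the identity (off-diagonal entries are zero)
pca <- prcomp(iris[, 1:4], scale. = TRUE)
round(cor(pca$x), 2)   # off-diagonal correlations are all 0
```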
PCA is unsupervised
PCA finds structure without using group labels. Any separation you see between groups in a PCA plot emerged from the data itself, not from telling the method which groups exist. This makes it a strong tool for exploration and for checking whether your grouping variable actually corresponds to biological signal.
Scaling
Variables on different scales must be standardised before PCA, otherwise variables with large variance dominate the first component regardless of their biological relevance.
# Check standard deviations
round(apply(iris[, 1:4], 2, sd), 2)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 0.83 0.44 1.77 0.76
# Petal.Length's SD is about 4x that of Sepal.Width (~16x the variance)
# Without scaling, it would dominate PC1
Use scale. = TRUE in prcomp() unless all variables share the same unit and the variance differences are themselves meaningful.
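To see the effect, run PCA on the raw (unscaled) iris measurements and inspect the first component:

```r
# Without scaling, PC1 is dominated by the highest-variance variable
pca_raw <- prcomp(iris[, 1:4])        # centred but not scaled
round(pca_raw$rotation[, "PC1"], 2)   # Petal.Length carries by far the largest loading
```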
Running PCA in R
library(factoextra)
data(iris)
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)
#> PC1 PC2 PC3 PC4
#> Proportion of Variance 0.730 0.229 0.037 0.005
#> Cumulative Proportion 0.730 0.958 0.995 1.000
PC1 and PC2 together explain 96% of the variance, so two dimensions capture nearly all the structure in this dataset. This is unusually high; in most biological datasets variance is more spread across components.
Scree Plot
A scree plot shows the proportion of variance explained by each component. Look for an elbow where the curve flattens, indicating that further components add little information.
fviz_screeplot(pca, addlabels = TRUE)
Three common criteria for deciding how many components to retain:
- Elbow rule: keep components up to the point where the curve flattens
- Cumulative variance: retain enough components to explain 80 to 90% of total variance
- Kaiser criterion: keep components with eigenvalue above 1 (each explains more than a single original variable)
# Cumulative proportion of variance
round(cumsum(pca$sdev^2 / sum(pca$sdev^2)), 2)
#> [1] 0.73 0.96 0.99 1.00
# Kaiser criterion: the eigenvalues are the squared singular values, pca$sdev^2
sum(pca$sdev^2 > 1)
#> [1] 1
In practice the criteria often disagree. Here the Kaiser criterion keeps only PC1 (PC2's eigenvalue is 0.91, just below 1), while the elbow and cumulative-variance rules both suggest two components. Use biological interpretability as the final arbiter.
Loadings
Loadings (the rotation matrix) show how much each original variable contributes to each component. Large absolute values indicate strong contributions; the sign indicates direction.
pca$rotation
#> PC1 PC2 PC3 PC4
#> Sepal.Length 0.521 -0.377 0.720 0.261
#> Sepal.Width -0.269 -0.923 -0.244 -0.124
#> Petal.Length 0.580 -0.024 -0.142 -0.801
#> Petal.Width 0.565 -0.067 -0.634 0.524
PC1 has large positive loadings for sepal length and both petal measurements, and a negative loading for sepal width. It can be interpreted as a general flower size axis: high PC1 scores correspond to large flowers with relatively narrow sepals.
PC2 is dominated by sepal width (strong negative loading), separating flowers mainly by sepal shape.
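This interpretation can be verified directly: each score is exactly the linear combination of the standardised variables given by the loadings. A quick sketch:

```r
# Scores = standardised data matrix times the loading (rotation) matrix
pca <- prcomp(iris[, 1:4], scale. = TRUE)
scores_manual <- scale(iris[, 1:4]) %*% pca$rotation
all.equal(unname(scores_manual), unname(pca$x))   # TRUE
```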
A loading plot shows all variables simultaneously:
fviz_pca_var(pca,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
Score Plot and Biplot
The score plot shows where each observation falls in PC space. Colouring by a grouping variable reveals whether that variable corresponds to structure in the data:
fviz_pca_ind(pca,
col.ind = iris$Species,
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE,
legend.title = "Species")
A biplot overlays loadings and scores in the same plot. Observations lying in the direction of an arrow have high values for that variable:
fviz_pca_biplot(pca,
col.ind = iris$Species,
col.var = "black",
legend.title = "Species")
When PCA Works Well and When It Does Not
PCA is well-suited for continuous variables with approximately linear relationships, for detecting gradients, outliers, and batch effects, and as a preprocessing step before clustering or classification.
It is less appropriate when relationships are strongly nonlinear, when variables are categorical, when the dataset has far more variables than observations, or when data are compositional (microbiome, relative abundances). For compositional data, apply a CLR transformation before running PCA.
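A minimal sketch of a CLR transformation on a toy count matrix. The 0.5 pseudocount used to handle zeros is an assumption for illustration, not a general recommendation; the right zero-replacement strategy depends on your data:

```r
# CLR (centred log-ratio) before PCA on compositional data
counts <- matrix(c(10, 0, 5,
                    2, 8, 1,
                    0, 3, 12,
                    6, 6, 6), nrow = 4, byrow = TRUE)   # toy sample x taxon counts
comp <- (counts + 0.5) / rowSums(counts + 0.5)   # zero-adjusted proportions
clr  <- log(comp) - rowMeans(log(comp))          # centre each sample's log values
pca_clr <- prcomp(clr)   # CLR values are already on a comparable log scale
```

Each CLR-transformed row sums to zero by construction, which removes the unit-sum constraint that distorts distances in raw compositional data.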
Exercise
Using the mtcars dataset:
- Run PCA with scaling on all numeric variables
- Plot the scree plot and decide how many components to retain
- Interpret the loadings of PC1: what does it represent?
- Plot scores coloured by number of cylinders (cyl): does PCA separate cars by cylinder count without being told to?
Solution
library(factoextra)
pca <- prcomp(mtcars, scale. = TRUE)
# Scree plot
fviz_screeplot(pca, addlabels = TRUE)
# PC1 and PC2 explain around 60% and 24% respectively
# Loadings
pca$rotation[, 1:2]
# PC1: high loadings for cyl, disp, hp, wt (engine size and weight)
# Interpretation: a "performance and size" axis
# Score plot coloured by cyl
fviz_pca_ind(pca,
col.ind = as.factor(mtcars$cyl),
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE,
legend.title = "Cylinders")
# Cars separate clearly by cylinder count along PC1