Linear Discriminant Analysis (LDA)

LDA finds linear combinations of variables that best separate predefined groups. It is the supervised counterpart to PCA: where PCA finds directions of maximum variance without using group labels, LDA finds directions of maximum separation between groups using group labels.

This distinction matters in practice. PCA may reveal that species separate along PC1, but only because the variance that drives PC1 happens to correlate with species identity. LDA explicitly optimises for that separation, so it tends to produce cleaner group discrimination when your groups are real and the assumptions are met.


PCA vs LDA

The simplest way to see the difference is to run both on the same data:

library(MASS)
library(ggplot2)
library(gridExtra)

data(iris)

# PCA: no group labels
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# LDA: uses Species labels
lda_model <- lda(Species ~ ., data = iris)
lda_pred  <- predict(lda_model)

# PCA plot
p1 <- ggplot(data.frame(pca$x, Species = iris$Species),
             aes(PC1, PC2, colour = Species)) +
  geom_point(size = 2) +
  stat_ellipse() +
  ggtitle("PCA: maximises variance") +
  theme_minimal()

# LDA plot
p2 <- ggplot(data.frame(lda_pred$x, Species = iris$Species),
             aes(LD1, LD2, colour = Species)) +
  geom_point(size = 2) +
  stat_ellipse() +
  ggtitle("LDA: maximises separation") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 2)

Both separate the three species, but LDA does so more cleanly because it is explicitly trying to maximise the ratio of between-group to within-group variance.
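That ratio can be computed directly on the one-dimensional scores. The sketch below defines a small helper (`sep_ratio` is an illustrative function written for this comparison, not part of MASS) and contrasts the LD1 scores with the PC1 scores:

```r
library(MASS)

# Fisher's criterion for 1-D scores: between-group SS / within-group SS
sep_ratio <- function(scores, groups) {
  gm <- tapply(scores, groups, mean)            # per-group means
  n  <- table(groups)                           # group sizes
  between <- sum(n * (gm - mean(scores))^2)     # between-group sum of squares
  within  <- sum((scores - gm[groups])^2)       # within-group sum of squares
  between / within
}

ld1 <- predict(lda(Species ~ ., data = iris))$x[, 1]
pc1 <- prcomp(iris[, 1:4], scale. = TRUE)$x[, 1]

sep_ratio(ld1, iris$Species)  # large: LD1 is chosen to maximise this
sep_ratio(pc1, iris$Species)  # smaller: PC1 maximises variance instead
```

Because LD1 maximises this ratio over all linear combinations of the predictors, its value is at least as large as PC1's.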


Assumptions

LDA makes stronger assumptions than PCA. Violating them does not always invalidate the results, but it is worth checking before interpreting loadings or using the model for classification.

Multivariate normality: each group should follow an approximately multivariate normal distribution.

Homogeneity of covariance: all groups should share the same covariance matrix. If they do not, consider quadratic discriminant analysis (QDA), which fits a separate covariance matrix per group.

No severe multicollinearity: highly correlated predictors inflate loadings. Check with a correlation matrix or VIF before fitting.

library(biotools)

# Box's M test for homogeneity of covariance
boxM(iris[, 1:4], iris$Species)
# Non-significant: assumption plausibly met
# Significant: consider QDA

# Correlation check
cor(iris[, 1:4])
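For the VIF check mentioned above, one sketch that avoids an extra package: the VIFs are the diagonal of the inverse of the predictor correlation matrix.

```r
# Variance inflation factors: diagonal of the inverse correlation matrix.
# VIF_j = 1 / (1 - R^2_j), where R^2_j regresses predictor j on the rest.
vif <- diag(solve(cor(iris[, 1:4])))
round(vif, 1)
# Values above ~10 signal severe multicollinearity
```

On iris the petal measurements are highly correlated, so their VIFs are large.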

When sample size is small relative to the number of variables (roughly n < 5p), LDA is prone to overfitting. Reduce dimensions with PCA first, then run LDA on the PC scores.
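A minimal sketch of that PCA-then-LDA workflow on iris (the choice of k = 2 components is illustrative; iris has n much larger than p and does not actually need this treatment):

```r
library(MASS)

# Reduce to the first k principal components, then discriminate on the scores
pca    <- prcomp(iris[, 1:4], scale. = TRUE)
k      <- 2                                   # illustrative choice
scores <- data.frame(pca$x[, 1:k], Species = iris$Species)

lda_pc <- lda(Species ~ ., data = scores)
mean(predict(lda_pc)$class == iris$Species)   # training accuracy on PC scores
```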


Fitting LDA in R

library(MASS)

lda_model <- lda(Species ~ ., data = iris)
print(lda_model)

Proportion of separation explained

The svd slot gives the singular values of the discriminant axes. Their squared values, expressed as proportions, show how much of the between-group separation each axis accounts for:

lda_model$svd^2 / sum(lda_model$svd^2)
#> [1] 0.9912 0.0088

LD1 accounts for 99% of the separation in the iris data, so the second axis adds almost nothing.

Loadings

The scaling matrix shows how much each variable contributes to each discriminant axis:

lda_model$scaling
#>                 LD1    LD2
#> Sepal.Length  0.829  0.024
#> Sepal.Width   1.534  2.165
#> Petal.Length -2.201 -0.932
#> Petal.Width  -2.810  2.839

Large absolute values indicate variables that strongly drive separation along that axis. Petal.Length and Petal.Width have the largest loadings on LD1, meaning petal dimensions are the primary basis for separating the three species.
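As a sanity check on what the scaling matrix is: its columns are projection weights, so multiplying the centred predictors by it reproduces the LD scores. This sketch assumes MASS centres the data at the prior-weighted mean of the group means, which the `all.equal()` check verifies:

```r
library(MASS)

lda_model <- lda(Species ~ ., data = iris)

# Centre predictors at the prior-weighted grand mean, then project
X      <- as.matrix(iris[, 1:4])
centre <- colSums(lda_model$prior * lda_model$means)
manual <- scale(X, center = centre, scale = FALSE) %*% lda_model$scaling

all.equal(unname(manual), unname(predict(lda_model)$x))
```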

Predictions and posterior probabilities

pred <- predict(lda_model)

# Predicted class labels
head(pred$class)

# Posterior probabilities: confidence of each prediction
head(pred$posterior)
#>   setosa   versicolor    virginica
#> 1      1 3.896358e-22 2.611168e-42
#> 2      1 7.217970e-18 5.042143e-37
#> 3      1 1.463849e-19 4.675932e-39

# Confusion matrix on training data
table(Predicted = pred$class, Actual = iris$Species)

Cross-Validation

Accuracy measured on the training data is optimistic. Report cross-validated accuracy instead.

Leave-one-out CV

lda() has built-in leave-one-out cross-validation:

lda_cv <- lda(Species ~ ., data = iris, CV = TRUE)

table(Predicted = lda_cv$class, Actual = iris$Species)
mean(lda_cv$class == iris$Species)
#> 0.98

k-fold CV with caret

library(caret)

set.seed(123)
cv_model <- train(Species ~ .,
                  data = iris,
                  method = "lda",
                  trControl = trainControl(method = "cv", number = 10))
print(cv_model)

Train-test split

For larger datasets where LOOCV is computationally expensive:

set.seed(123)
train_idx  <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

lda_train <- lda(Species ~ ., data = train_data)
test_pred <- predict(lda_train, newdata = test_data)

mean(test_pred$class == test_data$Species)

Common Pitfalls

Testing on training data. Training accuracy is inflated by overfitting. Always use cross-validation or a held-out test set.

More variables than samples. LDA breaks down when p approaches n. Run PCA first and use the PC scores as input.

Not scaling. Variables on different scales distort loadings. Use scale() before fitting unless all predictors share the same unit.

Unbalanced groups. LDA uses prior probabilities proportional to group sizes by default. With very unequal groups, the larger group dominates predictions. Adjust priors if needed:

lda(groups ~ ., data = data, prior = c(0.5, 0.5))
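As a concrete illustration, here is a deliberately imbalanced two-class subset of iris; the subset construction and the 50:50 prior are illustrative choices:

```r
library(MASS)

# Deliberately imbalanced subset: all 50 virginica, only 10 versicolor
imbal <- rbind(iris[iris$Species == "virginica", ],
               iris[iris$Species == "versicolor", ][1:10, ])
imbal$Species <- droplevels(imbal$Species)

# Default priors follow the 10:50 group sizes
lda(Species ~ ., data = imbal)$prior

# Equal priors remove the size advantage of the larger group
lda(Species ~ ., data = imbal, prior = c(0.5, 0.5))$prior
```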

Exercise

Using the iris dataset:

  1. Fit LDA and plot samples in LD1-LD2 space coloured by species
  2. Which species is hardest to separate, and which variable is most responsible?
  3. Run LOOCV and report accuracy
  4. Compare the LD1-LD2 plot to the PCA PC1-PC2 plot: which separates the species more cleanly?
Solution
library(MASS)
library(ggplot2)
library(gridExtra)

# 1. Fit and plot
lda_model <- lda(Species ~ ., data = iris)
pred <- predict(lda_model)

lda_df <- data.frame(pred$x, Species = iris$Species)
p_lda <- ggplot(lda_df, aes(LD1, LD2, colour = Species)) +
  geom_point(size = 2) +
  stat_ellipse() +
  ggtitle("LDA") +
  theme_minimal()

# 2. Hardest to separate: versicolor and virginica overlap on LD1
# Petal.Length and Petal.Width have the largest loadings
lda_model$scaling

# 3. LOOCV
lda_cv <- lda(Species ~ ., data = iris, CV = TRUE)
mean(lda_cv$class == iris$Species)
table(lda_cv$class, iris$Species)

# 4. PCA comparison
pca <- prcomp(iris[, 1:4], scale. = TRUE)
pca_df <- data.frame(pca$x[, 1:2], Species = iris$Species)
p_pca <- ggplot(pca_df, aes(PC1, PC2, colour = Species)) +
  geom_point(size = 2) +
  stat_ellipse() +
  ggtitle("PCA") +
  theme_minimal()

grid.arrange(p_lda, p_pca, ncol = 2)
# LDA separates versicolor and virginica more clearly