
Multicollinearity: Detection and Solutions

Learning Objectives

By the end of this section, you should be able to:

  • Understand what multicollinearity is and why it's problematic
  • Detect multicollinearity in your data
  • Apply appropriate strategies to handle multicollinearity
  • Choose the right approach based on your analysis goals

From Correlation to Multicollinearity

After studying covariance and correlation, we often discover that predictors in our dataset are not independent.

Multicollinearity occurs when predictors are highly correlated with each other.

This creates problems for regression models, especially for interpretation.

Common Examples in Biology

  • Body weight and body length
  • Gene expression of genes in the same pathway
  • Environmental variables (temperature and humidity)
  • Multiple measures of the same trait (leaf length and leaf area)

Why Multicollinearity Is a Problem

When predictors are strongly correlated:

  • Standard errors of the coefficients increase
  • Regression coefficients become unstable
  • Small changes in the data can cause large changes in the coefficients
  • Individual effects become hard to interpret (see the simulation sketch below)
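A minimal simulation (hypothetical data, not from any dataset used in this course) makes the instability concrete: two near-duplicate predictors are generated, and refitting the same model on fresh samples gives very different individual coefficients even though the true relationship never changes.

set.seed(1)

sim_fit <- function() {
  x1 <- rnorm(50)
  x2 <- x1 + rnorm(50, sd = 0.05)   # x2 is almost a copy of x1
  y  <- 2 * x1 + rnorm(50)          # the true effect runs through x1
  coef(lm(y ~ x1 + x2))
}

sim_fit()
sim_fit()
# Across repeated samples the x1 and x2 coefficients swing widely
# (and often take opposite signs), while their sum stays near 2
# and the model's predictions remain sensible.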

Key Insight

Prediction may still work, but inference becomes unreliable.

  • If you just want predictions → multicollinearity is less of a problem
  • If you want to understand which variables matter → serious issue!

Strategy 1: Remove Redundant Variables

If two predictors carry almost the same information, keep only one.

Example

Predicting TikTok engagement using:

# These are perfectly correlated
video_length <- c(5, 10, 15, 20)  # minutes
video_duration_seconds <- c(300, 600, 900, 1200)  # seconds

cor(video_length, video_duration_seconds)
#> [1] 1  # Perfect correlation!
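If you put both predictors into a model anyway, R cannot separate their effects: with perfectly correlated columns, lm() reports one coefficient as NA. A quick sketch, using a hypothetical engagement variable:

engagement <- c(120, 250, 310, 400)   # hypothetical response values

lm(engagement ~ video_length + video_duration_seconds)
#> The coefficient for video_duration_seconds is reported as NA:
#> lm() drops the redundant (aliased) column automatically.

Keeping only one of the two variables avoids this and keeps the model interpretable.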

Strategy 2: Combine Variables

Combine correlated predictors into a single meaningful variable.

Example: Plant Traits

Instead of using both:

  • leaf_length
  • leaf_width

Create a composite:

# Option 1: Multiply (approximates leaf area)
leaf_size <- leaf_length * leaf_width

# Option 2: Add scaled values (equal contribution)
leaf_size <- scale(leaf_length) + scale(leaf_width)

Use the composite in your model:

model <- lm(growth ~ leaf_size + other_variables)

Strategy 3: Use PCA (Principal Component Analysis)

Replace correlated predictors with uncorrelated components.

PCA constructs new variables (PCs) that:

  • Are linear combinations of the original variables
  • Are uncorrelated with each other by construction
  • Are ordered so that the first components capture the most variance

Example

# Many correlated environmental predictors
X <- your_data[, c("temp", "humidity", "rainfall", "sunlight")]

# Apply PCA
pca <- prcomp(X, scale. = TRUE)
summary(pca)

# Use first components as predictors
pca_scores <- pca$x[, 1:2]  # Keep first 2 PCs
model <- lm(y ~ pca_scores)
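Because the components themselves are harder to interpret, it helps to check how much variance each one explains and which original variables drive it. A short sketch, continuing from the pca object above:

# Proportion of variance explained by each component
summary(pca)$importance

# Loadings: how strongly each original variable contributes to PC1 and PC2
pca$rotation[, 1:2]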

Strategy 4: Regularization

Modify regression to penalize large coefficients.

This stabilizes estimates when predictors are correlated.

Two common approaches:

  • Ridge regression (L2 penalty)
  • Lasso regression (L1 penalty)
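In standard notation (a textbook formulation, not specific to this course), both methods minimize the usual residual sum of squares plus a penalty on the coefficients; only the form of the penalty differs:

$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

Larger values of the penalty parameter λ shrink the coefficients more strongly; λ is typically chosen by cross-validation, as done with cv.glmnet() below.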

Ridge Regression (L2 Penalty)

Characteristics:

  • Shrinks coefficients toward zero
  • Keeps all predictors
  • Works well with multicollinearity

library(glmnet)

# Prepare data
X <- as.matrix(predictors)
y <- response

# Ridge regression (alpha = 0)
ridge <- glmnet(X, y, alpha = 0)

# Cross-validation to find best penalty
cv_ridge <- cv.glmnet(X, y, alpha = 0)
best_lambda <- cv_ridge$lambda.min

# Final model
ridge_final <- glmnet(X, y, alpha = 0, lambda = best_lambda)
coef(ridge_final)

Lasso Regression (L1 Penalty)

  • Shrinks coefficients
  • Sets some coefficients exactly to zero
  • Performs variable selection

# Lasso regression (alpha = 1)
lasso <- glmnet(X, y, alpha = 1)

# Cross-validation
cv_lasso <- cv.glmnet(X, y, alpha = 1)
best_lambda <- cv_lasso$lambda.min

# Final model
lasso_final <- glmnet(X, y, alpha = 1, lambda = best_lambda)
coef(lasso_final)
#> Some coefficients are exactly 0!

Comparison of Strategies

| Method  | Variables Kept | Interpretability | Stability |
|---------|----------------|------------------|-----------|
| Remove  | Fewer          | High             | High      |
| Combine | Fewer          | Medium           | High      |
| PCA     | Components     | Low              | Very high |
| Ridge   | All            | Medium           | High      |
| Lasso   | Some           | High             | Medium    |

Which Strategy to Choose?

graph TD
    A[Have Multicollinearity] --> B{Primary Goal?}
    B -->|Interpretation| C[Remove or Combine or Lasso]
    B -->|Prediction| D[PCA or Ridge]

    C --> E{How many variables?}
    E -->|2-3| F[Remove or Combine]
    E -->|Many| G[Lasso]

    D --> H{Need all variables?}
    H -->|Yes| I[Ridge]
    H -->|No| J[PCA]

Detecting Multicollinearity

Before applying any strategy, first detect if you have a problem:

Variance Inflation Factor (VIF)

library(car)

# Fit your model
model <- lm(y ~ x1 + x2 + x3, data = your_data)

# Calculate VIF
vif(model)
#>       x1       x2       x3 
#>  12.453    8.921    2.134

# Interpretation:
# VIF < 5: OK
# VIF 5-10: Moderate multicollinearity
# VIF > 10: Severe multicollinearity (take action!)
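For intuition: the VIF of a predictor equals 1 / (1 − R²), where R² comes from regressing that predictor on all the other predictors. A small sketch reproducing the value by hand (using the same hypothetical x1, x2, x3):

# VIF of x1 by hand: regress x1 on the remaining predictors
r2_x1 <- summary(lm(x1 ~ x2 + x3, data = your_data))$r.squared
1 / (1 - r2_x1)   # should match vif(model)["x1"]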

Correlation Matrix

# Calculate correlations
cor_matrix <- cor(your_data)

# Visualize
library(corrplot)
corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black")

# Look for correlations > 0.8 or < -0.8
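Scanning a large matrix by eye is error-prone, so it can help to list the offending pairs directly. A small sketch (variable names depend on your data):

# List all pairs of variables with |r| > 0.8
high <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high[, 1]],
           var2 = colnames(cor_matrix)[high[, 2]],
           r    = round(cor_matrix[high], 2))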

Complete Workflow Example

library(car)
library(glmnet)
library(corrplot)

# 1. Detect multicollinearity
cor_matrix <- cor(df[, c("x1", "x2", "x3")])
corrplot(cor_matrix, method = "number")

model_full <- lm(growth ~ ., data = df)
vif(model_full)
#> If VIF > 10, take action!

# 2. Choose strategy based on goal

# Option A: Remove redundant variable
model_reduced <- lm(growth ~ x1 + x3, data = df)  # Removed x2
vif(model_reduced)  # Check if improved

# Option B: Combine variables
df$composite <- scale(df$x1) + scale(df$x2)
model_combined <- lm(growth ~ composite + x3, data = df)

# Option C: PCA
pca <- prcomp(df[, c("x1", "x2")], scale. = TRUE)
df$PC1 <- pca$x[, 1]
model_pca <- lm(growth ~ PC1 + x3, data = df)

# Option D: Regularization
X <- as.matrix(df[, c("x1", "x2", "x3")])
y <- df$growth

cv_lasso <- cv.glmnet(X, y, alpha = 1)
lasso_final <- glmnet(X, y, alpha = 1, lambda = cv_lasso$lambda.min)

# 3. Compare models
summary(model_reduced)$r.squared
summary(model_pca)$r.squared

Key Take-Home Messages

Remember

✓ Multicollinearity is a consequence of correlated predictors
✓ It affects interpretation, not necessarily prediction
✓ Multiple strategies exist; the choice depends on your goal
✓ PCA and regularization are powerful multivariate tools
✓ Always check VIF to detect multicollinearity
✓ There is no single best solution; it depends on your research question

Practical Advice

Most real datasets have some multicollinearity. The key is knowing when it's a problem (inference) vs. acceptable (prediction), and having the right tools to address it!


Further Resources

Textbooks

  • James, G. et al. (2021). An Introduction to Statistical Learning. Chapter 6.
  • Kutner, M.H. et al. (2005). Applied Linear Statistical Models. Chapter 7.

R Packages

  • car: vif() for multicollinearity diagnostics
  • glmnet: ridge and lasso regression
  • corrplot: visualizing correlation matrices
