Logistic Regression

Learning Objectives

By the end of this section, you should be able to:

  • Understand when to use logistic regression for classification
  • Interpret odds ratios and coefficients
  • Apply logistic regression in R
  • Evaluate model performance with ROC curves and confusion matrices
  • Understand the difference between logistic regression and LDA

What is Logistic Regression?

Logistic regression models the probability that an observation belongs to a particular category based on predictor variables.

Key Concept

Logistic regression predicts probabilities (0 to 1) that are then converted to class predictions.

Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of a binary outcome (Yes/No, Disease/Healthy, Present/Absent).

When to Use Logistic Regression

  • Binary classification - Two outcome categories
  • Interpretable models - Need to explain effects of predictors
  • Probability estimates - Want confidence in predictions
  • Testing hypotheses - Statistical significance of predictors
  • Small to moderate datasets - Works well from roughly n = 50 up to several thousand observations

Common applications:

  • Disease diagnosis (disease vs. healthy)
  • Gene presence/absence prediction
  • Microbiome: presence of pathogen
  • Binary outcomes in experiments (success/failure)
  • Ecological: species presence/absence

The Logistic Function

Logistic regression uses the logistic (sigmoid) function to map predictions to probabilities:

$$ P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots)}} $$

Where:

  • $P(Y=1 \mid X)$ = probability of the outcome being 1 given predictors $X$
  • $\beta_0$ = intercept
  • $\beta_1, \beta_2, \dots$ = coefficients for the predictors
  • $e$ = Euler's number (2.718...)

Visualizing the Logistic Curve

# Generate data
x <- seq(-6, 6, 0.1)
p <- 1 / (1 + exp(-x))

# Plot
plot(x, p, type = "l", lwd = 2, col = "blue",
     xlab = "Linear Predictor (β₀ + β₁X)",
     ylab = "Probability P(Y=1)",
     main = "Logistic Function")
abline(h = 0.5, lty = 2, col = "red")
abline(v = 0, lty = 2, col = "red")

Key properties:

  • Output is always between 0 and 1
  • S-shaped curve
  • When linear predictor = 0, probability = 0.5
  • Steep change around decision boundary
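
These properties are easy to verify numerically with base R's plogis(), which implements exactly this function:

plogis(0)              # 0.5 at the decision boundary
plogis(c(-10, 10))     # squeezed toward 0 and 1 in the tails
1 / (1 + exp(-2))      # the formula by hand...
plogis(2)              # ...matches the built-in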

The Logit Transformation

To make the model linear, we use the logit (log-odds):

$$ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots $$

Where:

  • $\frac{p}{1-p}$ = odds of the event occurring
  • $\log(\text{odds})$ = log-odds (logit)

Odds vs. Probability

  • Probability: $P(\text{event}) = 0.75$ means a 75% chance
  • Odds: $\frac{0.75}{0.25} = 3$ means "3 to 1" odds
  • Log-odds: $\log(3) = 1.099$
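
Base R moves between these scales directly: qlogis() is the logit and plogis() is its inverse, so no hand-rolled helpers are needed:

p <- 0.75
odds <- p / (1 - p)    # 3, i.e. "3 to 1"
log(odds)              # 1.0986, the log-odds
qlogis(p)              # same value via the built-in logit
plogis(qlogis(p))      # inverse logit recovers 0.75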

Logistic Regression in R

Basic Example

# Simulate data: disease diagnosis based on age
set.seed(123)
n <- 200
age <- rnorm(n, mean = 50, sd = 15)
disease <- rbinom(n, 1, prob = 1/(1 + exp(-(age - 50)/10)))

# Create dataframe
df <- data.frame(age = age, disease = disease)

# Fit logistic regression
model <- glm(disease ~ age, data = df, family = binomial(link = "logit"))

# View results
summary(model)

Understanding the Output

# Coefficients
coef(model)
#>  (Intercept)          age 
#>   -5.234567    0.098765

# Interpretation:
# For each 1-year increase in age, 
# the log-odds of disease increase by 0.098765

# Convert to odds ratio
exp(coef(model))
#>  (Intercept)          age 
#>    0.0053421    1.1038214

# Interpretation:
# For each 1-year increase in age,
# the odds of disease multiply by 1.10 (10% increase)
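
Odds ratios are usually reported with 95% confidence intervals. One standard approach is to exponentiate profile-likelihood intervals from confint():

# Odds ratios with profile-likelihood 95% CIs
exp(cbind(OR = coef(model), confint(model)))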

Making Predictions

# Predict probabilities
new_data <- data.frame(age = c(30, 50, 70))
predictions <- predict(model, newdata = new_data, type = "response")

print(predictions)
#>         1         2         3 
#> 0.01823   0.50000   0.98177

# Predict classes (using 0.5 threshold)
predicted_class <- ifelse(predictions > 0.5, 1, 0)
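
To demystify predict(), the probability for a 70-year-old can be computed by hand from the fitted coefficients:

# Manual prediction: linear predictor, then inverse logit
b <- coef(model)
eta <- b["(Intercept)"] + b["age"] * 70   # log-odds at age 70
1 / (1 + exp(-eta))                       # same as predict(..., type = "response")
plogis(eta)                               # equivalent base R shortcut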

Multiple Predictors

Example: Disease Prediction

# Simulate data with multiple predictors
set.seed(456)
n <- 300
age <- rnorm(n, 50, 15)
bmi <- rnorm(n, 25, 5)
smoker <- rbinom(n, 1, 0.3)

# Generate outcome
linear_pred <- -8 + 0.05*age + 0.1*bmi + 1.5*smoker
prob <- 1 / (1 + exp(-linear_pred))
disease <- rbinom(n, 1, prob)

# Create dataframe
df <- data.frame(age = age, bmi = bmi, smoker = smoker, disease = disease)

# Fit model
model_multi <- glm(disease ~ age + bmi + smoker, 
                   data = df, 
                   family = binomial)

summary(model_multi)

# Odds ratios
exp(coef(model_multi))
#>  (Intercept)          age          bmi       smoker 
#>    0.0003355    1.0512345    1.1053210    4.4816890

# Interpretation:
# - Each year of age: 5.1% increase in odds
# - Each BMI point: 10.5% increase in odds
# - Being a smoker: 348% increase in odds (4.48x)

Statistical Significance

# Wald test (default in summary)
summary(model_multi)$coefficients
#>               Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  -8.05234    1.23456  -6.523  7.0e-11
#> age           0.04987    0.01234   4.041  5.3e-05
#> bmi           0.10012    0.02345   4.270  1.9e-05
#> smoker        1.50000    0.34567   4.340  1.4e-05

# Likelihood ratio test (model vs. intercept-only model by default)
library(lmtest)
lrtest(model_multi)

# Compare nested models
model_reduced <- glm(disease ~ age + bmi, data = df, family = binomial)
anova(model_reduced, model_multi, test = "Chisq")
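
There is no universally agreed R² for logistic models, but McFadden's pseudo-R² is a common single-number summary, computed from the deviances stored in the fitted model:

# McFadden's pseudo R-squared
1 - model_multi$deviance / model_multi$null.deviance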

Model Evaluation

Confusion Matrix

# Predictions on training data
predicted_probs <- predict(model_multi, type = "response")
predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)

# Confusion matrix
table(Predicted = predicted_class, Actual = df$disease)
#>          Actual
#> Predicted   0   1
#>         0 180  15
#>         1  20  85

# Accuracy
mean(predicted_class == df$disease)
#> [1] 0.883

Performance Metrics

library(caret)

# Comprehensive metrics
confusionMatrix(factor(predicted_class), 
                factor(df$disease), 
                positive = "1")
#> Accuracy : 0.883
#> Sensitivity : 0.850  (True Positive Rate)
#> Specificity : 0.900  (True Negative Rate)
#> Precision : 0.810   (Positive Predictive Value)
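
To make these definitions concrete, the same metrics can be computed by hand from the 2x2 table:

# Metrics by hand from the confusion matrix
cm <- table(Predicted = predicted_class, Actual = df$disease)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
c(sensitivity = TP / (TP + FN),   # true positive rate
  specificity = TN / (TN + FP),   # true negative rate
  precision   = TP / (TP + FP))   # positive predictive value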

ROC Curve and AUC

library(pROC)

# Create ROC object
roc_obj <- roc(df$disease, predicted_probs)

# Plot ROC curve
plot(roc_obj, main = "ROC Curve", 
     print.auc = TRUE, 
     col = "blue", lwd = 2)

# AUC (Area Under Curve)
auc(roc_obj)
#> Area under the curve: 0.925

# Interpretation:
# AUC = 0.925 means 92.5% chance that a random diseased
# individual has higher predicted probability than a random healthy one

AUC Interpretation:

  • 0.90-1.0 - Excellent discrimination
  • 0.80-0.90 - Good discrimination
  • 0.70-0.80 - Acceptable discrimination
  • 0.60-0.70 - Poor discrimination
  • 0.50-0.60 - Fail (no better than random)
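
The 0.5 cutoff used earlier is a convention, not a requirement. pROC can suggest the threshold that maximizes Youden's J (sensitivity + specificity - 1):

# Threshold maximizing Youden's J, from the ROC object above
coords(roc_obj, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))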

Cross-Validation

K-Fold Cross-Validation

library(caret)

# caret's classProbs/twoClassSummary need a factor outcome
# whose levels are valid R names, so recode 0/1 first
df$disease_f <- factor(df$disease, levels = c(0, 1),
                       labels = c("healthy", "diseased"))

# Set up 10-fold CV
train_control <- trainControl(method = "cv", 
                              number = 10,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

# Train model
set.seed(789)
cv_model <- train(disease_f ~ age + bmi + smoker,
                  data = df,
                  method = "glm",
                  family = binomial,
                  trControl = train_control,
                  metric = "ROC")

# Results
print(cv_model)
#> ROC        Sens       Spec     
#> 0.9123     0.8421     0.8947

# Variable importance
varImp(cv_model)

Train-Test Split

# Split data
set.seed(123)
train_idx <- createDataPartition(df$disease, p = 0.7, list = FALSE)
train_data <- df[train_idx, ]
test_data <- df[-train_idx, ]

# Train on training set
model_train <- glm(disease ~ age + bmi + smoker,
                   data = train_data,
                   family = binomial)

# Evaluate on test set
test_probs <- predict(model_train, newdata = test_data, type = "response")
test_pred <- ifelse(test_probs > 0.5, 1, 0)

# Test set performance (treat "1" = diseased as the positive class)
confusionMatrix(factor(test_pred), factor(test_data$disease), positive = "1")

Model Diagnostics

Checking Assumptions

Logistic regression assumptions:

  1. Binary outcome - Response is 0 or 1
  2. Independence - Observations are independent
  3. Linear relationship - Between logit and predictors
  4. No multicollinearity - Predictors not highly correlated
  5. Large sample size - At least ~10 events per predictor (see the quick check below)
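
The events-per-variable rule from item 5 can be checked in one line for the model above (a rough rule of thumb, not a hard cutoff):

# Events per variable: count of the rarer class / number of predictors
min(table(df$disease)) / length(coef(model_multi)[-1])   # want roughly >= 10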

Diagnostic Plots

# 1. Check linearity of the logit (Box-Tidwell approach for GLMs:
#    add x*log(x) terms; a significant term suggests non-linearity)
#    Note: car::boxTidwell fits linear models, so it is not used here
bt_check <- glm(disease ~ age + I(age * log(age)) + bmi + I(bmi * log(bmi)),
                data = df, family = binomial)   # predictors must be positive
summary(bt_check)

# 2. Check for influential points
# Cook's distance
plot(model_multi, which = 4)

# 3. Residual plot
plot(model_multi, which = 1)

# 4. Check multicollinearity
library(car)
vif(model_multi)
#> Should be < 5 (ideally < 3)

Hosmer-Lemeshow Test

Tests goodness of fit:

library(ResourceSelection)

# Hosmer-Lemeshow test
hoslem.test(df$disease, fitted(model_multi))
#> A large p-value (> 0.05) shows no evidence of poor fit

Logistic Regression vs. LDA

Feature         | Logistic Regression                 | LDA
----------------|-------------------------------------|-------------------------------------------
Assumptions     | Fewer (just linearity of the logit) | Multivariate normality, equal covariances
Output          | Probabilities and coefficients      | Discriminant functions
Interpretation  | Direct: odds ratios                 | Indirect: loadings
Flexibility     | More robust to violations           | More efficient when assumptions met
Sample size     | Better for small n                  | Needs adequate n
Extensions      | Regularization (LASSO, Ridge)       | QDA for unequal covariances

When to Choose

  • Logistic Regression: Default for binary outcomes, especially with non-normal data
  • LDA: When assumptions are met and you have multiple groups
  • Both: Try both and compare performance!
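
A minimal sketch of such a comparison on the simulated data from earlier, fitting MASS::lda on the same predictors and comparing in-sample AUCs:

library(MASS)
library(pROC)

# LDA on the same predictors
lda_fit <- lda(disease ~ age + bmi + smoker, data = df)
lda_probs <- predict(lda_fit)$posterior[, "1"]

# Compare discriminative ability
auc(roc(df$disease, predicted_probs))  # logistic regression
auc(roc(df$disease, lda_probs))        # LDA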

Extensions

Multinomial Logistic Regression

For more than 2 categories (unordered):

library(nnet)

# Simulate 3-category outcome
category <- sample(c("A", "B", "C"), 300, replace = TRUE)
df$category <- category

# Fit multinomial model
multi_model <- multinom(category ~ age + bmi, data = df)

# Predictions (newdata must contain ALL model predictors: age and bmi;
# the earlier new_data only had age, so build a new illustrative frame)
new_obs <- data.frame(age = c(30, 50, 70), bmi = c(22, 25, 30))
predict(multi_model, newdata = new_obs, type = "probs")

Ordinal Logistic Regression

For ordered categories (e.g., disease severity: mild < moderate < severe):

library(MASS)

# Simulate ordered outcome
severity <- factor(sample(c("mild", "moderate", "severe"), 300, replace = TRUE),
                   levels = c("mild", "moderate", "severe"),
                   ordered = TRUE)
df$severity <- severity

# Fit proportional odds model
ord_model <- polr(severity ~ age + bmi, data = df)

# Summary
summary(ord_model)
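
As in the binary case, the coefficients are on the log-odds scale; exponentiating gives proportional odds ratios:

# Proportional odds ratios with profile-likelihood CIs
exp(coef(ord_model))
exp(confint(ord_model))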

Regularized Logistic Regression

For high-dimensional data or preventing overfitting:

library(glmnet)

# Prepare data
X <- as.matrix(df[, c("age", "bmi", "smoker")])
y <- df$disease

# LASSO (L1 regularization)
lasso_model <- cv.glmnet(X, y, family = "binomial", alpha = 1)

# Ridge (L2 regularization)
ridge_model <- cv.glmnet(X, y, family = "binomial", alpha = 0)

# Elastic Net (combination)
elastic_model <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)

# Coefficients at optimal lambda
coef(lasso_model, s = "lambda.min")

# Predictions
predict(lasso_model, newx = X, s = "lambda.min", type = "response")
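
lambda.min minimizes the cross-validated error; lambda.1se selects a sparser model within one standard error of that minimum and is often preferred for interpretability:

# Sparser alternative: coefficients at lambda.1se
coef(lasso_model, s = "lambda.1se")
plot(lasso_model)   # CV curve with both lambdas marked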

Biological Example: Microbiome

Predicting Disease from Microbiome Data

library(phyloseq)
library(glmnet)

# Example data
data(GlobalPatterns)
gp <- GlobalPatterns

# Create binary outcome (e.g., human vs. non-human)
sample_data(gp)$is_human <- ifelse(
  sample_data(gp)$SampleType %in% c("Feces", "Skin", "Tongue"),
  1, 0
)

# Extract and transform OTU data
otu <- as(otu_table(gp), "matrix")
if(taxa_are_rows(gp)) otu <- t(otu)

# Filter rare taxa
otu_filtered <- otu[, colSums(otu > 0) > 5]

# CLR transformation (compositional data)
otu_clr <- t(apply(otu_filtered + 1, 1, function(x) log(x) - mean(log(x))))

# Outcome
outcome <- sample_data(gp)$is_human

# LASSO logistic regression (handles many predictors)
cv_fit <- cv.glmnet(otu_clr, outcome, family = "binomial", alpha = 1)

# Plot cross-validation
plot(cv_fit)

# Important taxa (non-zero coefficients, excluding the intercept)
coef_matrix <- coef(cv_fit, s = "lambda.min")
important_taxa <- setdiff(rownames(coef_matrix)[coef_matrix[, 1] != 0],
                          "(Intercept)")
print(important_taxa)

# Predictions
predictions <- predict(cv_fit, newx = otu_clr, 
                      s = "lambda.min", type = "response")

# ROC
library(pROC)
roc_result <- roc(outcome, predictions[, 1])
plot(roc_result, main = "Microbiome-based Classification")
auc(roc_result)
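
Note that this AUC is computed on the same samples used for fitting and is therefore optimistic. The cross-validated error from cv.glmnet is a fairer guide to out-of-sample performance:

# Cross-validated binomial deviance at the selected lambda
cv_fit$cvm[cv_fit$lambda == cv_fit$lambda.min]
cv_fit$lambda.min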

Common Pitfalls

Mistakes to Avoid

1. Not checking for complete separation

# If predictor perfectly predicts outcome, 
# coefficients will be infinite
# Check warnings in summary()

# Solution: Penalized regression (Firth's method)
library(logistf)
model_firth <- logistf(disease ~ predictor, data = df)

2. Wrong link function

# BAD: Gaussian family fits ordinary linear regression to a 0/1 outcome
glm(disease ~ age, data = df, family = gaussian)

# GOOD: Binomial family with logit link
glm(disease ~ age, data = df, family = binomial(link = "logit"))

3. Interpreting coefficients as probabilities

# Coefficient = 0.5 does NOT mean 50% increase in probability!
# It means exp(0.5) = 1.65x increase in ODDS

# Always convert to odds ratios
exp(coef(model))

4. Not checking for multicollinearity

# BAD: Highly correlated predictors

# GOOD: Check VIF
library(car)
vif(model)  # Should be < 5

5. Using training accuracy only

# BAD: Overfitting to training data

# GOOD: Use cross-validation or test set
cv_model <- train(..., trControl = trainControl(method = "cv"))


Practical Workflow

Complete logistic regression analysis:

library(caret)
library(pROC)
library(car)

# 1. Load and prepare data
data <- your_data
# caret needs factor levels that are valid R names, so add labels
data$outcome <- factor(data$outcome, levels = c(0, 1), labels = c("No", "Yes"))

# 2. Exploratory analysis
summary(data)
table(data$outcome)  # Check balance

# 3. Check for multicollinearity
cor(data[, predictors])

# 4. Split data
set.seed(123)
train_idx <- createDataPartition(data$outcome, p = 0.7, list = FALSE)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

# 5. Fit model
model <- glm(outcome ~ predictor1 + predictor2 + predictor3,
             data = train_data,
             family = binomial)

# 6. Check for warnings (separation, convergence)
summary(model)

# 7. Check VIF
vif(model)

# 8. Odds ratios and 95% CI
exp(cbind(OR = coef(model), confint(model)))

# 9. Cross-validation
train_control <- trainControl(method = "cv", number = 10,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

cv_model <- train(outcome ~ .,
                  data = train_data[, c("outcome", predictors)],
                  method = "glm",
                  family = binomial,
                  trControl = train_control)

print(cv_model)

# 10. Test set evaluation (probabilities are for the "Yes" class)
test_probs <- predict(model, newdata = test_data, type = "response")
test_pred <- ifelse(test_probs > 0.5, "Yes", "No")

# 11. Performance metrics
confusionMatrix(factor(test_pred, levels = c("No", "Yes")),
                test_data$outcome, positive = "Yes")

# 12. ROC curve
roc_test <- roc(test_data$outcome, test_probs)
plot(roc_test)
auc(roc_test)

# 13. Save results
saveRDS(model, "logistic_model.rds")
write.csv(data.frame(predicted = test_probs, 
                     actual = test_data$outcome),
          "predictions.csv")

Exercises

Practice Problems

Exercise 1: Basic Logistic Regression

Use the built-in mtcars dataset to predict transmission type (am: 0=automatic, 1=manual):

  1. Fit a logistic model with mpg, hp, and wt as predictors
  2. Interpret the odds ratios
  3. Which variables are significant?
  4. Calculate the accuracy
Solution
# 1. Fit model
model <- glm(am ~ mpg + hp + wt, 
            data = mtcars, 
            family = binomial)

# 2. Odds ratios
exp(coef(model))
#> mpg: For each 1 mpg increase, odds multiply by X
#> hp: For each 1 hp increase, odds multiply by Y
#> wt: For each 1000 lb increase, odds multiply by Z

# 3. Significance
summary(model)$coefficients[, "Pr(>|z|)"]
# p < 0.05 are significant

# 4. Accuracy
predictions <- ifelse(predict(model, type = "response") > 0.5, 1, 0)
mean(predictions == mtcars$am)

Exercise 2: Model Comparison

Compare three models for mtcars$am:

  1. Model A: Only mpg
  2. Model B: mpg + hp
  3. Model C: mpg + hp + wt

Which is best? Use AIC and likelihood ratio tests.

Solution
# Fit models
model_a <- glm(am ~ mpg, data = mtcars, family = binomial)
model_b <- glm(am ~ mpg + hp, data = mtcars, family = binomial)
model_c <- glm(am ~ mpg + hp + wt, data = mtcars, family = binomial)

# Compare AIC (lower is better)
AIC(model_a, model_b, model_c)

# Likelihood ratio tests
anova(model_a, model_b, model_c, test = "Chisq")

# Best model is the one with:
# - Lowest AIC
# - Significant improvement in LRT
# - Good interpretability/parsimony

Key Takeaways

Remember These Concepts

  • Logistic regression predicts probabilities of binary outcomes
  • Coefficients represent changes in log-odds; exp(coef) gives odds ratios
  • Odds ratio > 1 means increased odds; < 1 means decreased odds
  • Always use cross-validation or a test set to avoid overfitting
  • ROC/AUC measure discriminative ability across all thresholds
  • Check assumptions: linearity of the logit, no multicollinearity, independence
  • Use regularization (LASSO/Ridge) for high-dimensional data
  • Interpret carefully: coefficients are on the log-odds scale


Further Resources

Textbooks

  • Hosmer, D.W. et al. (2013). Applied Logistic Regression (3rd ed.). Wiley.
  • Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.

R Packages

  • stats - glm() function
  • caret - Unified ML interface
  • pROC - ROC curve analysis
  • glmnet - Regularized regression


Logistic Regression in Practice

Logistic regression is one of the most widely used classification methods because it's:

  • Interpretable - coefficients have clear meaning
  • Fast - quick to fit even with large datasets
  • Robust - works well even when assumptions are slightly violated
  • Extensible - easily handles categorical predictors, interactions, regularization