Logistic Regression

Learning Objectives

By the end of this section, you should be able to:

  • Understand when to use logistic regression for classification
  • Interpret odds ratios and coefficients
  • Apply logistic regression in R
  • Evaluate model performance with ROC curves and confusion matrices
  • Understand the difference between logistic regression and LDA

What is Logistic Regression?

Logistic regression models the probability that an observation belongs to a particular category based on predictor variables.

Key Concept

Logistic regression predicts probabilities (0 to 1) that are then converted to class predictions.

Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of a binary outcome (Yes/No, Disease/Healthy, Present/Absent).

When to Use Logistic Regression

  • Binary classification - Two outcome categories
  • Interpretable models - Need to explain effects of predictors
  • Probability estimates - Want confidence in predictions
  • Testing hypotheses - Statistical significance of predictors
  • Small to moderate datasets - Works well from roughly n = 50 up to several thousand observations

Common applications:

  • Disease diagnosis (disease vs. healthy)
  • Gene presence/absence prediction
  • Microbiome: presence of pathogen
  • Binary outcomes in experiments (success/failure)
  • Ecological: species presence/absence

The Logistic Function

Logistic regression uses the logistic (sigmoid) function to map predictions to probabilities:

$$ P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots)}} $$

Where:

  • $P(Y=1 \mid X)$ = probability of the outcome being 1 given predictors $X$
  • $\beta_0$ = intercept
  • $\beta_1, \beta_2, \dots$ = coefficients for the predictors
  • $e$ = Euler's number (2.718...)

Visualizing the Logistic Curve

# Generate data
x <- seq(-6, 6, 0.1)
p <- 1 / (1 + exp(-x))

# Plot
plot(x, p, type = "l", lwd = 2, col = "blue",
     xlab = "Linear Predictor (β₀ + β₁X)",
     ylab = "Probability P(Y=1)",
     main = "Logistic Function")
abline(h = 0.5, lty = 2, col = "red")
abline(v = 0, lty = 2, col = "red")

Key properties:

  • Output is always between 0 and 1
  • S-shaped curve
  • When linear predictor = 0, probability = 0.5
  • Steep change around decision boundary
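
These properties are easy to verify numerically with base R's plogis(), which implements exactly this function:

plogis(0)              # 0.5 at the decision boundary
plogis(c(-10, 10))     # squeezed toward 0 and 1 in the tails
1 / (1 + exp(-2))      # the formula by hand...
plogis(2)              # ...matches the built-in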

The Logit Transformation

To make the model linear, we use the logit (log-odds):

$$ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots $$

Where:

  • $\frac{p}{1-p}$ = odds of the event occurring
  • $\log(\text{odds})$ = log-odds (logit)

Odds vs. Probability

  • Probability: $P(\text{event}) = 0.75$ means a 75% chance
  • Odds: $\frac{0.75}{0.25} = 3$ means "3 to 1" odds
  • Log-odds: $\log(3) = 1.099$
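
Base R moves between these scales directly: qlogis() is the logit and plogis() is its inverse, so no hand-rolled helpers are needed:

p <- 0.75
odds <- p / (1 - p)    # 3, i.e. "3 to 1"
log(odds)              # 1.0986, the log-odds
qlogis(p)              # same value via the built-in logit
plogis(qlogis(p))      # inverse logit recovers 0.75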

Logistic Regression in R

Basic Example

# Simulate data: disease diagnosis based on age
set.seed(123)
n <- 200
age <- rnorm(n, mean = 50, sd = 15)
disease <- rbinom(n, 1, prob = 1/(1 + exp(-(age - 50)/10)))

# Create dataframe
df <- data.frame(age = age, disease = disease)

# Fit logistic regression
model <- glm(disease ~ age, data = df, family = binomial(link = "logit"))

# View results
summary(model)

Understanding the Output

# Coefficients
coef(model)
#>  (Intercept)          age 
#>   -5.234567    0.098765

# Interpretation:
# For each 1-year increase in age, 
# the log-odds of disease increase by 0.098765

# Convert to odds ratio
exp(coef(model))
#>  (Intercept)          age 
#>    0.0053421    1.1038214

# Interpretation:
# For each 1-year increase in age,
# the odds of disease multiply by 1.10 (10% increase)
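
Odds ratios are usually reported with 95% confidence intervals. One standard approach is to exponentiate profile-likelihood intervals from confint():

# Odds ratios with profile-likelihood 95% CIs
exp(cbind(OR = coef(model), confint(model)))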

Making Predictions

# Predict probabilities
new_data <- data.frame(age = c(30, 50, 70))
predictions <- predict(model, newdata = new_data, type = "response")

print(predictions)
#>         1         2         3 
#> 0.01823   0.50000   0.98177

# Predict classes (using 0.5 threshold)
predicted_class <- ifelse(predictions > 0.5, 1, 0)
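
To demystify predict(), the probability for a 70-year-old can be computed by hand from the fitted coefficients:

# Manual prediction: linear predictor, then inverse logit
b <- coef(model)
eta <- b["(Intercept)"] + b["age"] * 70   # log-odds at age 70
1 / (1 + exp(-eta))                       # same as predict(..., type = "response")
plogis(eta)                               # equivalent base R shortcut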

Multiple Predictors

Example: Disease Prediction

# Simulate data with multiple predictors
set.seed(456)
n <- 300
age <- rnorm(n, 50, 15)
bmi <- rnorm(n, 25, 5)
smoker <- rbinom(n, 1, 0.3)

# Generate outcome
linear_pred <- -8 + 0.05*age + 0.1*bmi + 1.5*smoker
prob <- 1 / (1 + exp(-linear_pred))
disease <- rbinom(n, 1, prob)

# Create dataframe
df <- data.frame(age = age, bmi = bmi, smoker = smoker, disease = disease)

# Fit model
model_multi <- glm(disease ~ age + bmi + smoker, 
                   data = df, 
                   family = binomial)

summary(model_multi)

# Odds ratios
exp(coef(model_multi))
#>  (Intercept)          age          bmi       smoker 
#>    0.0003355    1.0512345    1.1053210    4.4816890

# Interpretation:
# - Each year of age: 5.1% increase in odds
# - Each BMI point: 10.5% increase in odds
# - Being a smoker: 348% increase in odds (4.48x)

Statistical Significance

# Wald test (default in summary)
summary(model_multi)$coefficients
#>               Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  -8.05234    1.23456  -6.523  7.0e-11
#> age           0.04987    0.01234   4.041  5.3e-05
#> bmi           0.10012    0.02345   4.270  1.9e-05
#> smoker        1.50000    0.34567   4.340  1.4e-05

# Likelihood ratio test (model vs. intercept-only model by default)
library(lmtest)
lrtest(model_multi)

# Compare nested models
model_reduced <- glm(disease ~ age + bmi, data = df, family = binomial)
anova(model_reduced, model_multi, test = "Chisq")
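
There is no universally agreed R² for logistic models, but McFadden's pseudo-R² is a common single-number summary, computed from the deviances stored in the fitted model:

# McFadden's pseudo R-squared
1 - model_multi$deviance / model_multi$null.deviance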

Model Evaluation

Confusion Matrix

# Predictions on training data
predicted_probs <- predict(model_multi, type = "response")
predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)

# Confusion matrix
table(Predicted = predicted_class, Actual = df$disease)
#>          Actual
#> Predicted   0   1
#>         0 180  15
#>         1  20  85

# Accuracy
mean(predicted_class == df$disease)
#> [1] 0.883

Performance Metrics

library(caret)

# Comprehensive metrics
confusionMatrix(factor(predicted_class), 
                factor(df$disease), 
                positive = "1")
#> Accuracy : 0.883
#> Sensitivity : 0.850  (True Positive Rate)
#> Specificity : 0.900  (True Negative Rate)
#> Precision : 0.810   (Positive Predictive Value)
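
To make these definitions concrete, the same metrics can be computed by hand from the 2x2 table:

# Metrics by hand from the confusion matrix
cm <- table(Predicted = predicted_class, Actual = df$disease)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
c(sensitivity = TP / (TP + FN),   # true positive rate
  specificity = TN / (TN + FP),   # true negative rate
  precision   = TP / (TP + FP))   # positive predictive value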

ROC Curve and AUC

library(pROC)

# Create ROC object
roc_obj <- roc(df$disease, predicted_probs)

# Plot ROC curve
plot(roc_obj, main = "ROC Curve", 
     print.auc = TRUE, 
     col = "blue", lwd = 2)

# AUC (Area Under Curve)
auc(roc_obj)
#> Area under the curve: 0.925

# Interpretation:
# AUC = 0.925 means 92.5% chance that a random diseased
# individual has higher predicted probability than a random healthy one

AUC Interpretation:

  • 0.90-1.0 - Excellent discrimination
  • 0.80-0.90 - Good discrimination
  • 0.70-0.80 - Acceptable discrimination
  • 0.60-0.70 - Poor discrimination
  • 0.50-0.60 - Fail (no better than random)
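
The 0.5 cutoff used earlier is a convention, not a requirement. pROC can suggest the threshold that maximizes Youden's J (sensitivity + specificity - 1):

# Threshold maximizing Youden's J, from the ROC object above
coords(roc_obj, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))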

Cross-Validation

K-Fold Cross-Validation

library(caret)

# caret's classProbs/twoClassSummary need a factor outcome
# whose levels are valid R names, so recode 0/1 first
df$disease_f <- factor(df$disease, levels = c(0, 1),
                       labels = c("healthy", "diseased"))

# Set up 10-fold CV
train_control <- trainControl(method = "cv", 
                              number = 10,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

# Train model
set.seed(789)
cv_model <- train(disease_f ~ age + bmi + smoker,
                  data = df,
                  method = "glm",
                  family = binomial,
                  trControl = train_control,
                  metric = "ROC")

# Results
print(cv_model)
#> ROC        Sens       Spec     
#> 0.9123     0.8421     0.8947

# Variable importance
varImp(cv_model)

Train-Test Split

# Split data
set.seed(123)
train_idx <- createDataPartition(df$disease, p = 0.7, list = FALSE)
train_data <- df[train_idx, ]
test_data <- df[-train_idx, ]

# Train on training set
model_train <- glm(disease ~ age + bmi + smoker,
                   data = train_data,
                   family = binomial)

# Evaluate on test set
test_probs <- predict(model_train, newdata = test_data, type = "response")
test_pred <- ifelse(test_probs > 0.5, 1, 0)

# Test set performance (treat "1" = diseased as the positive class)
confusionMatrix(factor(test_pred), factor(test_data$disease), positive = "1")

Model Diagnostics

Checking Assumptions

Logistic regression assumptions:

  1. Binary outcome - Response is 0 or 1
  2. Independence - Observations are independent
  3. Linear relationship - Between logit and predictors
  4. No multicollinearity - Predictors not highly correlated
  5. Large sample size - At least ~10 events per predictor (see the quick check below)
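
The events-per-variable rule from item 5 can be checked in one line for the model above (a rough rule of thumb, not a hard cutoff):

# Events per variable: count of the rarer class / number of predictors
min(table(df$disease)) / length(coef(model_multi)[-1])   # want roughly >= 10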

Diagnostic Plots

# 1. Check linearity of the logit (Box-Tidwell approach for GLMs:
#    add x*log(x) terms; a significant term suggests non-linearity)
#    Note: car::boxTidwell fits linear models, so it is not used here
bt_check <- glm(disease ~ age + I(age * log(age)) + bmi + I(bmi * log(bmi)),
                data = df, family = binomial)   # predictors must be positive
summary(bt_check)

# 2. Check for influential points
# Cook's distance
plot(model_multi, which = 4)

# 3. Residual plot
plot(model_multi, which = 1)

# 4. Check multicollinearity
library(car)
vif(model_multi)
#> Should be < 5 (ideally < 3)

Hosmer-Lemeshow Test

Tests goodness of fit:

library(ResourceSelection)

# Hosmer-Lemeshow test
hoslem.test(df$disease, fitted(model_multi))
#> A large p-value (> 0.05) shows no evidence of poor fit

Logistic Regression vs. LDA

Feature         | Logistic Regression                 | LDA
----------------|-------------------------------------|-------------------------------------------
Assumptions     | Fewer (just linearity of the logit) | Multivariate normality, equal covariances
Output          | Probabilities and coefficients      | Discriminant functions
Interpretation  | Direct: odds ratios                 | Indirect: loadings
Flexibility     | More robust to violations           | More efficient when assumptions met
Sample size     | Better for small n                  | Needs adequate n
Extensions      | Regularization (LASSO, Ridge)       | QDA for unequal covariances

When to Choose

  • Logistic Regression: Default for binary outcomes, especially with non-normal data
  • LDA: When assumptions are met and you have multiple groups
  • Both: Try both and compare performance!
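
A minimal sketch of such a comparison on the simulated data from earlier, fitting MASS::lda on the same predictors and comparing in-sample AUCs:

library(MASS)
library(pROC)

# LDA on the same predictors
lda_fit <- lda(disease ~ age + bmi + smoker, data = df)
lda_probs <- predict(lda_fit)$posterior[, "1"]

# Compare discriminative ability
auc(roc(df$disease, predicted_probs))  # logistic regression
auc(roc(df$disease, lda_probs))        # LDA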

Extensions

Multinomial Logistic Regression

For more than 2 categories (unordered):

library(nnet)

# Simulate 3-category outcome
category <- sample(c("A", "B", "C"), 300, replace = TRUE)
df$category <- category

# Fit multinomial model
multi_model <- multinom(category ~ age + bmi, data = df)

# Predictions (newdata must contain ALL model predictors: age and bmi;
# the earlier new_data only had age, so build a new illustrative frame)
new_obs <- data.frame(age = c(30, 50, 70), bmi = c(22, 25, 30))
predict(multi_model, newdata = new_obs, type = "probs")

Ordinal Logistic Regression

For ordered categories (e.g., disease severity: mild < moderate < severe):

library(MASS)

# Simulate ordered outcome
severity <- factor(sample(c("mild", "moderate", "severe"), 300, replace = TRUE),
                   levels = c("mild", "moderate", "severe"),
                   ordered = TRUE)
df$severity <- severity

# Fit proportional odds model
ord_model <- polr(severity ~ age + bmi, data = df)

# Summary
summary(ord_model)
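
As in the binary case, the coefficients are on the log-odds scale; exponentiating gives proportional odds ratios:

# Proportional odds ratios with profile-likelihood CIs
exp(coef(ord_model))
exp(confint(ord_model))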

Regularized Logistic Regression

For high-dimensional data or preventing overfitting:

library(glmnet)

# Prepare data
X <- as.matrix(df[, c("age", "bmi", "smoker")])
y <- df$disease

# LASSO (L1 regularization)
lasso_model <- cv.glmnet(X, y, family = "binomial", alpha = 1)

# Ridge (L2 regularization)
ridge_model <- cv.glmnet(X, y, family = "binomial", alpha = 0)

# Elastic Net (combination)
elastic_model <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)

# Coefficients at optimal lambda
coef(lasso_model, s = "lambda.min")

# Predictions
predict(lasso_model, newx = X, s = "lambda.min", type = "response")
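
lambda.min minimizes the cross-validated error; lambda.1se selects a sparser model within one standard error of that minimum and is often preferred for interpretability:

# Sparser alternative: coefficients at lambda.1se
coef(lasso_model, s = "lambda.1se")
plot(lasso_model)   # CV curve with both lambdas marked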

Biological Example: Microbiome

Predicting Disease from Microbiome Data

library(phyloseq)
library(glmnet)

# Example data
data(GlobalPatterns)
gp <- GlobalPatterns

# Create binary outcome (e.g., human vs. non-human)
sample_data(gp)$is_human <- ifelse(
  sample_data(gp)$SampleType %in% c("Feces", "Skin", "Tongue"),
  1, 0
)

# Extract and transform OTU data
otu <- as(otu_table(gp), "matrix")
if(taxa_are_rows(gp)) otu <- t(otu)

# Filter rare taxa
otu_filtered <- otu[, colSums(otu > 0) > 5]

# CLR transformation (compositional data)
otu_clr <- t(apply(otu_filtered + 1, 1, function(x) log(x) - mean(log(x))))

# Outcome
outcome <- sample_data(gp)$is_human

# LASSO logistic regression (handles many predictors)
cv_fit <- cv.glmnet(otu_clr, outcome, family = "binomial", alpha = 1)

# Plot cross-validation
plot(cv_fit)

# Important taxa (non-zero coefficients, excluding the intercept)
coef_matrix <- coef(cv_fit, s = "lambda.min")
important_taxa <- setdiff(rownames(coef_matrix)[coef_matrix[, 1] != 0],
                          "(Intercept)")
print(important_taxa)

# Predictions
predictions <- predict(cv_fit, newx = otu_clr, 
                      s = "lambda.min", type = "response")

# ROC
library(pROC)
roc_result <- roc(outcome, predictions[, 1])
plot(roc_result, main = "Microbiome-based Classification")
auc(roc_result)
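
Note that this AUC is computed on the same samples used for fitting and is therefore optimistic. The cross-validated error from cv.glmnet is a fairer guide to out-of-sample performance:

# Cross-validated binomial deviance at the selected lambda
cv_fit$cvm[cv_fit$lambda == cv_fit$lambda.min]
cv_fit$lambda.min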

Common Pitfalls

Mistakes to Avoid

1. Not checking for complete separation

# If predictor perfectly predicts outcome, 
# coefficients will be infinite
# Check warnings in summary()

# Solution: Penalized regression (Firth's method)
library(logistf)
model_firth <- logistf(disease ~ predictor, data = df)

2. Wrong link function

# BAD: Gaussian family fits ordinary linear regression to a 0/1 outcome
glm(disease ~ age, data = df, family = gaussian)

# GOOD: Binomial family with logit link
glm(disease ~ age, data = df, family = binomial(link = "logit"))

3. Interpreting coefficients as probabilities

# Coefficient = 0.5 does NOT mean 50% increase in probability!
# It means exp(0.5) = 1.65x increase in ODDS

# Always convert to odds ratios
exp(coef(model))

4. Not checking for multicollinearity

# BAD: Highly correlated predictors

# GOOD: Check VIF
library(car)
vif(model)  # Should be < 5

5. Using training accuracy only

# BAD: Overfitting to training data

# GOOD: Use cross-validation or test set
cv_model <- train(..., trControl = trainControl(method = "cv"))


Practical Workflow

Complete logistic regression analysis:

library(caret)
library(pROC)
library(car)

# 1. Load and prepare data
data <- your_data
# caret needs factor levels that are valid R names, so add labels
data$outcome <- factor(data$outcome, levels = c(0, 1), labels = c("No", "Yes"))

# 2. Exploratory analysis
summary(data)
table(data$outcome)  # Check balance

# 3. Check for multicollinearity
cor(data[, predictors])

# 4. Split data
set.seed(123)
train_idx <- createDataPartition(data$outcome, p = 0.7, list = FALSE)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

# 5. Fit model
model <- glm(outcome ~ predictor1 + predictor2 + predictor3,
             data = train_data,
             family = binomial)

# 6. Check for warnings (separation, convergence)
summary(model)

# 7. Check VIF
vif(model)

# 8. Odds ratios and 95% CI
exp(cbind(OR = coef(model), confint(model)))

# 9. Cross-validation
train_control <- trainControl(method = "cv", number = 10,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

cv_model <- train(outcome ~ .,
                  data = train_data[, c("outcome", predictors)],
                  method = "glm",
                  family = binomial,
                  trControl = train_control)

print(cv_model)

# 10. Test set evaluation (probabilities are for the "Yes" class)
test_probs <- predict(model, newdata = test_data, type = "response")
test_pred <- ifelse(test_probs > 0.5, "Yes", "No")

# 11. Performance metrics
confusionMatrix(factor(test_pred, levels = c("No", "Yes")),
                test_data$outcome, positive = "Yes")

# 12. ROC curve
roc_test <- roc(test_data$outcome, test_probs)
plot(roc_test)
auc(roc_test)

# 13. Save results
saveRDS(model, "logistic_model.rds")
write.csv(data.frame(predicted = test_probs, 
                     actual = test_data$outcome),
          "predictions.csv")

Exercises

Practice Problems

Exercise 1: Basic Logistic Regression

Use the built-in mtcars dataset to predict transmission type (am: 0=automatic, 1=manual):

  1. Fit a logistic model with mpg, hp, and wt as predictors
  2. Interpret the odds ratios
  3. Which variables are significant?
  4. Calculate the accuracy
Solution
# 1. Fit model
model <- glm(am ~ mpg + hp + wt, 
            data = mtcars, 
            family = binomial)

# 2. Odds ratios
exp(coef(model))
#> mpg: For each 1 mpg increase, odds multiply by X
#> hp: For each 1 hp increase, odds multiply by Y
#> wt: For each 1000 lb increase, odds multiply by Z

# 3. Significance
summary(model)$coefficients[, "Pr(>|z|)"]
# p < 0.05 are significant

# 4. Accuracy
predictions <- ifelse(predict(model, type = "response") > 0.5, 1, 0)
mean(predictions == mtcars$am)

Exercise 2: Model Comparison

Compare three models for mtcars$am:

  1. Model A: Only mpg
  2. Model B: mpg + hp
  3. Model C: mpg + hp + wt

Which is best? Use AIC and likelihood ratio tests.

Solution
# Fit models
model_a <- glm(am ~ mpg, data = mtcars, family = binomial)
model_b <- glm(am ~ mpg + hp, data = mtcars, family = binomial)
model_c <- glm(am ~ mpg + hp + wt, data = mtcars, family = binomial)

# Compare AIC (lower is better)
AIC(model_a, model_b, model_c)

# Likelihood ratio tests
anova(model_a, model_b, model_c, test = "Chisq")

# Best model is the one with:
# - Lowest AIC
# - Significant improvement in LRT
# - Good interpretability/parsimony

Key Takeaways

Remember These Concepts

  • Logistic regression predicts probabilities of binary outcomes
  • Coefficients represent changes in log-odds; exp(coef) gives odds ratios
  • Odds ratio > 1 means increased odds; < 1 means decreased odds
  • Always use cross-validation or a test set to avoid overfitting
  • ROC/AUC measure discriminative ability across all thresholds
  • Check assumptions: linearity of the logit, no multicollinearity, independence
  • Use regularization (LASSO/Ridge) for high-dimensional data
  • Interpret carefully: coefficients are on the log-odds scale


Further Resources

Textbooks

  • Hosmer, D.W. et al. (2013). Applied Logistic Regression (3rd ed.). Wiley.
  • Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.

R Packages

  • stats - glm() function
  • caret - Unified ML interface
  • pROC - ROC curve analysis
  • glmnet - Regularized regression


Logistic Regression in Practice

Logistic regression is one of the most widely used classification methods because it's:

  • Interpretable - coefficients have clear meaning
  • Fast - quick to fit even with large datasets
  • Robust - works well even when assumptions are slightly violated
  • Extensible - easily handles categorical predictors, interactions, regularization