Random Forests
Random forests are an ensemble method consisting of many decision trees trained on bootstrapped samples of the data. At each split, a random subset of predictors is considered, which decorrelates the trees. Predictions are aggregated by majority vote for classification or by averaging for regression. Averaging many decorrelated trees reduces variance, making random forests far more robust than a single decision tree. They require little preprocessing, naturally capture nonlinear relationships and interactions, and are generally less prone to overfitting than individual trees. Random forests are widely used in fields such as biology for tasks like sample classification and biomarker discovery.
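The bootstrap-and-vote idea can be illustrated by hand with ordinary rpart trees. This is a sketch for intuition only: it omits the per-split predictor subsampling (mtry) that decorrelates the trees, which is exactly what the randomForest package adds. The tree count of 25 is arbitrary.

```r
# Hand-rolled sketch of bagging decision trees (illustration only;
# use the randomForest package in practice)
library(rpart)
data(iris)
set.seed(1)

votes <- replicate(25, {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]  # bootstrap sample
  tree <- rpart(Species ~ ., data = boot)             # one decision tree
  as.character(predict(tree, iris, type = "class"))   # its predictions
})
# Aggregate by majority vote across the 25 trees
ensemble <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(ensemble == iris$Species)  # ensemble accuracy on the training data
```

Note that this evaluates on the training data, so the accuracy is optimistic; a real random forest would use the out-of-bag observations instead.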
When to Use Random Forests
Random forests are well suited to situations where linear boundaries between classes are unlikely, where many predictors interact, or where you want a ranking of variable importance alongside classification. They handle high-dimensional data (many predictors relative to samples) better than LDA or standard logistic regression, though LASSO is often preferred when interpretability of individual coefficients matters.
They are not well suited when you need a probabilistic model with confidence intervals on parameters, when the dataset is extremely small relative to the number of predictors, or when you need to explain the model to a non-technical audience in terms of individual predictor effects.
Classification in R
library(randomForest)
data(iris)
set.seed(123)
# Fit random forest classifier
rf_model <- randomForest(Species ~ .,
                         data = iris,
                         ntree = 500,       # number of trees
                         mtry = 2,          # predictors tried at each split
                         importance = TRUE) # compute variable importance
print(rf_model)
#> Call:
#> randomForest(formula = Species ~ ., data = iris, ntree = 500,
#> mtry = 2, importance = TRUE)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 2
#>
#> OOB estimate of error rate: 4%
#> Confusion matrix:
#> setosa versicolor virginica class.error
#> setosa 50 0 0 0.00
#> versicolor 0 47 3 0.06
#> virginica 0 3 47 0.06
Reading the output: the OOB error rate of 4% means that on average, 4% of observations were misclassified when predicted by the trees that did not include them in their bootstrap sample. This is an honest error estimate that does not require a separate validation set.
The confusion matrix shows the breakdown by class. Setosa is classified perfectly. Three versicolor flowers are misclassified as virginica, and three virginica flowers are misclassified as versicolor, giving a class error of 6% for each. These two species overlap in feature space, which is consistent with what the PCA and LDA plots show.
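Both figures can also be pulled off the fitted object directly, which is useful in scripts. These are documented components of a randomForest object: err.rate has one row per tree, so the last row holds the final OOB estimate shown in the printed output.

```r
# Confusion matrix (counts plus per-class error), as printed above
rf_model$confusion

# OOB error after all trees; first column of the last row of err.rate
rf_model$err.rate[nrow(rf_model$err.rate), "OOB"]
```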
Hyperparameters
Random forests have two main hyperparameters. A hyperparameter is a setting that controls the model structure and must be chosen before training rather than learned from the data.
ntree: the number of trees in the ensemble. Strictly speaking this is less of a hyperparameter than mtry because increasing it never hurts performance: more trees always produce a more stable model. The cost is computation time. 500 is a reasonable default; results rarely improve beyond 1000. You do not tune ntree with cross-validation; you simply set it high enough that results stop changing between runs.
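An easy way to check whether ntree is high enough is to plot the OOB error against the number of trees; the curve should flatten well before ntree is reached. The plot method for randomForest objects does this directly (here using the rf_model fitted above):

```r
# OOB and per-class error as a function of the number of trees
plot(rf_model, main = "OOB error vs. number of trees")
legend("topright", legend = colnames(rf_model$err.rate),
       col = 1:ncol(rf_model$err.rate), lty = 1:ncol(rf_model$err.rate))
```

If the curves are still trending downward at the right edge of the plot, increase ntree and refit.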
mtry: the number of predictors randomly considered at each split. This is the true hyperparameter of a random forest. Too low and individual trees are weak and predictions are noisy; too high and trees become correlated with each other, reducing the main advantage of the ensemble over a single tree. The default is the square root of the total number of predictors for classification. Tuning mtry with cross-validation is recommended and covered in the section below.
Variable Importance
One of the most useful outputs of a random forest is a ranking of predictor importance. Two measures are commonly reported:
Mean decrease in accuracy: how much does prediction accuracy drop when a variable is randomly permuted? Large drops indicate important variables.
Mean decrease in Gini impurity: how much does a variable reduce node impurity across all trees? Useful but biased toward variables with many categories or continuous variables.
# Plot variable importance
varImpPlot(rf_model,
main = "Variable Importance: Iris Species")
The plot shows two panels. The left panel ranks variables by mean decrease in accuracy: variables at the top cause the largest drop in OOB accuracy when permuted, indicating they are most important for correct classification. The right panel shows mean decrease in Gini impurity. For iris, both panels consistently rank Petal.Width and Petal.Length highest, confirming that petal dimensions drive species separation, consistent with what PCA loadings and LDA scaling show.
# Extract as a data frame
importance_df <- as.data.frame(importance(rf_model))
importance_df$Variable <- rownames(importance_df)
library(ggplot2)
ggplot(importance_df,
       aes(x = reorder(Variable, MeanDecreaseAccuracy),
           y = MeanDecreaseAccuracy)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = NULL, y = "Mean Decrease in Accuracy",
       title = "Random Forest Variable Importance") +
  theme_minimal()
OOB Error and Its Limitations
The OOB error is a convenient and honest estimate for the data the model was trained on. Each observation is predicted only by trees that were not trained on it, so there is no need for a separate validation set. For the iris dataset, an OOB error of 4% means the model misclassifies 4% of the flowers it was built from.
However, OOB error only reflects performance on observations drawn from the same population as the training data. If you collected new iris flowers from a different location, a different season, or measured by a different person, the true error could be higher. The model has learned the specific variation present in these 150 flowers. Biological populations vary, and a model trained on one field site, one cohort, or one laboratory may not generalise to another.
For a realistic estimate of how well the model will perform on truly new data, use a held-out test set drawn from the target population, or validate the model on an independent dataset collected separately from the training data.
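A minimal held-out test set looks like the following sketch. With only 150 observations the iris dataset is really too small to split comfortably, so treat this as a template for larger datasets; the 70/30 split ratio is a common convention, not a rule.

```r
library(randomForest)
data(iris)
set.seed(123)

# Hold out 30% of observations before fitting
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
rf_train  <- randomForest(Species ~ ., data = iris[train_idx, ],
                          ntree = 500)

# Evaluate only on observations the model never saw
test_pred <- predict(rf_train, iris[-train_idx, ])
mean(test_pred == iris$Species[-train_idx])  # held-out accuracy
```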
Cross-Validation with caret
Cross-validation gives a more realistic performance estimate than training accuracy alone, and is also the standard way to tune the mtry hyperparameter. In k-fold cross-validation the data are split into k equal folds; the model is trained on k-1 folds and evaluated on the held-out fold, repeating k times so every observation is used for evaluation exactly once. The caret package automates this and tries each candidate value of mtry across all folds, selecting the value that gives the best average accuracy.
library(caret)
set.seed(123)
cv_rf <- train(Species ~ .,
               data = iris,
               method = "rf",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = data.frame(mtry = c(1, 2, 3, 4)))
print(cv_rf)
#> Random Forest
#> 150 samples, 4 predictors, 3 classes
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#>
#> mtry Accuracy Kappa
#> 1 0.953 0.930
#> 2 0.960 0.940
#> 3 0.960 0.940
#> 4 0.947 0.920
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
Reading the output: the table shows cross-validated accuracy and Kappa for each value of mtry tested. Kappa accounts for agreement that would occur by chance alone; values above 0.8 indicate strong agreement between predicted and actual classes. Here mtry = 2 and mtry = 3 give the highest accuracy (96%), and caret selects mtry = 2 as the final model.
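The same information is available programmatically on the train object, which is handier than parsing the printed output when the tuning result feeds into later code. bestTune, results, and the plot method are standard components of caret train objects:

```r
cv_rf$bestTune   # the selected value of mtry
cv_rf$results    # accuracy and Kappa for every candidate value
plot(cv_rf)      # accuracy as a function of mtry
```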
# Confusion matrix from cross-validated predictions
confusionMatrix(predict(cv_rf, iris), iris$Species)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 47 3
#> virginica 0 3 47
#>
#> Overall Statistics
#> Accuracy : 0.96
#> Kappa : 0.94
Note that predict(cv_rf, iris) here predicts on the full training set, so this confusion matrix is slightly optimistic. The cross-validated accuracy in the print(cv_rf) table is the more reliable figure.
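For a confusion matrix based on the held-out fold predictions rather than the training data, caret provides a confusionMatrix method for train objects. Called on cv_rf directly, it averages the cross-validated cell counts and reports them as percentages of the dataset:

```r
# Cross-validated confusion matrix (cell percentages averaged
# across the 10 held-out folds)
confusionMatrix(cv_rf)
```

This is the honest counterpart of the optimistic matrix above.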
Biological Example: Microbiome Classification
Random forests handle the high-dimensional, compositional nature of microbiome data well, particularly when combined with CLR transformation. Variable importance scores identify which taxa best discriminate between groups.
library(randomForest)
library(microbiome)
library(phyloseq)
data(GlobalPatterns)
gp <- GlobalPatterns
# Define binary outcome: human vs. non-human
sample_data(gp)$is_human <- factor(ifelse(
  sample_data(gp)$SampleType %in% c("Feces", "Skin", "Tongue"),
  "Human", "Environment"))
# Filter and CLR-transform
gp_filtered <- filter_taxa(gp,
                           function(x) sum(x > 3) > (0.1 * length(x)), TRUE)
gp_clr <- microbiome::transform(gp_filtered, "clr")
otu_clr <- as.data.frame(t(otu_table(gp_clr)))
outcome <- sample_data(gp_clr)$is_human
# Fit random forest
set.seed(123)
rf_micro <- randomForest(x = otu_clr,
                         y = outcome,
                         ntree = 500,
                         importance = TRUE)
print(rf_micro)
#> Call:
#> randomForest(x = otu_clr, y = outcome, ntree = 500, importance = TRUE)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 20
#>
#> OOB estimate of error rate: 3.8%
#> Confusion matrix:
#> Environment Human class.error
#> Environment 17 1 0.056
#> Human 0 8 0.000
A low OOB error here means the model can reliably distinguish human-associated from environmental microbiome samples based on community composition. Human samples are classified perfectly; one environmental sample is misclassified as human.
# Top discriminant taxa
imp <- importance(rf_micro)
top_taxa <- rownames(imp)[order(imp[, "MeanDecreaseAccuracy"],
decreasing = TRUE)][1:20]
print(top_taxa)
# These are the OTU or taxon IDs with the highest mean decrease in accuracy.
# Cross-reference with your taxonomy table to identify which genera or families
# are most discriminant between human and environmental samples.
# Visualise
varImpPlot(rf_micro, n.var = 20,
main = "Top 20 Discriminant Taxa")
Random Forests vs LDA and Logistic Regression
| | Random Forest | LDA | Logistic Regression |
|---|---|---|---|
| Linear boundaries only | No | Yes | Yes |
| Handles interactions | Yes | No | Only if specified |
| Handles p >> n | Well | Poorly | With regularisation |
| Interpretability | Variable importance | Loadings | Odds ratios |
| Probabilistic output | Approximate | Yes | Yes |
| Assumption-free | Yes | No | Partially |
Random forests do not produce odds ratios or loadings. If you need to explain the direction and magnitude of individual predictor effects, logistic regression or LDA is more appropriate. Use random forests when prediction accuracy and variable ranking matter more than coefficient interpretability.
Common Pitfalls
Not setting a seed. Random forests are stochastic. Always use set.seed() before fitting to ensure reproducibility.
Ignoring class imbalance. With unbalanced classes, the majority class dominates predictions. Use the sampsize or classwt arguments to correct for this:
# Equal sampling from each class
# Equal sampling from each class; `data` and `outcome` are placeholders
# for your own data frame and binary factor column
rf_balanced <- randomForest(outcome ~ .,
                            data = data,
                            strata = data$outcome,
                            sampsize = rep(min(table(data$outcome)), 2),
                            ntree = 500)
Using importance scores without validation. Variable importance from a random forest trained on all data is optimistic. Compute importance within cross-validation folds for an honest estimate.
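One way to compute fold-wise importance is to loop over cross-validation folds yourself, fit a forest on each training portion, and average the importance scores. A sketch using caret's createFolds (fold count of 5 chosen arbitrarily):

```r
library(randomForest)
library(caret)
data(iris)
set.seed(123)

folds <- createFolds(iris$Species, k = 5)  # held-out indices per fold
imp_list <- lapply(folds, function(test_idx) {
  fit <- randomForest(Species ~ ., data = iris[-test_idx, ],
                      ntree = 500, importance = TRUE)
  importance(fit)[, "MeanDecreaseAccuracy"]
})
# Average importance across folds; rankings that are stable across
# folds are more trustworthy than a single all-data ranking
sort(rowMeans(do.call(cbind, imp_list)), decreasing = TRUE)
```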
Treating importance as effect size. A high importance score means a variable is useful for discrimination, not that it has a large or consistent biological effect. Follow up important variables with targeted statistical tests.
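For example, a highly ranked continuous predictor can be followed up with a conventional group comparison to quantify the effect itself, rather than just its usefulness for splitting:

```r
# Targeted follow-up on a top-ranked variable: does Petal.Width
# actually differ between species, and by how much?
summary(aov(Petal.Width ~ Species, data = iris))
```

Effect sizes, confidence intervals, or pairwise comparisons can then be reported alongside the importance ranking.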
Exercise
Using the iris dataset:
- Fit a random forest classifier with 500 trees
- Report the OOB error and confusion matrix
- Plot variable importance and identify the two most important predictors
- Compare OOB accuracy with the cross-validated LDA accuracy from the LDA exercise
Solution
library(randomForest)
data(iris)
set.seed(123)
# 1. Fit
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500, importance = TRUE)
# 2. OOB error and confusion matrix
print(rf)
# 3. Variable importance
varImpPlot(rf, main = "Iris Variable Importance")
# Petal.Width and Petal.Length are consistently most important
# 4. Compare with LDA
library(MASS)
lda_cv <- lda(Species ~ ., data = iris, CV = TRUE)
cat("LDA LOOCV accuracy:", mean(lda_cv$class == iris$Species), "\n")
cat("RF OOB accuracy: ", 1 - rf$err.rate[500, "OOB"], "\n")
# Both typically around 96-98% on iris