Forward Selection of Predictors
When fitting a constrained ordination (RDA, CCA, or db-RDA) you must decide which environmental predictors to include. Including too many predictors inflates the explained variance artificially and risks overfitting: the model fits noise rather than real ecological signal.
Forward selection is a stepwise procedure that adds predictors one at a time, retaining only those that contribute significantly and meaningfully to the model.
Why Forward Selection?
Constrained ordination has a hard structural requirement: you need more sites than predictors. But even well below that ceiling, unnecessary predictors inflate R² and make ordination diagrams harder to interpret.
Forward selection addresses two problems simultaneously:
- Overfitting: each added predictor explains some variance by chance alone
- Collinearity: correlated predictors carry redundant information; including both wastes degrees of freedom
R² is not a reliable guide on its own
In constrained ordination, R² never decreases when you add a predictor, even one that explains nothing meaningful. Compare models with adjusted R² (R²adj) instead, which penalises the number of predictors relative to the number of sites.
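A small simulation makes the point concrete. This is a minimal sketch in base R, not part of the vegan workflow: the matrices `Y` and `X` are pure random noise invented for illustration, and R² is computed as the ratio of fitted to total sum of squares, which is how RDA defines it.

```r
set.seed(42)
n <- 20                                              # number of sites
Y <- scale(matrix(rnorm(n * 5), n), scale = FALSE)   # centred noise "community" matrix (hypothetical)
X <- matrix(rnorm(n * 10), n)                        # 10 pure-noise predictors (hypothetical)

# R2 of a multivariate linear model using the first m noise predictors
r2_for <- function(m) {
  fit <- lm(Y ~ X[, seq_len(m), drop = FALSE])
  sum(fitted(fit)^2) / sum(Y^2)
}

m     <- 1:10
r2    <- sapply(m, r2_for)
r2adj <- 1 - (1 - r2) * (n - 1) / (n - m - 1)        # Ezekiel's adjustment

round(data.frame(m, r2, r2adj), 3)
```

Raw R² can only go up as noise predictors are added, while R²adj always sits below it and has no such built-in upward drift.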
The Blanchet et al. (2008) Procedure
The standard approach in community ecology is the two-step procedure of Blanchet, Legendre & Borcard (2008), implemented in vegan as ordiR2step().
Step 1: Global test
Run a permutation test on the full model. If the full model is not significant, stop: without a significant global relationship, any predictor that appears significant on its own is likely a false positive, so forward selection should not proceed.
Step 2: Stepwise addition
Starting from an intercept-only model, add the predictor that maximises R²adj at each step, subject to two stopping criteria applied simultaneously:
| Criterion | Rule |
|---|---|
| Significance | Predictor must be significant by permutation test (p ≤ 0.05) |
| R²adj ceiling | Model R²adj must not exceed the full-model R²adj |
The procedure stops as soon as either criterion is violated.
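The two criteria in the table can be written as a single check on the best candidate at each step. The helper below is a hypothetical illustration, not a vegan function, and the p-values and R²adj values passed to it are made up.

```r
# Hypothetical helper, not a vegan function: applies the two criteria
# from the table to the best candidate at one step.
may_enter <- function(p_value, r2adj_candidate, r2adj_full, alpha = 0.05) {
  significant   <- p_value <= alpha
  below_ceiling <- r2adj_candidate <= r2adj_full
  significant && below_ceiling
}

may_enter(p_value = 0.003, r2adj_candidate = 0.29, r2adj_full = 0.32)  # TRUE
may_enter(p_value = 0.218, r2adj_candidate = 0.30, r2adj_full = 0.32)  # FALSE: not significant
may_enter(p_value = 0.010, r2adj_candidate = 0.35, r2adj_full = 0.32)  # FALSE: exceeds the ceiling
```

As soon as the check returns FALSE for the best candidate, selection ends and the model from the previous step is kept.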
Why the R²adj ceiling?
Without it, the procedure can keep adding predictors that are individually significant but collectively push the model beyond what the full set of variables can explain. The ceiling anchors selection to the realistic maximum.
Implementation in R
The example below uses the dune community matrix and dune.env environmental table from the vegan package.
library(vegan)
data(dune)
data(dune.env)
# Hellinger-transform the community matrix
dune_hel <- decostand(dune, method = "hellinger")
# Full model: all predictors
rda_full <- rda(dune_hel ~ ., data = dune.env)
# Global permutation test
anova(rda_full, permutations = 999)
# Proceed only if the global test is significant (p <= 0.05)
# Null (intercept-only) model
rda_null <- rda(dune_hel ~ 1, data = dune.env)
# Forward selection
rda_fwd <- ordiR2step(
  object = rda_null,
  scope = formula(rda_full),
  direction = "forward",
  R2scope = TRUE,        # enforce R²adj ceiling
  permutations = 999
)
# Inspect selected model
rda_fwd$call
RsquareAdj(rda_fwd)
anova(rda_fwd, permutations = 999)
The db-RDA variant follows the same steps, starting from a Bray-Curtis distance matrix:
library(vegan)
data(dune)
data(dune.env)
# Bray-Curtis distance matrix
dist_bc <- vegdist(dune, method = "bray")
# Full db-RDA model
dbrda_full <- dbrda(dist_bc ~ ., data = dune.env)
# Global permutation test
anova(dbrda_full, permutations = 999)
# Null model
dbrda_null <- dbrda(dist_bc ~ 1, data = dune.env)
# Forward selection
dbrda_fwd <- ordiR2step(
  object = dbrda_null,
  scope = formula(dbrda_full),
  direction = "forward",
  R2scope = TRUE,        # enforce R²adj ceiling
  permutations = 999
)
dbrda_fwd$call
RsquareAdj(dbrda_fwd)
Reading the Output
ordiR2step() prints a selection table at each step. Here is a schematic example:
             R2.adj Df   AIC     F Pr(>F)
<none>       0.0000
+ Management 0.2341  3 -12.4 3.112  0.001 ***
+ A1         0.2891  1 -14.1 2.803  0.003 **
+ Moisture   0.3102  1 -14.6 1.951  0.041 *
+ Use        0.3041  2 -13.8 1.204  0.218
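At each step, the procedure takes the candidate that gives the highest R²adj and tests it by permutation. Here, Use is not added because p = 0.218 > 0.05, so selection stops with Management, A1 and Moisture in the model. A rejected candidate ends the search; the procedure does not move on to the next-best predictor.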
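The selection summary can also be retrieved after the run instead of being read off the console. The sketch below assumes the dune RDA workflow from the previous section. ordiR2step() stores a summary of the accepted steps in the `anova` component of the returned model. The seed value is arbitrary and only makes the permutation p-values reproducible.

```r
library(vegan)
data(dune)
data(dune.env)

dune_hel <- decostand(dune, method = "hellinger")
rda_full <- rda(dune_hel ~ ., data = dune.env)
rda_null <- rda(dune_hel ~ 1, data = dune.env)

set.seed(1)  # arbitrary seed so permutation p-values are reproducible
rda_fwd <- ordiR2step(rda_null, scope = formula(rda_full),
                      direction = "forward", R2scope = TRUE,
                      permutations = 999, trace = FALSE)

rda_fwd$anova                        # summary of the accepted steps
RsquareAdj(rda_full)$adj.r.squared   # the ceiling used when R2scope = TRUE
```

Comparing the last R²adj in the table with the full-model value shows how close the reduced model comes to the ceiling.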
Check VIF after selection
Even after forward selection, collinearity among retained predictors can be a problem. Compute variance inflation factors on the selected model:
vif.cca(dbrda_fwd)
Values above 10 (liberal) or 5 (conservative) warrant further scrutiny.
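For quantitative predictors, a VIF is 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The base-R sketch below illustrates that definition on simulated data. The variables `x1`, `x2`, `x3` and the helper `vif_manual()` are hypothetical; vif.cca() remains the function to use on a fitted ordination.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
x3 <- rnorm(n)                  # unrelated to x1 and x2
env <- data.frame(x1, x2, x3)

# Hypothetical helper: VIF of each column from its R2 against the others
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    f  <- reformulate(setdiff(names(df), v), response = v)
    r2 <- summary(lm(f, data = df))$r.squared
    1 / (1 - r2)
  })
}

round(vif_manual(env), 1)
```

The two near-duplicate predictors get very large VIFs, while the independent one stays close to 1.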
Common Pitfalls
Don't skip the global test
Running ordiR2step() without first confirming global model significance can produce spurious selections. The function does not run the global test for you, so it will carry out the selection even when the full model is not significant.
R²adj can be negative
For very weak models, adjusted R² can dip below zero. This does not mean the model is broken. It simply means the predictors explain less than expected by chance at that sample size.
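A quick calculation shows how this happens. The sketch assumes Ezekiel's adjustment, R²adj = 1 - (1 - R²)(n - 1)/(n - m - 1) for n sites and m predictor degrees of freedom, which is what RsquareAdj() applies to RDA models. `adj_r2()` is a hypothetical helper and the input values are made up.

```r
# Hypothetical helper implementing Ezekiel's adjustment
adj_r2 <- function(r2, n, m) 1 - (1 - r2) * (n - 1) / (n - m - 1)

n <- 20; m <- 4
m / (n - 1)          # R2 expected from m random predictors: about 0.21
adj_r2(0.10, n, m)   # raw R2 below that expectation gives -0.14
adj_r2(0.40, n, m)   # raw R2 above it gives 0.24
```

R²adj turns negative whenever the raw R² falls below m / (n - 1), the value random predictors would reach on average.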
Forward selection is not best-subset selection
Forward selection finds a parsimonious model, not necessarily the best one. The search is greedy: each choice is conditional on the predictors already in the model, so when predictors are correlated, one can stand in for another and small changes in the data can yield a different final set. Treat the result as a practical reduction tool, not a definitive ranking of predictor importance.
Summary
| Step | Action |
|---|---|
| 1 | Fit full model, run global permutation test |
| 2 | If significant, run ordiR2step() from null model |
| 3 | Retain predictors that are significant and keep model R²adj at or below the full-model ceiling |
| 4 | Check VIF on selected model |
| 5 | Use selected predictors in final ordination and plot |
References
- Blanchet, F.G., Legendre, P. & Borcard, D. (2008). Forward selection of explanatory variables. Ecology, 89, 2623–2632.
- Oksanen, J. et al. (2024). vegan: Community Ecology Package. R package. https://CRAN.R-project.org/package=vegan