Theory & Mathematical Foundation
Introduction
This section provides the mathematical foundations of MARS, based entirely on:
Friedman, J.H. (1991). “Multivariate Adaptive Regression Splines.” The Annals of Statistics, 19(1), 1–67.
Every equation, algorithm, and formula in this documentation maps directly to the original paper. Implementation details are noted where applicable.
Multivariate Adaptive Regression Splines
MARS is a nonparametric regression method that learns piecewise-linear (or piecewise-cubic) functions through adaptive knot placement.
The model has the form:
where: - \(M\) = number of basis functions - \(c_m\) = fitted coefficients - \(B_m(x)\) = basis functions
Basis Functions
Linear Hinges (Friedman 1991, Eq. 19)
Each basis function is a product of univariate hinge functions:
where each hinge function \(h_k(x)\) has the form:
with: - \(x_j\) = \(j\)-th predictor - \(t\) = knot location - \(d \in \{-1, +1\}\) = direction indicator - \((z)_+ = \max(0, z)\) = positive part function
Interpretation: - \(h(x_j, +, t) = (x_j - t)_+\) = ReLU on right side of knot - \(h(x_j, -, t) = (t - x_j)_+\) = ReLU on left side of knot
The product creates a “roof function” that is zero outside \([t_1, t_2]\).
Cubic Splines (Friedman 1991, Sec. 3.7)
For smooth continuity, replace hinges with cubic basis:
This gives continuous second derivatives.
Interaction Terms
Products of hinges naturally create interaction terms:
This is a 2-way interaction between variables \(i\) and \(j\).
max_degree parameter limits interaction order: - max_degree=1: Only main effects - max_degree=2: Up to 2-way interactions - max_degree=k: Up to k-way interactions
Knot Selection
Minspan Formula (Friedman 1991, Eq. 43)
Minimum observations between knots:
where: - \(N\) = sample size - \(\alpha\) = significance level (default 0.05)
Interpretation: Enforces a minimum gap between knots to avoid overfitting.
Endspan Formula (Friedman 1991, Eq. 45)
Minimum observations from domain boundaries:
Interpretation: Prevents knots very close to the endpoints of the feature range.
The Forward Pass
Algorithm Overview
The forward pass is a greedy basis expansion algorithm (Friedman 1991, Algorithm 2):
Input: Data \((x_i, y_i)\), \(i=1,\ldots,N\)
Output: Set of basis functions \(\mathcal{B}\)
Algorithm:
Initialize: B = {1} (constant term)
For m = 1 to M_max:
For each basis function b in B:
For each variable j in 1..d:
For each observation i in valid knots (respecting minspan/endspan):
For each direction d in {+1, -1}:
Fit pair: h^+ = h(x_j, t_i, +) * b
h^- = h(x_j, t_i, -) * b
Compute residuals
Calculate RSS reduction
Add pair with maximum RSS reduction to B
Return B
Key Points:
Greedy selection: Always pick the pair reducing RSS most
Pairwise expansion: Adds two basis functions per iteration
Knot respects constraints: minspan and endspan enforced
Continuation: Continues until \(M_{\max}\) basis functions
Least Squares Solve
At each iteration, solve:
Using QR decomposition → Cholesky → pseudoinverse chain (see Advanced Topics).
The Backward Pass
Generalized Cross-Validation (GCV)
After forward pass, prune basis functions using GCV (Friedman 1991, Eq. 30):
where complexity penalty is:
with: - \(B\) = \(N \times M\) design matrix - \(d\) = penalty per basis function (typically 2–3)
Interpretation:
Numerator: Residual sum of squares (fit quality)
Denominator: Complexity adjustment (penalizes more terms)
Lower GCV = better model selection
Backward Pruning Algorithm
Algorithm (Friedman 1991, Algorithm 3):
Start: B = set of all basis functions from forward pass
While |B| > 1:
For each basis function b in B:
Remove b from B
Fit model
Compute GCV(B)
Add b back
b* = basis function whose removal minimizes GCV
Remove b* from B
If GCV(B) not improved:
Restore all removed terms
Break
Return final B
Result: A sequence of nested models with decreasing complexity.
Model Selection: Choose model with minimum GCV score.
GCV Complexity Analysis
Design Matrix
Let \(B\) be the \(N \times M\) design matrix where each column is a basis function.
Effective Degrees of Freedom
The trace term in Eq. 32:
represents the effective degrees of freedom.
This is between \(M\) (if columns orthogonal) and \(MN\) (saturated).
Penalty Parameter
The GCV penalty \(d\) controls pruning aggressiveness:
\(d = 2\) (additive models): Less aggressive pruning
\(d = 3\) (general models): Standard pruning
\(d = 4\) (conservative): Very aggressive pruning
Effect: Higher \(d\) → simpler models (more pruning).
Model Equations Summary
Fitted Model:
Loss Function (RSS):
GCV Selection Criterion:
Complexity Penalty:
Cubic Spline Extension
Cubic Basis (Friedman 1991, Sec. 3.7)
For smoother models with continuous 2nd derivatives, use cubic hinges:
This replaces linear hinges in the basis expansion.
Continuity Properties:
\(C^0\) continuity: Guaranteed (hinges are continuous)
\(C^1\) continuity: Guaranteed (first derivative continuous)
\(C^2\) continuity: Guaranteed (second derivative continuous)
Motivation: Cubic bases are smoother but still interpretable, avoiding overly wiggly functions.
Knot Placement (Cubic)
When converting to cubic, place “side knots” to ensure cubic smoothness:
For a hinge at knot \(t\), place side knots at:
This creates a cubic “power basis” with \(C^2\) continuity everywhere.
The cubic coefficient becomes:
(Friedman 1991, Eq. 34–35)
Numerical Implementation
Standardization
Features are standardized to mean 0, variance 1:
where \(\bar{x}_j\) and \(s_j\) are sample mean and standard deviation.
Benefits: - Improves numerical stability - Makes knot locations comparable across scales - Makes coefficients interpretable (standardized effect sizes)
Knot Validity
A knot at \(t\) for variable \(j\) is valid if:
where \(L\) is minspan.
Similarly, endspan constraints ensure sufficient observations near domain boundaries.
Condition Number Management
To avoid singular least-squares problems:
Standardize features (default)
Use pseudoinverse when condition number is high
Avoid collinear basis functions (model pruning helps)
Connection to Paper Sections
Section |
Topic |
Formula/Algorithm |
|---|---|---|
3.1 |
Recursive Partitioning |
Basis function formulation |
3.2 |
Continuity (q=1) |
Linear hinge definition |
3.3 |
Generalization |
Interaction terms |
3.4 |
MARS Algorithm |
Algorithms 1, 2, 3 |
3.5 |
ANOVA |
Decomposition analysis |
3.6 |
GCV |
Equations 30–32 |
3.7 |
Cubic Continuity |
Equations 34–35 |
3.8 |
Knot Optimization |
Minspan, Endspan |
3.9 |
Computational |
Efficiency strategies |
References
The complete references are in References.
Key citations: - Friedman (1991) – Original MARS paper - Friedman & Silverman (1989) – Flexible smoothing methods
Notation Reference
Symbol |
Meaning |
|---|---|
\(N\) |
Number of observations |
\(d\) |
Number of features |
\(M\) |
Number of basis functions |
\(x_i\) |
\(i\)-th observation (vector) |
\(y_i\) |
\(i\)-th target value |
\(B_m(x)\) |
\(m\)-th basis function |
\(c_m\) |
Coefficient for basis \(m\) |
\(t\) |
Knot location |
\(d\) |
Direction (+1 or -1) |
\(h(x, t, d)\) |
Hinge function |
\(\alpha\) |
Significance level |
\(L\) |
Minspan |
\(L_e\) |
Endspan |