Theory & Mathematical Foundation

Introduction

This section provides the mathematical foundations of MARS, based entirely on:

Friedman, J.H. (1991). “Multivariate Adaptive Regression Splines.” The Annals of Statistics, 19(1), 1–67.

Every equation, algorithm, and formula in this documentation maps directly to the original paper. Implementation details are noted where applicable.

Multivariate Adaptive Regression Splines

MARS is a nonparametric regression method that learns piecewise-linear (or piecewise-cubic) functions through adaptive knot placement.

The model has the form:

\[\hat{f}(x) = \sum_{m=1}^{M} c_m B_m(x)\]

where: - \(M\) = number of basis functions - \(c_m\) = fitted coefficients - \(B_m(x)\) = basis functions

Basis Functions

Linear Hinges (Friedman 1991, Eq. 19)

Each basis function is a product of univariate hinge functions:

\[B_m(x) = \prod_{k=1}^{K_m} h_k(x)\]

where each hinge function \(h_k(x)\) has the form:

\[h(x_j, t, d) = \max(0, d(x_j - t))\]

with: - \(x_j\) = \(j\)-th predictor - \(t\) = knot location - \(d \in \{-1, +1\}\) = direction indicator - \((z)_+ = \max(0, z)\) = positive part function

Interpretation: - \(h(x_j, +, t) = (x_j - t)_+\) = ReLU on right side of knot - \(h(x_j, -, t) = (t - x_j)_+\) = ReLU on left side of knot

The product creates a “roof function” that is zero outside \([t_1, t_2]\).

Cubic Splines (Friedman 1991, Sec. 3.7)

For smooth continuity, replace hinges with cubic basis:

\[h_{\text{cubic}}(x_j, t, d) = (d(x_j - t))_+^3\]

This gives continuous second derivatives.

Interaction Terms

Products of hinges naturally create interaction terms:

\[B_m(x) = h(x_i, t_1, d_1) \cdot h(x_j, t_2, d_2)\]

This is a 2-way interaction between variables \(i\) and \(j\).

max_degree parameter limits interaction order: - max_degree=1: Only main effects - max_degree=2: Up to 2-way interactions - max_degree=k: Up to k-way interactions

Knot Selection

Minspan Formula (Friedman 1991, Eq. 43)

Minimum observations between knots:

\[L = \left\lfloor -\frac{\log_2(\alpha / N)}{2.5} \right\rfloor\]

where: - \(N\) = sample size - \(\alpha\) = significance level (default 0.05)

Interpretation: Enforces a minimum gap between knots to avoid overfitting.

Endspan Formula (Friedman 1991, Eq. 45)

Minimum observations from domain boundaries:

\[L_e = \left\lceil 3 - \log_2(\alpha / N) \right\rceil\]

Interpretation: Prevents knots very close to the endpoints of the feature range.

The Forward Pass

Algorithm Overview

The forward pass is a greedy basis expansion algorithm (Friedman 1991, Algorithm 2):

Input: Data \((x_i, y_i)\), \(i=1,\ldots,N\)

Output: Set of basis functions \(\mathcal{B}\)

Algorithm:

Initialize: B = {1}  (constant term)

For m = 1 to M_max:
    For each basis function b in B:
        For each variable j in 1..d:
            For each observation i in valid knots (respecting minspan/endspan):
                For each direction d in {+1, -1}:
                    Fit pair: h^+ = h(x_j, t_i, +) * b
                              h^- = h(x_j, t_i, -) * b
                    Compute residuals
                    Calculate RSS reduction

    Add pair with maximum RSS reduction to B

Return B

Key Points:

Greedy selection: Always pick the pair reducing RSS most
Pairwise expansion: Adds two basis functions per iteration
Knot respects constraints: minspan and endspan enforced
Continuation: Continues until \(M_{\max}\) basis functions

Least Squares Solve

At each iteration, solve:

\[\hat{c} = \arg\min_c \sum_{i=1}^{N} \left(y_i - \sum_m c_m B_m(x_i)\right)^2\]

Using QR decomposition → Cholesky → pseudoinverse chain (see Advanced Topics).

The Backward Pass

Generalized Cross-Validation (GCV)

After forward pass, prune basis functions using GCV (Friedman 1991, Eq. 30):

\[\text{GCV}(M) = \frac{\text{RSS}(M)}{N(1 - C(M)/N)^2}\]

where complexity penalty is:

\[C(M) = \text{trace}(B(B^T B)^{-1}B^T) + dM\]

with: - \(B\) = \(N \times M\) design matrix - \(d\) = penalty per basis function (typically 2–3)

Interpretation:

Numerator: Residual sum of squares (fit quality)
Denominator: Complexity adjustment (penalizes more terms)
Lower GCV = better model selection

Backward Pruning Algorithm

Algorithm (Friedman 1991, Algorithm 3):

Start: B = set of all basis functions from forward pass

While |B| > 1:
    For each basis function b in B:
        Remove b from B
        Fit model
        Compute GCV(B)
        Add b back

    b* = basis function whose removal minimizes GCV
    Remove b* from B

    If GCV(B) not improved:
        Restore all removed terms
        Break

Return final B

Result: A sequence of nested models with decreasing complexity.

Model Selection: Choose model with minimum GCV score.

GCV Complexity Analysis

Design Matrix

Let \(B\) be the \(N \times M\) design matrix where each column is a basis function.

Effective Degrees of Freedom

The trace term in Eq. 32:

\[\text{df}(M) = \text{trace}(B(B^T B)^{-1}B^T)\]

represents the effective degrees of freedom.

This is between \(M\) (if columns orthogonal) and \(MN\) (saturated).

Penalty Parameter

The GCV penalty \(d\) controls pruning aggressiveness:

\(d = 2\) (additive models): Less aggressive pruning
\(d = 3\) (general models): Standard pruning
\(d = 4\) (conservative): Very aggressive pruning

Effect: Higher \(d\) → simpler models (more pruning).

Model Equations Summary

Fitted Model:

\[\hat{f}(x) = c_0 + \sum_{m=1}^{M} c_m \prod_{k=1}^{K_m} h_k(x)\]

Loss Function (RSS):

\[\text{RSS} = \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2\]

GCV Selection Criterion:

\[\text{GCV} = \frac{\text{RSS}}{N(1 - C(M)/N)^2}\]

Complexity Penalty:

\[C(M) = \text{trace}(B(B^T B)^{-1}B^T) + dM\]

Cubic Spline Extension

Cubic Basis (Friedman 1991, Sec. 3.7)

For smoother models with continuous 2nd derivatives, use cubic hinges:

\[h_{\text{cubic}}(x_j, t, d) = [d(x_j - t)]_+^3\]

This replaces linear hinges in the basis expansion.

Continuity Properties:

\(C^0\) continuity: Guaranteed (hinges are continuous)
\(C^1\) continuity: Guaranteed (first derivative continuous)
\(C^2\) continuity: Guaranteed (second derivative continuous)

Motivation: Cubic bases are smoother but still interpretable, avoiding overly wiggly functions.

Knot Placement (Cubic)

When converting to cubic, place “side knots” to ensure cubic smoothness:

For a hinge at knot \(t\), place side knots at:

\[ \begin{align}\begin{aligned}t^- = \text{median}(x_j : x_j < t)\\t^+ = \text{median}(x_j : x_j > t)\end{aligned}\end{align} \]

This creates a cubic “power basis” with \(C^2\) continuity everywhere.

The cubic coefficient becomes:

\[r^+ = \frac{2}{(t^+ - t^-)^3}\]

(Friedman 1991, Eq. 34–35)

Numerical Implementation

Standardization

Features are standardized to mean 0, variance 1:

\[x_j^{(std)} = \frac{x_j - \bar{x}_j}{s_j}\]

where \(\bar{x}_j\) and \(s_j\) are sample mean and standard deviation.

Benefits: - Improves numerical stability - Makes knot locations comparable across scales - Makes coefficients interpretable (standardized effect sizes)

Knot Validity

A knot at \(t\) for variable \(j\) is valid if:

\[\#\{i : x_{ij} < t\} \geq L \quad \text{and} \quad \#\{i : x_{ij} > t\} \geq L\]

where \(L\) is minspan.

Similarly, endspan constraints ensure sufficient observations near domain boundaries.

Condition Number Management

To avoid singular least-squares problems:

Standardize features (default)
Use pseudoinverse when condition number is high
Avoid collinear basis functions (model pruning helps)

Connection to Paper Sections

Section	Topic	Formula/Algorithm
3.1	Recursive Partitioning	Basis function formulation
3.2	Continuity (q=1)	Linear hinge definition
3.3	Generalization	Interaction terms
3.4	MARS Algorithm	Algorithms 1, 2, 3
3.5	ANOVA	Decomposition analysis
3.6	GCV	Equations 30–32
3.7	Cubic Continuity	Equations 34–35
3.8	Knot Optimization	Minspan, Endspan
3.9	Computational	Efficiency strategies

References

The complete references are in References.

Key citations: - Friedman (1991) – Original MARS paper - Friedman & Silverman (1989) – Flexible smoothing methods

Notation Reference

Symbol	Meaning
\(N\)	Number of observations
\(d\)	Number of features
\(M\)	Number of basis functions
\(x_i\)	\(i\)-th observation (vector)
\(y_i\)	\(i\)-th target value
\(B_m(x)\)	\(m\)-th basis function
\(c_m\)	Coefficient for basis \(m\)
\(t\)	Knot location
\(d\)	Direction (+1 or -1)
\(h(x, t, d)\)	Hinge function
\(\alpha\)	Significance level
\(L\)	Minspan
\(L_e\)	Endspan