Theory & Mathematical Foundation

Introduction

This section provides the mathematical foundations of MARS, based entirely on:

Friedman, J.H. (1991). “Multivariate Adaptive Regression Splines.” The Annals of Statistics, 19(1), 1–67.

Every equation, algorithm, and formula in this documentation maps directly to the original paper. Implementation details are noted where applicable.

Multivariate Adaptive Regression Splines

MARS is a nonparametric regression method that learns piecewise-linear (or piecewise-cubic) functions through adaptive knot placement.

The model has the form:

\[\hat{f}(x) = \sum_{m=1}^{M} c_m B_m(x)\]

where: - \(M\) = number of basis functions - \(c_m\) = fitted coefficients - \(B_m(x)\) = basis functions

Basis Functions

Linear Hinges (Friedman 1991, Eq. 19)

Each basis function is a product of univariate hinge functions:

\[B_m(x) = \prod_{k=1}^{K_m} h_k(x)\]

where each hinge function \(h_k(x)\) has the form:

\[h(x_j, t, d) = \max(0, d(x_j - t))\]

with: - \(x_j\) = \(j\)-th predictor - \(t\) = knot location - \(d \in \{-1, +1\}\) = direction indicator - \((z)_+ = \max(0, z)\) = positive part function

Interpretation: - \(h(x_j, +, t) = (x_j - t)_+\) = ReLU on right side of knot - \(h(x_j, -, t) = (t - x_j)_+\) = ReLU on left side of knot

The product creates a “roof function” that is zero outside \([t_1, t_2]\).

Cubic Splines (Friedman 1991, Sec. 3.7)

For smooth continuity, replace hinges with cubic basis:

\[h_{\text{cubic}}(x_j, t, d) = (d(x_j - t))_+^3\]

This gives continuous second derivatives.

Interaction Terms

Products of hinges naturally create interaction terms:

\[B_m(x) = h(x_i, t_1, d_1) \cdot h(x_j, t_2, d_2)\]

This is a 2-way interaction between variables \(i\) and \(j\).

max_degree parameter limits interaction order: - max_degree=1: Only main effects - max_degree=2: Up to 2-way interactions - max_degree=k: Up to k-way interactions

Knot Selection

Minspan Formula (Friedman 1991, Eq. 43)

Minimum observations between knots:

\[L = \left\lfloor -\frac{\log_2(\alpha / N)}{2.5} \right\rfloor\]

where: - \(N\) = sample size - \(\alpha\) = significance level (default 0.05)

Interpretation: Enforces a minimum gap between knots to avoid overfitting.

Endspan Formula (Friedman 1991, Eq. 45)

Minimum observations from domain boundaries:

\[L_e = \left\lceil 3 - \log_2(\alpha / N) \right\rceil\]

Interpretation: Prevents knots very close to the endpoints of the feature range.

The Forward Pass

Algorithm Overview

The forward pass is a greedy basis expansion algorithm (Friedman 1991, Algorithm 2):

Input: Data \((x_i, y_i)\), \(i=1,\ldots,N\)

Output: Set of basis functions \(\mathcal{B}\)

Algorithm:

Initialize: B = {1}  (constant term)

For m = 1 to M_max:
    For each basis function b in B:
        For each variable j in 1..d:
            For each observation i in valid knots (respecting minspan/endspan):
                For each direction d in {+1, -1}:
                    Fit pair: h^+ = h(x_j, t_i, +) * b
                              h^- = h(x_j, t_i, -) * b
                    Compute residuals
                    Calculate RSS reduction

    Add pair with maximum RSS reduction to B

Return B

Key Points:

  1. Greedy selection: Always pick the pair reducing RSS most

  2. Pairwise expansion: Adds two basis functions per iteration

  3. Knot respects constraints: minspan and endspan enforced

  4. Continuation: Continues until \(M_{\max}\) basis functions

Least Squares Solve

At each iteration, solve:

\[\hat{c} = \arg\min_c \sum_{i=1}^{N} \left(y_i - \sum_m c_m B_m(x_i)\right)^2\]

Using QR decomposition → Cholesky → pseudoinverse chain (see Advanced Topics).

The Backward Pass

Generalized Cross-Validation (GCV)

After forward pass, prune basis functions using GCV (Friedman 1991, Eq. 30):

\[\text{GCV}(M) = \frac{\text{RSS}(M)}{N(1 - C(M)/N)^2}\]

where complexity penalty is:

\[C(M) = \text{trace}(B(B^T B)^{-1}B^T) + dM\]

with: - \(B\) = \(N \times M\) design matrix - \(d\) = penalty per basis function (typically 2–3)

Interpretation:

  • Numerator: Residual sum of squares (fit quality)

  • Denominator: Complexity adjustment (penalizes more terms)

  • Lower GCV = better model selection

Backward Pruning Algorithm

Algorithm (Friedman 1991, Algorithm 3):

Start: B = set of all basis functions from forward pass

While |B| > 1:
    For each basis function b in B:
        Remove b from B
        Fit model
        Compute GCV(B)
        Add b back

    b* = basis function whose removal minimizes GCV
    Remove b* from B

    If GCV(B) not improved:
        Restore all removed terms
        Break

Return final B

Result: A sequence of nested models with decreasing complexity.

Model Selection: Choose model with minimum GCV score.

GCV Complexity Analysis

Design Matrix

Let \(B\) be the \(N \times M\) design matrix where each column is a basis function.

Effective Degrees of Freedom

The trace term in Eq. 32:

\[\text{df}(M) = \text{trace}(B(B^T B)^{-1}B^T)\]

represents the effective degrees of freedom.

This is between \(M\) (if columns orthogonal) and \(MN\) (saturated).

Penalty Parameter

The GCV penalty \(d\) controls pruning aggressiveness:

  • \(d = 2\) (additive models): Less aggressive pruning

  • \(d = 3\) (general models): Standard pruning

  • \(d = 4\) (conservative): Very aggressive pruning

Effect: Higher \(d\) → simpler models (more pruning).

Model Equations Summary

Fitted Model:

\[\hat{f}(x) = c_0 + \sum_{m=1}^{M} c_m \prod_{k=1}^{K_m} h_k(x)\]

Loss Function (RSS):

\[\text{RSS} = \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2\]

GCV Selection Criterion:

\[\text{GCV} = \frac{\text{RSS}}{N(1 - C(M)/N)^2}\]

Complexity Penalty:

\[C(M) = \text{trace}(B(B^T B)^{-1}B^T) + dM\]

Cubic Spline Extension

Cubic Basis (Friedman 1991, Sec. 3.7)

For smoother models with continuous 2nd derivatives, use cubic hinges:

\[h_{\text{cubic}}(x_j, t, d) = [d(x_j - t)]_+^3\]

This replaces linear hinges in the basis expansion.

Continuity Properties:

  • \(C^0\) continuity: Guaranteed (hinges are continuous)

  • \(C^1\) continuity: Guaranteed (first derivative continuous)

  • \(C^2\) continuity: Guaranteed (second derivative continuous)

Motivation: Cubic bases are smoother but still interpretable, avoiding overly wiggly functions.

Knot Placement (Cubic)

When converting to cubic, place “side knots” to ensure cubic smoothness:

For a hinge at knot \(t\), place side knots at:

\[ \begin{align}\begin{aligned}t^- = \text{median}(x_j : x_j < t)\\t^+ = \text{median}(x_j : x_j > t)\end{aligned}\end{align} \]

This creates a cubic “power basis” with \(C^2\) continuity everywhere.

The cubic coefficient becomes:

\[r^+ = \frac{2}{(t^+ - t^-)^3}\]

(Friedman 1991, Eq. 34–35)

Numerical Implementation

Standardization

Features are standardized to mean 0, variance 1:

\[x_j^{(std)} = \frac{x_j - \bar{x}_j}{s_j}\]

where \(\bar{x}_j\) and \(s_j\) are sample mean and standard deviation.

Benefits: - Improves numerical stability - Makes knot locations comparable across scales - Makes coefficients interpretable (standardized effect sizes)

Knot Validity

A knot at \(t\) for variable \(j\) is valid if:

\[\#\{i : x_{ij} < t\} \geq L \quad \text{and} \quad \#\{i : x_{ij} > t\} \geq L\]

where \(L\) is minspan.

Similarly, endspan constraints ensure sufficient observations near domain boundaries.

Condition Number Management

To avoid singular least-squares problems:

  1. Standardize features (default)

  2. Use pseudoinverse when condition number is high

  3. Avoid collinear basis functions (model pruning helps)

Connection to Paper Sections

Section

Topic

Formula/Algorithm

3.1

Recursive Partitioning

Basis function formulation

3.2

Continuity (q=1)

Linear hinge definition

3.3

Generalization

Interaction terms

3.4

MARS Algorithm

Algorithms 1, 2, 3

3.5

ANOVA

Decomposition analysis

3.6

GCV

Equations 30–32

3.7

Cubic Continuity

Equations 34–35

3.8

Knot Optimization

Minspan, Endspan

3.9

Computational

Efficiency strategies

References

The complete references are in References.

Key citations: - Friedman (1991) – Original MARS paper - Friedman & Silverman (1989) – Flexible smoothing methods

Notation Reference

Symbol

Meaning

\(N\)

Number of observations

\(d\)

Number of features

\(M\)

Number of basis functions

\(x_i\)

\(i\)-th observation (vector)

\(y_i\)

\(i\)-th target value

\(B_m(x)\)

\(m\)-th basis function

\(c_m\)

Coefficient for basis \(m\)

\(t\)

Knot location

\(d\)

Direction (+1 or -1)

\(h(x, t, d)\)

Hinge function

\(\alpha\)

Significance level

\(L\)

Minspan

\(L_e\)

Endspan