==================================== Theory & Mathematical Foundation ==================================== .. contents:: :local: :backlinks: none Introduction ============ This section provides the mathematical foundations of MARS, based entirely on: **Friedman, J.H. (1991).** "Multivariate Adaptive Regression Splines." *The Annals of Statistics*, 19(1), 1–67. Every equation, algorithm, and formula in this documentation maps directly to the original paper. Implementation details are noted where applicable. Multivariate Adaptive Regression Splines ========================================= MARS is a **nonparametric regression method** that learns piecewise-linear (or piecewise-cubic) functions through adaptive knot placement. The model has the form: .. math:: \hat{f}(x) = \sum_{m=1}^{M} c_m B_m(x) where: - :math:`M` = number of basis functions - :math:`c_m` = fitted coefficients - :math:`B_m(x)` = basis functions Basis Functions =============== Linear Hinges (Friedman 1991, Eq. 19) ------------------------------------- Each basis function is a **product of univariate hinge functions**: .. math:: B_m(x) = \prod_{k=1}^{K_m} h_k(x) where each hinge function :math:`h_k(x)` has the form: .. math:: h(x_j, t, d) = \max(0, d(x_j - t)) with: - :math:`x_j` = :math:`j`-th predictor - :math:`t` = knot location - :math:`d \in \{-1, +1\}` = direction indicator - :math:`(z)_+ = \max(0, z)` = positive part function **Interpretation:** - :math:`h(x_j, +, t) = (x_j - t)_+` = ReLU on right side of knot - :math:`h(x_j, -, t) = (t - x_j)_+` = ReLU on left side of knot The product creates a "roof function" that is zero outside :math:`[t_1, t_2]`. Cubic Splines (Friedman 1991, Sec. 3.7) ----------------------------------------- For smooth continuity, replace hinges with cubic basis: .. math:: h_{\text{cubic}}(x_j, t, d) = (d(x_j - t))_+^3 This gives continuous second derivatives. Interaction Terms ----------------- Products of hinges naturally create interaction terms: .. math:: B_m(x) = h(x_i, t_1, d_1) \cdot h(x_j, t_2, d_2) This is a **2-way interaction** between variables :math:`i` and :math:`j`. **max_degree** parameter limits interaction order: - max_degree=1: Only main effects - max_degree=2: Up to 2-way interactions - max_degree=k: Up to k-way interactions Knot Selection ============== Minspan Formula (Friedman 1991, Eq. 43) ----------------------------------------- Minimum observations between knots: .. math:: L = \left\lfloor -\frac{\log_2(\alpha / N)}{2.5} \right\rfloor where: - :math:`N` = sample size - :math:`\alpha` = significance level (default 0.05) **Interpretation:** Enforces a minimum gap between knots to avoid overfitting. Endspan Formula (Friedman 1991, Eq. 45) ----------------------------------------- Minimum observations from domain boundaries: .. math:: L_e = \left\lceil 3 - \log_2(\alpha / N) \right\rceil **Interpretation:** Prevents knots very close to the endpoints of the feature range. The Forward Pass ================ Algorithm Overview ------------------ The forward pass is a **greedy basis expansion** algorithm (Friedman 1991, Algorithm 2): **Input:** Data :math:`(x_i, y_i)`, :math:`i=1,\ldots,N` **Output:** Set of basis functions :math:`\mathcal{B}` **Algorithm:** .. code-block:: text Initialize: B = {1} (constant term) For m = 1 to M_max: For each basis function b in B: For each variable j in 1..d: For each observation i in valid knots (respecting minspan/endspan): For each direction d in {+1, -1}: Fit pair: h^+ = h(x_j, t_i, +) * b h^- = h(x_j, t_i, -) * b Compute residuals Calculate RSS reduction Add pair with maximum RSS reduction to B Return B **Key Points:** 1. **Greedy selection:** Always pick the pair reducing RSS most 2. **Pairwise expansion:** Adds two basis functions per iteration 3. **Knot respects constraints:** minspan and endspan enforced 4. **Continuation:** Continues until :math:`M_{\max}` basis functions Least Squares Solve ------------------- At each iteration, solve: .. math:: \hat{c} = \arg\min_c \sum_{i=1}^{N} \left(y_i - \sum_m c_m B_m(x_i)\right)^2 Using QR decomposition → Cholesky → pseudoinverse chain (see :doc:`advanced_topics`). The Backward Pass ================= Generalized Cross-Validation (GCV) ----------------------------------- After forward pass, **prune basis functions** using GCV (Friedman 1991, Eq. 30): .. math:: \text{GCV}(M) = \frac{\text{RSS}(M)}{N(1 - C(M)/N)^2} where **complexity penalty** is: .. math:: C(M) = \text{trace}(B(B^T B)^{-1}B^T) + dM with: - :math:`B` = :math:`N \times M` design matrix - :math:`d` = penalty per basis function (typically 2–3) **Interpretation:** - Numerator: Residual sum of squares (fit quality) - Denominator: Complexity adjustment (penalizes more terms) - Lower GCV = better model selection Backward Pruning Algorithm --------------------------- **Algorithm** (Friedman 1991, Algorithm 3): .. code-block:: text Start: B = set of all basis functions from forward pass While |B| > 1: For each basis function b in B: Remove b from B Fit model Compute GCV(B) Add b back b* = basis function whose removal minimizes GCV Remove b* from B If GCV(B) not improved: Restore all removed terms Break Return final B **Result:** A sequence of nested models with decreasing complexity. **Model Selection:** Choose model with minimum GCV score. GCV Complexity Analysis ======================== Design Matrix ------------- Let :math:`B` be the :math:`N \times M` design matrix where each column is a basis function. Effective Degrees of Freedom ----------------------------- The trace term in Eq. 32: .. math:: \text{df}(M) = \text{trace}(B(B^T B)^{-1}B^T) represents the **effective degrees of freedom**. This is between :math:`M` (if columns orthogonal) and :math:`MN` (saturated). Penalty Parameter ------------------ The GCV penalty :math:`d` controls pruning aggressiveness: - :math:`d = 2` (additive models): Less aggressive pruning - :math:`d = 3` (general models): Standard pruning - :math:`d = 4` (conservative): Very aggressive pruning **Effect:** Higher :math:`d` → simpler models (more pruning). Model Equations Summary ======================== **Fitted Model:** .. math:: \hat{f}(x) = c_0 + \sum_{m=1}^{M} c_m \prod_{k=1}^{K_m} h_k(x) **Loss Function (RSS):** .. math:: \text{RSS} = \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2 **GCV Selection Criterion:** .. math:: \text{GCV} = \frac{\text{RSS}}{N(1 - C(M)/N)^2} **Complexity Penalty:** .. math:: C(M) = \text{trace}(B(B^T B)^{-1}B^T) + dM Cubic Spline Extension ====================== Cubic Basis (Friedman 1991, Sec. 3.7) --------------------------------------- For smoother models with continuous 2nd derivatives, use cubic hinges: .. math:: h_{\text{cubic}}(x_j, t, d) = [d(x_j - t)]_+^3 This replaces linear hinges in the basis expansion. **Continuity Properties:** - :math:`C^0` continuity: Guaranteed (hinges are continuous) - :math:`C^1` continuity: Guaranteed (first derivative continuous) - :math:`C^2` continuity: Guaranteed (second derivative continuous) **Motivation:** Cubic bases are smoother but still interpretable, avoiding overly wiggly functions. Knot Placement (Cubic) ----------------------- When converting to cubic, place "side knots" to ensure cubic smoothness: For a hinge at knot :math:`t`, place side knots at: .. math:: t^- = \text{median}(x_j : x_j < t) t^+ = \text{median}(x_j : x_j > t) This creates a cubic "power basis" with :math:`C^2` continuity everywhere. The cubic coefficient becomes: .. math:: r^+ = \frac{2}{(t^+ - t^-)^3} (Friedman 1991, Eq. 34–35) Numerical Implementation ======================== Standardization --------------- Features are standardized to mean 0, variance 1: .. math:: x_j^{(std)} = \frac{x_j - \bar{x}_j}{s_j} where :math:`\bar{x}_j` and :math:`s_j` are sample mean and standard deviation. **Benefits:** - Improves numerical stability - Makes knot locations comparable across scales - Makes coefficients interpretable (standardized effect sizes) Knot Validity -------------- A knot at :math:`t` for variable :math:`j` is **valid** if: .. math:: \#\{i : x_{ij} < t\} \geq L \quad \text{and} \quad \#\{i : x_{ij} > t\} \geq L where :math:`L` is minspan. Similarly, **endspan** constraints ensure sufficient observations near domain boundaries. Condition Number Management ---------------------------- To avoid singular least-squares problems: 1. Standardize features (default) 2. Use pseudoinverse when condition number is high 3. Avoid collinear basis functions (model pruning helps) Connection to Paper Sections ============================= .. list-table:: :header-rows: 1 * - Section - Topic - Formula/Algorithm * - 3.1 - Recursive Partitioning - Basis function formulation * - 3.2 - Continuity (q=1) - Linear hinge definition * - 3.3 - Generalization - Interaction terms * - 3.4 - MARS Algorithm - Algorithms 1, 2, 3 * - 3.5 - ANOVA - Decomposition analysis * - 3.6 - GCV - Equations 30–32 * - 3.7 - Cubic Continuity - Equations 34–35 * - 3.8 - Knot Optimization - Minspan, Endspan * - 3.9 - Computational - Efficiency strategies References ========== The complete references are in :doc:`references`. Key citations: - Friedman (1991) – Original MARS paper - Friedman & Silverman (1989) – Flexible smoothing methods Notation Reference ================== .. list-table:: :header-rows: 1 * - Symbol - Meaning * - :math:`N` - Number of observations * - :math:`d` - Number of features * - :math:`M` - Number of basis functions * - :math:`x_i` - :math:`i`-th observation (vector) * - :math:`y_i` - :math:`i`-th target value * - :math:`B_m(x)` - :math:`m`-th basis function * - :math:`c_m` - Coefficient for basis :math:`m` * - :math:`t` - Knot location * - :math:`d` - Direction (+1 or -1) * - :math:`h(x, t, d)` - Hinge function * - :math:`\alpha` - Significance level * - :math:`L` - Minspan * - :math:`L_e` - Endspan