Model Selection & GCV

Overview

Model selection is a critical component of MARS. The algorithm automatically selects the “right-sized” model using Generalized Cross-Validation (GCV), which balances fit quality against model complexity.

The GCV Problem

Central Question: How many basis functions should we keep?

Too few: Underfitting (high bias, poor R²)
Too many: Overfitting (high variance, poor generalization)

Solution: Use a criterion that penalizes both:

Training error (RSS) – How well the model fits the data
Model complexity (df) – How many parameters we estimated

The GCV Formula

Generalized Cross-Validation (Friedman 1991, Eq. 30):

\[\text{GCV}(M) = \frac{\text{RSS}(M)}{N \left(1 - \frac{C(M)}{N}\right)^2}\]

where:

Residual Sum of Squares:

\[\text{RSS}(M) = \sum_{i=1}^{N} (y_i - \hat{f}_M(x_i))^2\]

Effective Degrees of Freedom:

\[C(M) = \text{trace}(B(B^T B)^{-1}B^T) + d \cdot M\]

with: - \(B\) = \(N \times M\) design matrix (M basis functions) - \(d\) = penalty per basis function (typically 2–3) - \(M\) = number of basis functions

Intuition

\[\text{GCV} = \frac{\text{fit error}}{\text{complexity penalty}^2}\]

As M increases: - Numerator (RSS) decreases → better fit - Denominator penalty increases → penalizes complexity

Optimal M: Minimizes GCV score

Example Curve:

GCV Score
   ▲
   │     (underfitting)          (overfitting)
   │         ╱╲                 ╱
   │        ╱  ╲               ╱
   │       ╱    ╲             ╱
   │      ╱      ╲___________╱ ← minimum (optimal)
   │     ╱
   └─────────────────────────────► M (basis functions)
      0   5   10  15  20  25

The Forward Pass

Greedy basis expansion that usually overshoots the optimal model size.

Process:

Start with constant model (M=1)
Add basis function pairs sequentially
Continue until M = max_terms (overfit)
Result: Overly complex model with high variance

Example:

Iteration 1: M=1 (constant)     RSS = 10.5
Iteration 2: M=3 (+ 2 hinges)   RSS = 5.2
Iteration 3: M=5               RSS = 2.8
Iteration 4: M=7               RSS = 1.5
...
Iteration 15: M=29             RSS = 0.01 (overfitting!)

The Backward Pass

Goal: Remove redundant basis functions using GCV.

Process:

Start with full model from forward pass (overfitted)
Try removing each basis function one at a time
Refit and compute GCV
Permanently remove function that minimizes GCV
Repeat until no improvement

Example Sequence:

Model 1: M=29  GCV=0.285
Model 2: M=27  GCV=0.127  ← better! (removed 2 hinges)
Model 3: M=25  GCV=0.089  ← better! (removed 2 more)
Model 4: M=23  GCV=0.074  ← better!
Model 5: M=21  GCV=0.062  ← better!
Model 6: M=19  GCV=0.058  ← better!
Model 7: M=17  GCV=0.063  ← worse! (stop pruning)

✓ Select Model 6 (M=19) with minimum GCV=0.058

Implementation

GCVCalculator Class

from pymars.gcv import GCVCalculator

# Create calculator
gcv_calc = GCVCalculator(penalty=3.0)

# Compute GCV
gcv_score = gcv_calc.calculate(B, y)

where:
  B = Design matrix (N x M)
  y = Target vector (N,)
  penalty = d in formula above

  Returns: scalar GCV score

Computing GCV Step-by-Step

Step 1: Fit model

# Solve least squares: c = (B^T B)^-1 B^T y
c = solve_least_squares(B, y)

Step 2: Compute RSS

residuals = y - B @ c
RSS = np.sum(residuals**2)

Step 3: Compute effective degrees of freedom

# df = trace(B @ inv(B^T B) @ B^T) + d * M

# Method 1 (explicit trace)
BtB_inv = np.linalg.pinv(B.T @ B)
trace_term = np.trace(B @ BtB_inv @ B.T)

# Method 2 (via SVD, more stable)
U, s, Vt = np.linalg.svd(B, full_matrices=False)
trace_term = np.sum(s**2 / (s**2 + 1e-10))

df = trace_term + penalty * B.shape[1]

Step 4: Compute GCV

N = B.shape[0]
gcv = RSS / (N * (1 - df/N)**2)

Penalty Parameter

The penalty \(d\) in the GCV formula controls pruning severity.

Common Values:

Penalty	Model Type	Effect
2.0	Additive	Less aggressive pruning
2.5	Balanced	Medium pruning
3.0	General (default)	Standard pruning
4.0	Conservative	Very aggressive pruning

How to Choose:

# Additive model (no interactions)
model = MARS(penalty=2.0)

# General model with interactions
model = MARS(penalty=3.0)  # default

# Very parsimonious model
model = MARS(penalty=4.0)

Effect on Final Model Size:

Higher penalty → fewer basis functions → simpler model

Example:

for penalty in [2.0, 3.0, 4.0]:
    model = MARS(penalty=penalty)
    model.fit(X, y)
    print(f"penalty={penalty}: {len(model.basis_functions_)} basis")

# Output:
# penalty=2.0: 18 basis
# penalty=3.0: 12 basis
# penalty=4.0: 8 basis

Alternative: BIC

Instead of GCV, you can use Bayesian Information Criterion (BIC):

\[\text{BIC}(M) = N \log(\text{RSS}/N) + M \log(N)\]

Comparison:

Criterion	Formula	Penalty Type	Typical Result
GCV	RSS / [N(1 - df/N)²]	Automatic (data-driven)	Intermediate complexity
AIC	RSS + 2M	Linear (2 per term)	Slightly more complex
BIC	RSS + M*log(N)	Increases with N	Less complex for large N
Cp (Mallows)	RSS + 2σ²M	Requires σ² estimate	Similar to AIC

When to use alternatives:

BIC: If you have large sample (N > 1000)
AIC: If you prefer less aggressive pruning
Cp: If you have reliable error variance estimate

Manual Model Selection

You can also manually inspect the GCV sequence and choose:

from pymars.model import BackwardPass

# Get full backward sequence
bp = BackwardPass(basis_functions, coefficients, X, y, penalty=3.0)
sequence = bp.run()

# Print GCV sequence
for i, (model_info, gcv_score) in enumerate(sequence):
    n_basis = len(model_info['basis_functions'])
    print(f"Model {i}: {n_basis} basis, GCV={gcv_score:.6f}")

# Manually select (e.g., prefer sparsity)
selected_idx = 5  # Pick model 5 instead of minimum GCV
selected_model = sequence[selected_idx]

Cross-Validation vs GCV

GCV Advantages:

✅ Fast (no refitting)
✅ Deterministic (no randomness)
✅ Works well in practice
✅ Computationally efficient

Cross-Validation Advantages:

✅ Unbiased generalization estimate
✅ Works even if GCV assumption violated
✅ More flexible (any loss function)

Comparison:

from sklearn.model_selection import cross_val_score

model = MARS(penalty=3.0)
model.fit(X, y)

# GCV score (automatic, from fitting)
print(f"GCV: {model.gcv_score_:.6f}")

# 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV Mean: {cv_scores.mean():.6f}")
print(f"CV Std: {cv_scores.std():.6f}")

Rule of Thumb: GCV and cross-validation usually agree (~within 10%).

Hyperparameter Tuning

Use GridSearchCV to find optimal penalty:

from sklearn.model_selection import GridSearchCV

# Parameter grid
params = {
    'penalty': [2.0, 2.5, 3.0, 3.5, 4.0],
    'max_terms': [20, 30, 40],
    'max_degree': [1, 2]
}

# Grid search with cross-validation
grid = GridSearchCV(
    MARS(),
    params,
    cv=5,
    scoring='r2'
)
grid.fit(X, y)

# Results
print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.6f}")

# Use best model
best_model = grid.best_estimator_
best_model.fit(X, y)

Visualizing Model Selection

Plot GCV vs model complexity:

import matplotlib.pyplot as plt
from pymars.model import BackwardPass

# Get backward sequence
sequence = bp.run()

n_basis_list = []
gcv_scores = []

for model_info, gcv in sequence:
    n_basis = len(model_info['basis_functions'])
    n_basis_list.append(n_basis)
    gcv_scores.append(gcv)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(n_basis_list, gcv_scores, 'o-', linewidth=2, markersize=8)

# Mark minimum
min_idx = np.argmin(gcv_scores)
plt.scatter(n_basis_list[min_idx], gcv_scores[min_idx],
            color='red', s=200, marker='*', zorder=5,
            label=f'Minimum at {n_basis_list[min_idx]} basis')

plt.xlabel('Number of Basis Functions')
plt.ylabel('GCV Score')
plt.title('Model Selection: GCV vs Complexity')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Common Issues

GCV Increases (No Improvement)

Problem: GCV goes up as you prune; backward pass doesn’t remove anything.

Causes: - Forward pass already found near-optimal model - max_terms too low - penalty too high

Solution:

# Reduce penalty (less aggressive pruning)
model = MARS(penalty=2.5)

GCV Unstable

Problem: GCV fluctuates wildly; hard to interpret.

Causes: - Small sample size - High condition number

Solution:

# Increase penalty (smooth out pruning)
# Or increase max_terms to get better selection
model = MARS(max_terms=50, penalty=3.5)

Overfitting Despite GCV

Problem: High training R² but low test R².

Causes: - GCV assumption violated - Noisy data with complicated true function

Solution:

# Use cross-validation penalty
# Or increase penalty
model = MARS(penalty=4.0)

# Always validate on test set
test_r2 = model.score(X_test, y_test)

Best Practices

Always standardize features – improves numerical stability
Use default penalty=3.0 – proven default for general models
Validate on test set – GCV is training estimate
Try multiple penalties – ensemble or sensitivity check
Plot GCV sequence – understand model selection process
Check residuals – ensure model assumptions met

Summary

Key Points:

✅ GCV automatically balances fit vs. complexity
✅ Forward pass overshoots (overfits); backward prunes
✅ Penalty parameter controls pruning severity
✅ Default penalty=3.0 works well
✅ Always validate final model on test data
✅ Cross-validation provides independent estimate

Formula to remember:

\[\text{GCV} = \frac{\text{RSS}}{N(1 - C(M)/N)^2}\]

Where:

\[C(M) = \text{trace}(B(B^T B)^{-1}B^T) + dM\]