Model Selection & GCV
Overview
Model selection is a critical component of MARS. The algorithm automatically selects the “right-sized” model using Generalized Cross-Validation (GCV), which balances fit quality against model complexity.
The GCV Problem
Central Question: How many basis functions should we keep?
Too few: Underfitting (high bias, poor R²)
Too many: Overfitting (high variance, poor generalization)
Solution: Use a criterion that penalizes both:
Training error (RSS) – How well the model fits the data
Model complexity (df) – How many parameters we estimated
The GCV Formula
Generalized Cross-Validation (Friedman 1991, Eq. 30):
where:
Residual Sum of Squares:
Effective Degrees of Freedom:
with: - \(B\) = \(N \times M\) design matrix (M basis functions) - \(d\) = penalty per basis function (typically 2–3) - \(M\) = number of basis functions
Intuition
As M increases: - Numerator (RSS) decreases → better fit - Denominator penalty increases → penalizes complexity
Optimal M: Minimizes GCV score
Example Curve:
GCV Score
▲
│ (underfitting) (overfitting)
│ ╱╲ ╱
│ ╱ ╲ ╱
│ ╱ ╲ ╱
│ ╱ ╲___________╱ ← minimum (optimal)
│ ╱
└─────────────────────────────► M (basis functions)
0 5 10 15 20 25
The Forward Pass
Greedy basis expansion that usually overshoots the optimal model size.
Process:
Start with constant model (M=1)
Add basis function pairs sequentially
Continue until M = max_terms (overfit)
Result: Overly complex model with high variance
Example:
Iteration 1: M=1 (constant) RSS = 10.5
Iteration 2: M=3 (+ 2 hinges) RSS = 5.2
Iteration 3: M=5 RSS = 2.8
Iteration 4: M=7 RSS = 1.5
...
Iteration 15: M=29 RSS = 0.01 (overfitting!)
The Backward Pass
Goal: Remove redundant basis functions using GCV.
Process:
Start with full model from forward pass (overfitted)
Try removing each basis function one at a time
Refit and compute GCV
Permanently remove function that minimizes GCV
Repeat until no improvement
Example Sequence:
Model 1: M=29 GCV=0.285
Model 2: M=27 GCV=0.127 ← better! (removed 2 hinges)
Model 3: M=25 GCV=0.089 ← better! (removed 2 more)
Model 4: M=23 GCV=0.074 ← better!
Model 5: M=21 GCV=0.062 ← better!
Model 6: M=19 GCV=0.058 ← better!
Model 7: M=17 GCV=0.063 ← worse! (stop pruning)
✓ Select Model 6 (M=19) with minimum GCV=0.058
Implementation
GCVCalculator Class
from pymars.gcv import GCVCalculator
# Create calculator
gcv_calc = GCVCalculator(penalty=3.0)
# Compute GCV
gcv_score = gcv_calc.calculate(B, y)
where:
B = Design matrix (N x M)
y = Target vector (N,)
penalty = d in formula above
Returns: scalar GCV score
Computing GCV Step-by-Step
Step 1: Fit model
# Solve least squares: c = (B^T B)^-1 B^T y
c = solve_least_squares(B, y)
Step 2: Compute RSS
residuals = y - B @ c
RSS = np.sum(residuals**2)
Step 3: Compute effective degrees of freedom
# df = trace(B @ inv(B^T B) @ B^T) + d * M
# Method 1 (explicit trace)
BtB_inv = np.linalg.pinv(B.T @ B)
trace_term = np.trace(B @ BtB_inv @ B.T)
# Method 2 (via SVD, more stable)
U, s, Vt = np.linalg.svd(B, full_matrices=False)
trace_term = np.sum(s**2 / (s**2 + 1e-10))
df = trace_term + penalty * B.shape[1]
Step 4: Compute GCV
N = B.shape[0]
gcv = RSS / (N * (1 - df/N)**2)
Penalty Parameter
The penalty \(d\) in the GCV formula controls pruning severity.
Common Values:
Penalty |
Model Type |
Effect |
|---|---|---|
2.0 |
Additive |
Less aggressive pruning |
2.5 |
Balanced |
Medium pruning |
3.0 |
General (default) |
Standard pruning |
4.0 |
Conservative |
Very aggressive pruning |
How to Choose:
# Additive model (no interactions)
model = MARS(penalty=2.0)
# General model with interactions
model = MARS(penalty=3.0) # default
# Very parsimonious model
model = MARS(penalty=4.0)
Effect on Final Model Size:
Higher penalty → fewer basis functions → simpler model
Example:
for penalty in [2.0, 3.0, 4.0]:
model = MARS(penalty=penalty)
model.fit(X, y)
print(f"penalty={penalty}: {len(model.basis_functions_)} basis")
# Output:
# penalty=2.0: 18 basis
# penalty=3.0: 12 basis
# penalty=4.0: 8 basis
Alternative: BIC
Instead of GCV, you can use Bayesian Information Criterion (BIC):
Comparison:
Criterion |
Formula |
Penalty Type |
Typical Result |
|---|---|---|---|
GCV |
RSS / [N(1 - df/N)²] |
Automatic (data-driven) |
Intermediate complexity |
AIC |
RSS + 2M |
Linear (2 per term) |
Slightly more complex |
BIC |
RSS + M*log(N) |
Increases with N |
Less complex for large N |
Cp (Mallows) |
RSS + 2σ²M |
Requires σ² estimate |
Similar to AIC |
When to use alternatives:
BIC: If you have large sample (N > 1000)
AIC: If you prefer less aggressive pruning
Cp: If you have reliable error variance estimate
Manual Model Selection
You can also manually inspect the GCV sequence and choose:
from pymars.model import BackwardPass
# Get full backward sequence
bp = BackwardPass(basis_functions, coefficients, X, y, penalty=3.0)
sequence = bp.run()
# Print GCV sequence
for i, (model_info, gcv_score) in enumerate(sequence):
n_basis = len(model_info['basis_functions'])
print(f"Model {i}: {n_basis} basis, GCV={gcv_score:.6f}")
# Manually select (e.g., prefer sparsity)
selected_idx = 5 # Pick model 5 instead of minimum GCV
selected_model = sequence[selected_idx]
Cross-Validation vs GCV
GCV Advantages:
✅ Fast (no refitting)
✅ Deterministic (no randomness)
✅ Works well in practice
✅ Computationally efficient
Cross-Validation Advantages:
✅ Unbiased generalization estimate
✅ Works even if GCV assumption violated
✅ More flexible (any loss function)
Comparison:
from sklearn.model_selection import cross_val_score
model = MARS(penalty=3.0)
model.fit(X, y)
# GCV score (automatic, from fitting)
print(f"GCV: {model.gcv_score_:.6f}")
# 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV Mean: {cv_scores.mean():.6f}")
print(f"CV Std: {cv_scores.std():.6f}")
Rule of Thumb: GCV and cross-validation usually agree (~within 10%).
Hyperparameter Tuning
Use GridSearchCV to find optimal penalty:
from sklearn.model_selection import GridSearchCV
# Parameter grid
params = {
'penalty': [2.0, 2.5, 3.0, 3.5, 4.0],
'max_terms': [20, 30, 40],
'max_degree': [1, 2]
}
# Grid search with cross-validation
grid = GridSearchCV(
MARS(),
params,
cv=5,
scoring='r2'
)
grid.fit(X, y)
# Results
print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.6f}")
# Use best model
best_model = grid.best_estimator_
best_model.fit(X, y)
Visualizing Model Selection
Plot GCV vs model complexity:
import matplotlib.pyplot as plt
from pymars.model import BackwardPass
# Get backward sequence
sequence = bp.run()
n_basis_list = []
gcv_scores = []
for model_info, gcv in sequence:
n_basis = len(model_info['basis_functions'])
n_basis_list.append(n_basis)
gcv_scores.append(gcv)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(n_basis_list, gcv_scores, 'o-', linewidth=2, markersize=8)
# Mark minimum
min_idx = np.argmin(gcv_scores)
plt.scatter(n_basis_list[min_idx], gcv_scores[min_idx],
color='red', s=200, marker='*', zorder=5,
label=f'Minimum at {n_basis_list[min_idx]} basis')
plt.xlabel('Number of Basis Functions')
plt.ylabel('GCV Score')
plt.title('Model Selection: GCV vs Complexity')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Common Issues
GCV Increases (No Improvement)
Problem: GCV goes up as you prune; backward pass doesn’t remove anything.
Causes: - Forward pass already found near-optimal model - max_terms too low - penalty too high
Solution:
# Reduce penalty (less aggressive pruning)
model = MARS(penalty=2.5)
GCV Unstable
Problem: GCV fluctuates wildly; hard to interpret.
Causes: - Small sample size - High condition number
Solution:
# Increase penalty (smooth out pruning)
# Or increase max_terms to get better selection
model = MARS(max_terms=50, penalty=3.5)
Overfitting Despite GCV
Problem: High training R² but low test R².
Causes: - GCV assumption violated - Noisy data with complicated true function
Solution:
# Use cross-validation penalty
# Or increase penalty
model = MARS(penalty=4.0)
# Always validate on test set
test_r2 = model.score(X_test, y_test)
Best Practices
Always standardize features – improves numerical stability
Use default penalty=3.0 – proven default for general models
Validate on test set – GCV is training estimate
Try multiple penalties – ensemble or sensitivity check
Plot GCV sequence – understand model selection process
Check residuals – ensure model assumptions met
Summary
Key Points:
✅ GCV automatically balances fit vs. complexity
✅ Forward pass overshoots (overfits); backward prunes
✅ Penalty parameter controls pruning severity
✅ Default penalty=3.0 works well
✅ Always validate final model on test data
✅ Cross-validation provides independent estimate
Formula to remember:
Where: