Advanced Topics

Numerical Solver Details

Least Squares Stability Chain

PyMARS uses a robust three-tier fallback strategy for solving least squares:

def solve_least_squares(B, y):
    """
    Solve: min_c ||y - B @ c||_2^2

    Uses fallback chain:
    1. QR decomposition (preferred, numerically stable)
    2. Cholesky on B^T B
    3. Pseudoinverse (most robust)
    """
    try:
        # Tier 1: QR decomposition
        Q, R = np.linalg.qr(B, mode='reduced')
        c = np.linalg.solve(R, Q.T @ y)

    except np.linalg.LinAlgError:
        try:
            # Tier 2: Cholesky
            L = np.linalg.cholesky(B.T @ B)
            c = np.linalg.solve(
                L.T,
                np.linalg.solve(L, B.T @ y)
            )
        except np.linalg.LinAlgError:
            # Tier 3: Pseudoinverse (always works)
            c = np.linalg.pinv(B) @ y

    return c

Why This Matters:

QR: Fast, backward stable, preferred
Cholesky: Medium speed, less stable
Pseudoinverse: Slowest but always works

Condition Number Management

For ill-conditioned problems:

import numpy as np

# Check condition number
BtB = B.T @ B
cond = np.linalg.cond(BtB)

if cond > 1e10:
    print("WARNING: Ill-conditioned problem")
    print(f"Condition number: {cond:.2e}")

    # Solutions:
    # 1. Use pseudoinverse
    c = np.linalg.pinv(B) @ y

    # 2. Standardize features (default in MARS)
    # 3. Remove collinear features
    # 4. Increase regularization (not in MARS)

Computational Complexity

Algorithm Complexity Analysis

Forward Pass:

\[O(M_{\max} \cdot d \cdot N \cdot |\text{knots}| \cdot N^2)\]

Breakdown: - \(M_{\max}\) iterations - \(d\) variables per iteration - \(N\) knot candidates per variable - \(N^2\) for least squares (per candidate)

For typical MARS:

Forward: \(O(30 \times 10 \times 200 \times 30 \times 40000) \approx 7.2 \times 10^{11}\) ops
But: Knot search is pruned, least squares is fast with QR

Backward Pass:

\[O(M \cdot M \cdot N^2)\]

Faster than forward, dominated by least squares refitting.

Total: O(1-2 minutes) for typical dataset (N=1000, d=20)

Memory Usage

Peak memory scales with:

\[\text{Memory} \approx 8 \times (N \times M + d \times M + \text{workspace})\]

For N=1000, M=50, d=20: - Design matrix B: 1000 × 50 = 50k elements = 400 KB - Coefficients: 50 elements = 400 B - QR workspace: ~20 MB - Total: ~20-30 MB

Optimization Strategies

For Large Datasets

If N > 10,000:

model = MARS(
    max_terms=20,          # Reduce complexity
    minspan=50,            # Fewer knots to test
    endspan=10,            # Fewer endpoint candidates
    verbose=True           # Monitor progress
)

For High-Dimensional Data

If d > 50:

# 1. Dimension reduction first
from sklearn.decomposition import PCA
pca = PCA(n_components=20)
X_reduced = pca.fit_transform(X)

# 2. Or use sparse MARS
model = MARS(
    max_terms=30,
    max_degree=1,          # No interactions for speed
    penalty=2.0            # More pruning
)

For Real-Time Prediction

# Prediction is fast: O(M) per sample
# M = basis functions (typically < 50)

# Example: 100,000 predictions in < 1 second
import time

start = time.time()
y_pred = model.predict(X_test)  # X_test.shape = (100000, 20)
elapsed = time.time() - start

rate = len(X_test) / elapsed
print(f"Prediction rate: {rate:.0f} samples/sec")
# Output: ~200,000-500,000 samples/sec

Caching Strategies

If Refitting Models Repeatedly

Cache basis function evaluations:

# Evaluate all bases once
B_cache = {}
for bf in model.basis_functions_:
    B_cache[bf.id] = bf.evaluate(X)

# Now comparisons are O(M) instead of O(M*N)
for penalty in [2.0, 2.5, 3.0]:
    # Use cached B instead of recomputing
    gcv = compute_gcv(B_cache, y, penalty)

Parallel Processing

Knot Search Parallelization

Forward pass can be parallelized:

from joblib import Parallel, delayed

def search_variable(X_j, y, current_basis):
    """Search all knots for one variable."""
    knots = get_candidate_knots(X_j, minspan, endspan)

    best_rss = float('inf')
    best_pair = None

    for t in knots:
        # Fit and evaluate pair
        # ... (as in forward pass)

    return best_pair, best_rss

# Parallel loop over variables
results = Parallel(n_jobs=-1)(
    delayed(search_variable)(X[:, j], y, current_basis)
    for j in range(X.shape[1])
)

Note: PyMARS doesn’t use built-in parallelization by default to maintain simplicity. Users can add it as shown above.

Debugging & Profiling

Verbose Output

model = MARS(verbose=True)
model.fit(X, y)

# Shows iteration progress

Profiling Fit Time

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

model = MARS()
model.fit(X, y)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 functions by time

Memory Profiling

from memory_profiler import profile

@profile
def fit_mars():
    model = MARS()
    model.fit(X, y)
    return model

fit_mars()

# Output shows memory usage per line

Edge Cases & Robustness

Constant Features

If a feature has no variance:

# Detection
if np.std(X[:, j]) < 1e-10:
    print(f"Feature {j} is constant, removing...")
    X = np.delete(X, j, axis=1)

Single Knot Problem

If all data values are identical:

# Add small noise or handle specially
if len(np.unique(X[:, j])) == 1:
    # Can't place knots; skip this variable
    pass

Singular Design Matrix

If B^T B is singular:

# Automatic fallback to pseudoinverse
# (handled in solve_least_squares)

# Or manually use pseudoinverse
c = np.linalg.pinv(B) @ y

Perfect Fit (RSS = 0)

If model fits perfectly (rare):

# GCV formula becomes 0 / 0
# Handled gracefully:

if RSS < 1e-10:
    gcv = 0.0  # Perfect fit
else:
    gcv = RSS / (N * (1 - df/N)**2)

Extension Points

Custom Basis Functions

Implement your own by subclassing:

from pymars.basis import HingeFunction

class SineHinge(HingeFunction):
    """Custom: sin(x) instead of hinge"""

    def __call__(self, X):
        x = X[:, self.variable]
        return np.sin(x - self.knot)

Custom Loss Functions

MARS uses squared loss by default. For custom losses:

def fit_with_custom_loss(X, y, loss_fn, loss_grad):
    """
    Gradient descent instead of least squares.
    (Not in standard MARS)
    """
    # Iterative refinement using loss_grad
    pass

Custom Penalty Functions

Instead of constant penalty \(d\):

def adaptive_penalty(M, n_interactions):
    """Penalty that increases with interactions."""
    base_penalty = 3.0
    interaction_penalty = 0.5 * n_interactions
    return base_penalty + interaction_penalty

Advanced: Fine-Tuning the Algorithm

Modifying Knot Density

Change how many knots are tested:

# Default: test all unique sorted values
# Custom: test every k-th value

def sparse_knot_candidates(X_j, k=5):
    """Test every k-th unique value."""
    unique_vals = np.unique(X_j)
    return unique_vals[::k]

Interaction Constraints

Add domain-specific constraints:

class ConstrainedMARS(MARS):
    """MARS with interaction constraints."""

    def _find_best_split(self, basis, y):
        """Override to add constraints."""

        # ... standard search ...

        # Check constraint before adding
        if self._is_valid_interaction(basis, new_hinge):
            # Keep candidate
            pass

Algorithm Variants

Boosting MARS

Stack multiple MARS models on residuals:

def boost_mars(X, y, n_boosters=5, learning_rate=0.1):
    """Boosted MARS ensemble."""

    residuals = y.copy()
    models = []

    for i in range(n_boosters):
        model = MARS(max_terms=10)
        model.fit(X, residuals)

        pred = model.predict(X)
        residuals = residuals - learning_rate * pred

        models.append(model)

    return models

def predict_boosted(models, X, learning_rate=0.1):
    """Predict with boosted ensemble."""

    y_pred = np.zeros(len(X))
    for model in models:
        y_pred += learning_rate * model.predict(X)

    return y_pred

Sparse MARS

Force sparsity via L1 penalty:

# Not built-in; requires modifying least squares
# Use Elastic Net instead of OLS:

from sklearn.linear_model import ElasticNet

def solve_sparse(B, y, alpha=0.1, l1_ratio=0.5):
    """Solve with L1 + L2 penalty."""

    en = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    en.fit(B, y)
    return en.coef_

Summary

Key Advanced Topics:

✅ Numerical stability via fallback chain
✅ Complexity analysis and optimization
✅ Memory-efficient computation
✅ Caching and parallelization strategies
✅ Debugging and profiling tools
✅ Extension points for customization
✅ Algorithm variants (boosting, sparse)

When to Use Advanced Features:

Large-scale datasets: Optimize memory and time
Custom applications: Extend with domain knowledge
Research: Implement algorithm variants
Production: Add robustness and monitoring