--- title: "Scaling KRLS to larger n: the Nystrom approximation" author: "Jens Hainmueller and Chad Hazlett" date: "`r format(Sys.Date(), '%B %Y')`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Scaling KRLS to larger n: the Nystrom approximation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5, dpi = 96 ) set.seed(2026) ``` ## Motivation Exact KRLS solves a Tikhonov problem on the full $n \times n$ Gaussian kernel. Time is $O(n^3)$ and memory is $O(n^2)$, so the practical ceiling on a laptop is around $n \approx 5{,}000$. For larger samples the kernel allocation alone becomes prohibitive. The Nystrom approximation replaces the full kernel with a low-rank representation anchored at $m$ landmark points. The fitted model is $\hat f(x) = K(x, Z)\, \alpha$ where $Z$ are the landmarks and $\alpha$ is a length-$m$ coefficient vector solved in $m$-dimensional space. Time becomes $O(n m^2 + m^3)$ and memory $O(n m)$, so KRLS can now handle sample sizes well past the exact ceiling. KRLS exposes this as an explicit approximation: ```r krls(X, y, approx = "nystrom", nystrom_m = 100) ``` Default landmark count is `min(500, n)`. Inference (when `vcov = TRUE`) is *conditional approximate* -- standard errors condition on the selected landmarks, fixed $\lambda$, and the low-rank feature map; they are not equivalent to exact KRLS inference. ```{r setup} library(KRLS) ``` ## A scaling comparison We simulate a smooth function in three predictors plus a binary indicator, and compare exact KRLS to Nystrom across modest $n$. ```{r dgp} make_data <- function(n) { X <- cbind(rnorm(n), rnorm(n), rnorm(n), rbinom(n, 1, 0.4)) colnames(X) <- c("x1", "x2", "x3", "x4") f <- sin(X[, 1]) + 0.3 * X[, 2] + 0.05 * X[, 3]^2 + 0.5 * X[, 4] list(X = X, y = f + rnorm(n, sd = 0.3), truth = f) } ``` For each $n$ we fit both paths, record wall-clock time, in-sample $R^2$, and prediction RMSE on a held-out test set drawn from the same distribution. Predictions are evaluated against the true mean function (no noise on the test target) to focus on approximation error. ```{r bench, cache = FALSE} sizes <- c(500, 1000) res <- vector("list", length(sizes)) for (i in seq_along(sizes)) { n <- sizes[i] tr <- make_data(n) te <- make_data(200) # Use the package default landmark count, min(500, n). m <- min(500L, n) t_e <- system.time({ fit_e <- krls(tr$X, tr$y, approx = "none", derivative = FALSE, vcov = FALSE, print.level = 0) })["elapsed"] t_n <- system.time({ fit_n <- krls(tr$X, tr$y, approx = "nystrom", derivative = FALSE, print.level = 0) })["elapsed"] pred_e <- predict(fit_e, newdata = te$X)$fit pred_n <- predict(fit_n, newdata = te$X)$fit res[[i]] <- data.frame( n = n, m = m, time_exact = round(as.numeric(t_e), 2), time_nys = round(as.numeric(t_n), 3), speedup = round(as.numeric(t_e) / as.numeric(t_n), 1), rmse_exact = round(sqrt(mean((pred_e - te$truth)^2)), 4), rmse_nys = round(sqrt(mean((pred_n - te$truth)^2)), 4) ) } do.call(rbind, res) ``` The Nystrom path stays in milliseconds while exact KRLS grows cubically. Prediction RMSE on held-out data is comparable -- Nystrom trades a small amount of approximation error for a large amount of runtime. ## Scaling beyond what exact KRLS can do Past $n \approx 2{,}000$ the exact path becomes uncomfortable on a typical laptop. Nystrom keeps going: ```{r big} n <- 2000 tr <- make_data(n) te <- make_data(200) # Default is min(500, n) -- here 500 landmarks at n = 2000. m <- min(500L, n) t_n <- system.time({ fit_big <- krls(tr$X, tr$y, approx = "nystrom", derivative = FALSE, print.level = 0) })["elapsed"] pred_big <- predict(fit_big, newdata = te$X)$fit sprintf("n = %d, m = %d, time = %.3fs, R2 = %.3f, RMSE = %.3f", n, m, t_n, fit_big$R2, sqrt(mean((pred_big - te$truth)^2))) ``` The fit still recovers the truth at held-out RMSE comparable to the exact path on the smaller datasets, in a fraction of a second. ## Reusing landmarks across fits When running sensitivity checks -- refitting on the same predictors with several outcome variants, or perturbing $y$ -- it is convenient to fix the landmark set so the approximation geometry is held constant. The fitted object stores landmarks in standardized $X$-space; to pass them back through `krls()` use `get_landmarks()`, which un-standardizes to the original units the function expects: ```{r reuse} tr <- make_data(400) fit <- krls(tr$X, tr$y, approx = "nystrom", nystrom_m = 30, derivative = FALSE, print.level = 0) Z <- get_landmarks(fit) # original X-scale matrix y_perturbed <- tr$y + rnorm(400, sd = 0.05) fit2 <- krls(tr$X, y_perturbed, approx = "nystrom", landmarks = Z, derivative = FALSE, print.level = 0) # Same landmarks under the hood identical(unname(fit$landmarks), unname(fit2$landmarks)) ``` The landmark coordinates are bit-identical, so any difference between the two fits is attributable to the change in $y$ alone -- not to re-selecting anchor points. ## Choosing the lambda criterion KRLS picks the regularization parameter by minimizing a closed-form objective. Two are available: * `lambda_method = "loo"` (the default) -- leave-one-out squared error. * `lambda_method = "gcv"` -- generalized cross-validation, $RSS(\lambda) / (1 - \mathrm{tr}(S(\lambda))/n)^2$. For the Nystrom path both are evaluated in $O(nm)$ per candidate $\lambda$ from the cached SVD, so the choice is essentially free. ```{r gcv} tr <- make_data(500) fit_loo <- krls(tr$X, tr$y, approx = "nystrom", nystrom_m = 25, derivative = FALSE, lambda_method = "loo", print.level = 0) fit_gcv <- krls(tr$X, tr$y, approx = "nystrom", nystrom_m = 25, derivative = FALSE, lambda_method = "gcv", print.level = 0) data.frame( method = c("loo", "gcv"), lambda = c(fit_loo$lambda, fit_gcv$lambda), R2 = c(fit_loo$R2, fit_gcv$R2) ) ``` LOO and GCV typically pick lambdas within a small multiplicative factor of each other. GCV is sometimes preferred when one or two observations have disproportionate leverage; LOO has more historical use in KRLS applications. ## What carries over from the exact path A Nystrom fit is still a `krls` object, so the standard accessors work: * `predict(fit, newdata = ...)` -- predictions; pass `se.fit = TRUE` for conditional approximate standard errors when the fit was built with `vcov = TRUE`. * `summary(fit)` -- now prints an *Approximation* block reporting `m`, the landmark method, inference type, the landmark-kernel condition number, and the count of eigenvalues at the relative-ridge floor. * `tidy(fit)`, `glance(fit)`, `augment(fit)` -- broom methods are Nystrom-aware. `glance()` reports `approx`, `nystrom_m`, `inference`, and the effective degrees of freedom. ## Limitations * Standard errors are *conditional* on the selected landmarks, fixed $\lambda$, and the low-rank feature approximation. They are not equivalent to exact KRLS standard errors and should not be used in contexts that assume exact inference. * Only the Gaussian kernel is supported under `approx = "nystrom"`. * `kmeans` landmark selection is approximately reproducible under `set.seed()` but not bit-stable across R versions; for strict reproducibility use the default random sampling or supply landmarks explicitly.