--- title: "Introduction to misl" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to misl} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE, warning = FALSE, message = FALSE ) library(misl) ``` ## Overview `misl` implements **Multiple Imputation by Super Learning**, a flexible approach to handling missing data described in: > Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super > learning. *Statistical Methods in Medical Research*. 31(10):1904--1915. > doi: [10.1177/09622802221104238](https://doi.org/10.1177/09622802221104238) Rather than relying on a single parametric imputation model, `misl` builds a stacked ensemble of machine learning algorithms for each incomplete column, producing well-calibrated imputations for continuous, binary, categorical, and ordered categorical variables. ## Installation ```{r install, eval = FALSE} # Install from GitHub remotes::install_github("JustinManjourides/misl") # Optional backend packages for additional learners install.packages(c("ranger", "xgboost", "earth", "MASS")) ``` ## Simulated data We simulate a small dataset with four types of incomplete variables to demonstrate `misl` across all supported outcome types. 
```{r simulate}
library(misl)
set.seed(42)

n <- 300

# Continuous predictors (always observed)
age <- round(rnorm(n, mean = 50, sd = 12))
bmi <- round(rnorm(n, mean = 26, sd = 4), 1)

sim_data <- data.frame(
  age = age,
  bmi = bmi,

  # Continuous outcome with missingness, dependent on age and bmi
  sbp = round(120 + 0.4 * age + 0.3 * bmi + rnorm(n, sd = 10)),

  # Binary outcome with missingness (0 = no, 1 = yes)
  smoker = rbinom(n, 1, prob = 0.3),

  # Unordered categorical outcome with missingness
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE,
                        prob = c(0.4, 0.35, 0.25))),

  # Ordered categorical outcome with missingness
  health = factor(sample(c("Poor", "Fair", "Good", "Excellent"), n,
                         replace = TRUE, prob = c(0.1, 0.2, 0.5, 0.2)),
                  levels = c("Poor", "Fair", "Good", "Excellent"),
                  ordered = TRUE)
)

# Introduce missing values
sim_data[sample(n, 40), "sbp"] <- NA
sim_data[sample(n, 30), "smoker"] <- NA
sim_data[sample(n, 30), "group"] <- NA
sim_data[sample(n, 30), "health"] <- NA

# Summarise missingness
sapply(sim_data, function(x) sum(is.na(x)))
```

## Built-in learners

Use `list_learners()` to see the available named learners, optionally filtered by outcome type:

```{r list-learners}
knitr::kable(list_learners())
```

```{r list-learners-ordinal}
knitr::kable(list_learners("ordinal"))
```

## Basic usage

The primary function is `misl()`. Supply a dataset and specify:

- `m` -- the number of multiply imputed datasets to create
- `maxit` -- the number of Gibbs sampling iterations per dataset
- `con_method`, `bin_method`, `cat_method`, `ord_method` -- learners for each outcome type

Running `misl()` with a single named learner per outcome type is the fastest option and is well suited for exploratory work.
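Before imputing, it can help to confirm which of the four `*_method` arguments each column will fall under. The check below is a plain base-R sketch for orientation: the `column_type()` helper is defined here for illustration (it is not part of `misl`, and mirrors common conventions rather than the package's internal detection logic):

```{r column-types}
# Classify each column by the outcome type its learner argument would cover:
# ordered factors -> ordinal, plain factors -> categorical,
# 0/1 numerics -> binary, everything else numeric -> continuous
column_type <- function(x) {
  if (is.ordered(x)) return("ordinal")
  if (is.factor(x)) return("categorical")
  if (is.numeric(x) && all(x %in% c(0, 1, NA))) return("binary")
  "continuous"
}
sapply(sim_data, column_type)
```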
Note that ordered factors are automatically detected and routed to `ord_method`: ```{r basic-usage, eval = FALSE} misl_imp <- misl( sim_data, m = 5, maxit = 5, con_method = "glm", bin_method = "glm", cat_method = "multinom_reg", ord_method = "polr", seed = 42, quiet = TRUE ) ``` `misl()` returns a list of length `m`. Each element contains: - `$datasets` -- the fully imputed tibble - `$trace` -- a long-format tibble of convergence statistics ```{r inspect-output, eval = FALSE} # Number of imputed datasets length(misl_imp) # Confirm no missing values remain anyNA(misl_imp[[1]]$datasets) # Confirm ordered factor is preserved is.ordered(misl_imp[[1]]$datasets$health) levels(misl_imp[[1]]$datasets$health) ``` ## Custom learners via parsnip In addition to the built-in named learners, `misl` v2.0 accepts any parsnip-compatible model spec directly. This allows you to use any learner available in the tidymodels ecosystem without waiting for it to be added to the built-in registry. ### Passing a custom spec Simply pass a parsnip model spec in place of (or alongside) a named string. `misl` will automatically enforce the correct mode (regression vs classification) regardless of what is set on the spec: ```{r custom-spec, eval = FALSE} library(parsnip) # A random forest with custom hyperparameters custom_rf <- rand_forest(trees = 500, mtry = 3) |> set_engine("ranger") misl_custom <- misl( sim_data, m = 5, maxit = 5, con_method = list(custom_rf), bin_method = list(custom_rf), cat_method = "multinom_reg", ord_method = "polr", seed = 42, quiet = TRUE ) ``` ### Mixing named learners and custom specs Named strings and parsnip specs can be freely mixed in the same list. 
When multiple learners are supplied, `misl` uses cross-validation to build a stacked ensemble: ```{r mixed-learners, eval = FALSE} library(parsnip) # Mix a named learner with a custom tuned spec custom_xgb <- boost_tree(trees = 200, learn_rate = 0.05) |> set_engine("xgboost") misl_mixed <- misl( sim_data, m = 5, maxit = 5, con_method = list("glm", custom_xgb), bin_method = list("glm", custom_xgb), cat_method = list("multinom_reg", "rand_forest"), ord_method = list("polr", "rand_forest"), cv_folds = 3, seed = 42 ) ``` ### Using a learner not in the built-in registry Any parsnip-supported engine can be used. For example, a support vector machine via the `kernlab` package: ```{r svm-learner, eval = FALSE} library(parsnip) # SVM - not in the built-in registry but works via parsnip svm_spec <- svm_rbf() |> set_engine("kernlab") misl_svm <- misl( sim_data, m = 5, maxit = 5, con_method = list("glm", svm_spec), bin_method = list("glm", svm_spec), cat_method = "multinom_reg", ord_method = "polr", cv_folds = 3, seed = 42 ) ``` ### Ordinal outcomes and the polr learner For ordered categorical variables, `misl` automatically detects ordered factors and routes them to `ord_method`. 
The default learner is `"polr"` (proportional odds logistic regression from the `MASS` package), which respects the ordering of the levels: ```{r ordinal-example, eval = FALSE} # Ensure your ordered variable is an ordered factor sim_data$health <- factor(sim_data$health, levels = c("Poor", "Fair", "Good", "Excellent"), ordered = TRUE ) misl_ordinal <- misl( sim_data, m = 5, maxit = 5, con_method = "glm", bin_method = "glm", cat_method = "multinom_reg", ord_method = "polr", # proportional odds model for ordered categories seed = 42, quiet = TRUE ) # Imputed values respect the ordering is.ordered(misl_ordinal[[1]]$datasets$health) levels(misl_ordinal[[1]]$datasets$health) ``` ## Multiple learners and stacking When multiple learners are supplied (whether named strings, parsnip specs, or a mix), `misl` uses cross-validation to learn optimal ensemble weights via the `stacks` package. Use `cv_folds` to reduce the number of folds and speed up computation: ```{r multi-learner, eval = FALSE} misl_stack <- misl( sim_data, m = 5, maxit = 5, con_method = c("glm", "rand_forest"), bin_method = c("glm", "rand_forest"), cat_method = c("multinom_reg", "rand_forest"), ord_method = c("polr", "rand_forest"), cv_folds = 3, seed = 42 ) ``` ## Analysing the imputed datasets After imputation, fit your analysis model to each of the `m` datasets and pool the results using Rubin's rules. 
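With $\hat{Q}_j$ the point estimate and $U_j$ its squared standard error from the $j$-th of $m$ imputed datasets, Rubin's rules pool as:

$$
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad
\bar{U} = \frac{1}{m}\sum_{j=1}^{m}U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^2,
$$

$$
T = \bar{U} + \left(1 + \frac{1}{m}\right)B,
$$

where $\bar{Q}$ is the pooled estimate and $\sqrt{T}$ its standard error.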
Here we implement pooling manually using base R:

```{r pool-results, eval = FALSE}
# Fit a linear model to each imputed dataset
models <- lapply(misl_imp, function(imp) {
  lm(sbp ~ age + bmi + smoker + group + health, data = imp$datasets)
})

# Pool point estimates and standard errors using Rubin's rules
m <- length(models)
ests <- sapply(models, function(fit) coef(fit))
vars <- sapply(models, function(fit) diag(vcov(fit)))

Q_bar <- rowMeans(ests)             # pooled estimate
U_bar <- rowMeans(vars)             # within-imputation variance
B <- apply(ests, 1, var)            # between-imputation variance
T_total <- U_bar + (1 + 1 / m) * B  # total variance

pooled <- data.frame(
  term = names(Q_bar),
  estimate = round(Q_bar, 4),
  std.error = round(sqrt(T_total), 4),
  conf.low = round(Q_bar - 1.96 * sqrt(T_total), 4),
  conf.high = round(Q_bar + 1.96 * sqrt(T_total), 4)
)
print(pooled)
```

For a full implementation of Rubin's rules, including degrees of freedom and p-values, the `mice` package's `pool()` can be applied directly to the list of fitted models:

```{r pool-mice, eval = FALSE}
library(mice)
pooled_mice <- summary(pool(models), conf.int = TRUE)
```

## Convergence: trace plots

The `plot_misl_trace()` function plots the mean or standard deviation of imputed values across iterations for a given variable, with one line per imputed dataset. Stable traces that mix well across datasets indicate convergence. Note that trace statistics are only computed for continuous and numeric binary columns — they are not available for categorical or ordinal columns.

```{r trace-plot, eval = FALSE}
# Plot mean of imputed sbp values across iterations for each dataset
plot_misl_trace(misl_imp, variable = "sbp",
                ylab = "Mean imputed sbp (mm Hg)")
```

```{r trace-plot-sd, eval = FALSE}
# Plot the standard deviation instead
plot_misl_trace(misl_imp, variable = "sbp", statistic = "sd")
```
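As an aside, the Rubin's rules arithmetic used in the pooling section is easy to sanity-check by hand. The chunk below is a self-contained toy example with made-up numbers for a single coefficient, independent of any `misl` output:

```{r pool-check}
# Toy check of the pooling arithmetic on three hand-picked estimates
q <- c(1.10, 0.95, 1.05)     # per-dataset point estimates of one coefficient
u <- c(0.040, 0.050, 0.045)  # per-dataset variances (squared standard errors)
m <- length(q)

Q_bar   <- mean(q)                    # pooled estimate
B       <- var(q)                     # between-imputation variance
T_total <- mean(u) + (1 + 1 / m) * B  # total variance

# T_total always exceeds the average within-imputation variance:
# the (1 + 1/m) * B penalty reflects missing-data uncertainty
c(estimate = Q_bar, std.error = sqrt(T_total))
```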
## Parallelism The `m` imputed datasets are generated independently and can be run in parallel using the `future` framework. Set a parallel plan before calling `misl()`: ```{r parallel, eval = FALSE} library(future) # Use all available cores plan(multisession) misl_parallel <- misl( sim_data, m = 10, maxit = 5, con_method = c("glm", "rand_forest"), bin_method = c("glm", "rand_forest"), cat_method = c("multinom_reg", "rand_forest"), ord_method = c("polr", "rand_forest"), seed = 42 ) # Always reset the plan when done plan(sequential) ``` To limit the number of cores: ```{r parallel-limited, eval = FALSE} plan(multisession, workers = 4) ``` The largest speedup comes from running the `m` datasets simultaneously. Check how many cores are available with: ```{r detect-cores, eval = FALSE} parallel::detectCores() ``` ## Session info ```{r session-info} sessionInfo() ```