--- title: "Introduction to misl" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to misl} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE, warning = FALSE, message = FALSE ) library(misl) ``` ## Overview `misl` implements **Multiple Imputation by Super Learning**, a flexible approach to handling missing data described in: > Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super > learning. *Statistical Methods in Medical Research*. 31(10):1904--1915. > doi: [10.1177/09622802221104238](https://doi.org/10.1177/09622802221104238) Rather than relying on a single parametric imputation model, `misl` builds a stacked ensemble of machine learning algorithms for each incomplete column, producing well-calibrated imputations for continuous, binary, categorical, and ordered categorical variables. ## Installation ```{r install, eval = FALSE} # Install from GitHub remotes::install_github("JustinManjourides/misl") # Optional backend packages for additional learners install.packages(c("ranger", "xgboost", "earth", "MASS")) ``` ## Simulated data We simulate a small dataset with four types of incomplete variables to demonstrate `misl` across all supported outcome types. 
```{r simulate}
library(misl)
set.seed(42)

n <- 300

# Continuous predictors (always observed)
age <- round(rnorm(n, mean = 50, sd = 12))
bmi <- round(rnorm(n, mean = 26, sd = 4), 1)

sim_data <- data.frame(
  age = age,
  bmi = bmi,

  # Continuous outcome with missingness, dependent on age and bmi
  sbp = round(120 + 0.4 * age + 0.3 * bmi + rnorm(n, sd = 10)),

  # Binary outcome with missingness (0 = no, 1 = yes)
  smoker = rbinom(n, 1, prob = 0.3),

  # Unordered categorical outcome with missingness
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE,
                        prob = c(0.4, 0.35, 0.25))),

  # Ordered categorical outcome with missingness
  health = factor(sample(c("Poor", "Fair", "Good", "Excellent"), n,
                         replace = TRUE, prob = c(0.1, 0.2, 0.5, 0.2)),
                  levels = c("Poor", "Fair", "Good", "Excellent"),
                  ordered = TRUE)
)

# Introduce missing values
sim_data[sample(n, 40), "sbp"] <- NA
sim_data[sample(n, 30), "smoker"] <- NA
sim_data[sample(n, 30), "group"] <- NA
sim_data[sample(n, 30), "health"] <- NA

# Summarise missingness
sapply(sim_data, function(x) sum(is.na(x)))
```

## Built-in learners

Use `list_learners()` to see the available named learners, optionally filtered by outcome type:

```{r list-learners}
knitr::kable(list_learners())
```

```{r list-learners-ordinal}
knitr::kable(list_learners("ordinal"))
```

## Basic usage

The primary function is `misl()`. Supply a dataset and specify:

- `m` -- the number of multiply imputed datasets to create
- `maxit` -- the number of Gibbs sampling iterations per dataset
- `con_method`, `bin_method`, `cat_method`, `ord_method` -- learners for each outcome type

Running `misl()` with a single named learner per outcome type is the fastest option and is well suited for exploratory work.
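Before imputing, it can help to confirm which of the four `*_method` arguments each column will fall under. The check below is a plain base-R sketch for orientation: the `column_type()` helper is defined here for illustration (it is not part of `misl`, and mirrors common conventions rather than the package's internal detection logic):

```{r column-types}
# Classify each column by the outcome type its learner argument would cover:
# ordered factors -> ordinal, plain factors -> categorical,
# 0/1 numerics -> binary, everything else numeric -> continuous
column_type <- function(x) {
  if (is.ordered(x)) return("ordinal")
  if (is.factor(x)) return("categorical")
  if (is.numeric(x) && all(x %in% c(0, 1, NA))) return("binary")
  "continuous"
}
sapply(sim_data, column_type)
```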
Note that ordered factors are automatically detected and routed to `ord_method`: ```{r basic-usage, eval = FALSE} misl_imp <- misl( sim_data, m = 5, maxit = 5, con_method = "glm", bin_method = "glm", cat_method = "multinom_reg", ord_method = "polr", seed = 42, quiet = TRUE ) ``` `misl()` returns a list of length `m`. Each element contains: - `$datasets` -- the fully imputed tibble - `$trace` -- a long-format tibble of convergence statistics ```{r inspect-output, eval = FALSE} # Number of imputed datasets length(misl_imp) # Confirm no missing values remain anyNA(misl_imp[[1]]$datasets) # Confirm ordered factor is preserved is.ordered(misl_imp[[1]]$datasets$health) levels(misl_imp[[1]]$datasets$health) ``` ## Custom learners via parsnip In addition to the built-in named learners, `misl` v2.0 accepts any parsnip-compatible model spec directly. This allows you to use any learner available in the tidymodels ecosystem without waiting for it to be added to the built-in registry. ### Passing a custom spec Simply pass a parsnip model spec in place of (or alongside) a named string. `misl` will automatically enforce the correct mode (regression vs classification) regardless of what is set on the spec: ```{r custom-spec, eval = FALSE} library(parsnip) # A random forest with custom hyperparameters custom_rf <- rand_forest(trees = 500, mtry = 3) |> set_engine("ranger") misl_custom <- misl( sim_data, m = 5, maxit = 5, con_method = list(custom_rf), bin_method = list(custom_rf), cat_method = "multinom_reg", ord_method = "polr", seed = 42, quiet = TRUE ) ``` ### Mixing named learners and custom specs Named strings and parsnip specs can be freely mixed in the same list. 
When multiple learners are supplied, `misl` uses cross-validation to build a stacked ensemble: ```{r mixed-learners, eval = FALSE} library(parsnip) # Mix a named learner with a custom tuned spec custom_xgb <- boost_tree(trees = 200, learn_rate = 0.05) |> set_engine("xgboost") misl_mixed <- misl( sim_data, m = 5, maxit = 5, con_method = list("glm", custom_xgb), bin_method = list("glm", custom_xgb), cat_method = list("multinom_reg", "rand_forest"), ord_method = list("polr", "rand_forest"), cv_folds = 3, seed = 42 ) ``` ### Using a learner not in the built-in registry Any parsnip-supported engine can be used. For example, a support vector machine via the `kernlab` package: ```{r svm-learner, eval = FALSE} library(parsnip) # SVM - not in the built-in registry but works via parsnip svm_spec <- svm_rbf() |> set_engine("kernlab") misl_svm <- misl( sim_data, m = 5, maxit = 5, con_method = list("glm", svm_spec), bin_method = list("glm", svm_spec), cat_method = "multinom_reg", ord_method = "polr", cv_folds = 3, seed = 42 ) ``` ### Ordinal outcomes and the polr learner For ordered categorical variables, `misl` automatically detects ordered factors and routes them to `ord_method`. 
The default learner is `"polr"` (proportional odds logistic regression from the `MASS` package), which respects the ordering of the levels: ```{r ordinal-example, eval = FALSE} # Ensure your ordered variable is an ordered factor sim_data$health <- factor(sim_data$health, levels = c("Poor", "Fair", "Good", "Excellent"), ordered = TRUE ) misl_ordinal <- misl( sim_data, m = 5, maxit = 5, con_method = "glm", bin_method = "glm", cat_method = "multinom_reg", ord_method = "polr", # proportional odds model for ordered categories seed = 42, quiet = TRUE ) # Imputed values respect the ordering is.ordered(misl_ordinal[[1]]$datasets$health) levels(misl_ordinal[[1]]$datasets$health) ``` ## Multiple learners and stacking When multiple learners are supplied (whether named strings, parsnip specs, or a mix), `misl` uses cross-validation to learn optimal ensemble weights via the `stacks` package. Use `cv_folds` to reduce the number of folds and speed up computation: ```{r multi-learner, eval = FALSE} misl_stack <- misl( sim_data, m = 5, maxit = 5, con_method = c("glm", "rand_forest"), bin_method = c("glm", "rand_forest"), cat_method = c("multinom_reg", "rand_forest"), ord_method = c("polr", "rand_forest"), cv_folds = 3, seed = 42 ) ``` ## Analysing the imputed datasets After imputation, fit your analysis model to each of the `m` datasets and pool the results using Rubin's rules. 
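With $\hat{Q}_j$ the point estimate and $U_j$ its squared standard error from the $j$-th of $m$ imputed datasets, Rubin's rules pool as:

$$
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad
\bar{U} = \frac{1}{m}\sum_{j=1}^{m}U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^2,
$$

$$
T = \bar{U} + \left(1 + \frac{1}{m}\right)B,
$$

where $\bar{Q}$ is the pooled estimate and $\sqrt{T}$ its standard error.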
Here we implement pooling manually using base R:

```{r pool-results, eval = FALSE}
# Fit a linear model to each imputed dataset
models <- lapply(misl_imp, function(imp) {
  lm(sbp ~ age + bmi + smoker + group + health, data = imp$datasets)
})

# Pool point estimates and standard errors using Rubin's rules
m <- length(models)
ests <- sapply(models, function(fit) coef(fit))
vars <- sapply(models, function(fit) diag(vcov(fit)))

Q_bar <- rowMeans(ests)             # pooled estimate
U_bar <- rowMeans(vars)             # within-imputation variance
B <- apply(ests, 1, var)            # between-imputation variance
T_total <- U_bar + (1 + 1 / m) * B  # total variance

pooled <- data.frame(
  term = names(Q_bar),
  estimate = round(Q_bar, 4),
  std.error = round(sqrt(T_total), 4),
  conf.low = round(Q_bar - 1.96 * sqrt(T_total), 4),
  conf.high = round(Q_bar + 1.96 * sqrt(T_total), 4)
)
print(pooled)
```

For a full implementation of Rubin's rules, including degrees of freedom and p-values, the `mice` package's `pool()` can be applied directly to the list of fitted models:

```{r pool-mice, eval = FALSE}
library(mice)
pooled_mice <- summary(pool(models), conf.int = TRUE)
```

## Convergence: trace plots

The `plot_misl_trace()` function plots the mean or standard deviation of imputed values across iterations for a given variable, with one line per imputed dataset. Stable traces that mix well across datasets indicate convergence. Note that trace statistics are only computed for continuous and numeric binary columns — they are not available for categorical or ordinal columns.

```{r trace-plot, eval = FALSE}
# Plot mean of imputed sbp values across iterations for each dataset
plot_misl_trace(misl_imp, variable = "sbp",
                ylab = "Mean imputed sbp (mm Hg)")
```

```{r trace-plot-sd, eval = FALSE}
# Plot the standard deviation instead
plot_misl_trace(misl_imp, variable = "sbp", statistic = "sd")
```
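As an aside, the Rubin's rules arithmetic used in the pooling section is easy to sanity-check by hand. The chunk below is a self-contained toy example with made-up numbers for a single coefficient, independent of any `misl` output:

```{r pool-check}
# Toy check of the pooling arithmetic on three hand-picked estimates
q <- c(1.10, 0.95, 1.05)     # per-dataset point estimates of one coefficient
u <- c(0.040, 0.050, 0.045)  # per-dataset variances (squared standard errors)
m <- length(q)

Q_bar   <- mean(q)                    # pooled estimate
B       <- var(q)                     # between-imputation variance
T_total <- mean(u) + (1 + 1 / m) * B  # total variance

# T_total always exceeds the average within-imputation variance:
# the (1 + 1/m) * B penalty reflects missing-data uncertainty
c(estimate = Q_bar, std.error = sqrt(T_total))
```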
## Parallelism The `m` imputed datasets are generated independently and can be run in parallel using the `future` framework. Set a parallel plan before calling `misl()`: ```{r parallel, eval = FALSE} library(future) # Use all available cores plan(multisession) misl_parallel <- misl( sim_data, m = 10, maxit = 5, con_method = c("glm", "rand_forest"), bin_method = c("glm", "rand_forest"), cat_method = c("multinom_reg", "rand_forest"), ord_method = c("polr", "rand_forest"), seed = 42 ) # Always reset the plan when done plan(sequential) ``` To limit the number of cores: ```{r parallel-limited, eval = FALSE} plan(multisession, workers = 4) ``` The largest speedup comes from running the `m` datasets simultaneously. Check how many cores are available with: ```{r detect-cores, eval = FALSE} parallel::detectCores() ``` ## Session info ```{r session-info} sessionInfo() ```