---
title: "Building augmented data for multi-state models"
subtitle: "The **msmtools** workflow"
author: |
  | Francesco Grossetti
  | francesco.grossetti@unibocconi.it
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    number_sections: yes
    toc: yes
    toc_depth: 3
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Building augmented data for multi-state models}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown_notangle}
---

```{r setup, message = FALSE}
library(msmtools)
library(data.table)
library(msm)
```

# Overview

**msmtools** prepares longitudinal data for multi-state models fitted with **msm** [@jackson2011multi; @msm_cran]. The package exposes four public functions:

* `augment()` builds transition-level data from repeated observations;
* `polish()` removes subjects with incompatible transitions at the same time;
* `survplot()` compares fitted and empirical survival curves;
* `prevplot()` compares observed and expected state prevalences.

The examples below use the bundled `hosp` dataset. It contains synthetic hospital admissions for 10 subjects.

```{r data-preview}
data(hosp)
hosp[1:6, .(subj, adm_number, gender, age, label_3, dateIN, dateOUT, dateCENS)]
```

# Data Augmentation

`augment()` adds one row per transition endpoint and creates status variables that can be used directly in an **msm** model.

```{r augment}
hosp_augmented <- augment(
  data = copy(hosp),
  data_key = subj,
  n_events = adm_number,
  pattern = label_3,
  t_start = dateIN,
  t_end = dateOUT,
  t_cens = dateCENS
)

hosp_augmented[
  1:8,
  .(subj, adm_number, label_3, augmented, augmented_int, status, status_num)
]
```

When the input time columns are `Date` values, `augment()` keeps the date-valued transition time and adds an integer version. This is useful because **msm** works with numeric time scales.

```{r augmented-columns}
names(hosp_augmented)
```

# Outcome Schema And Generated States

`pattern` and `state` describe different parts of the augmentation.
`pattern` is the terminal outcome schema observed in the input data. It can have two values, alive and dead, or three values, alive, dead during a transition, and dead after a transition. `state` is the generated transition-state vocabulary. It must always contain three labels: the state at `t_start`, the state at `t_end`, and the absorbing state. This is why a two-value `pattern` still needs three `state` labels: `augment()` uses the event times to infer whether death maps to the absorbing state inside or outside the transition window.

By default, `augment()` uses `copy = FALSE` and follows **data.table** by-reference semantics. This avoids unnecessary memory use on large longitudinal datasets, but the input object can have its key changed and `n_events` can be created when the argument is omitted. Use `copy = TRUE` when the original input must remain unchanged.

# Duplicate Transition Cleanup

`polish()` removes entire subjects when different transitions occur at the same time. The bundled data do not contain such conflicts, so this call leaves the data unchanged. It also uses `copy = FALSE` by default; set `copy = TRUE` when the original augmented data should not be keyed or otherwise touched by reference.

```{r polish}
hosp_clean <- polish(
  data = copy(hosp_augmented),
  data_key = subj,
  pattern = label_3
)

nrow(hosp_augmented)
nrow(hosp_clean)
```

# Survival Plot

The plotting helpers work on fitted **msm** objects. This example uses a compact three-state transition matrix matching the default `augment()` state labels.

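
As a rough sketch of what a three-state matrix of this kind encodes, the standalone chunk below builds an indicator matrix in base R and lists the moves it permits. The names `states`, `allowed`, `off_diag`, and `transitions` are illustrative and not part of **msmtools**. The sketch assumes the labels `IN`, `OUT`, and `DEAD`, and relies on the documented behaviour of `msm::msm()` that diagonal entries of `qmatrix` are ignored.

```{r qmatrix-sketch}
# Illustrative only: none of these objects belong to msmtools or msm.
states <- c("IN", "OUT", "DEAD")

# 1 marks a permitted transition, 0 a forbidden one.
allowed <- matrix(
  c(1, 1, 1,   # from IN
    1, 1, 1,   # from OUT
    0, 0, 0),  # from DEAD: absorbing, so no outgoing transitions
  nrow = 3,
  byrow = TRUE,
  dimnames = list(states, states)
)

# Only off-diagonal entries describe actual moves between states.
off_diag <- allowed
diag(off_diag) <- 0

# Row index = origin state, column index = destination state.
idx <- which(off_diag == 1, arr.ind = TRUE)
transitions <- paste(states[idx[, 1]], "->", states[idx[, 2]])
transitions
```

Reading the matrix row by row, a subject can move between `IN` and `OUT` in either direction and can reach `DEAD` from both, while the all-zero `DEAD` row makes that state absorbing.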
```{r fit-model}
Qmat <- matrix(0, nrow = 3, ncol = 3, byrow = TRUE)
Qmat[1, 1:3] <- 1
Qmat[2, 1:3] <- 1
colnames(Qmat) <- c("IN", "OUT", "DEAD")
rownames(Qmat) <- c("IN", "OUT", "DEAD")

msm_model <- msm(
  status_num ~ augmented_int,
  subject = subj,
  data = hosp_augmented,
  exacttimes = TRUE,
  gen.inits = TRUE,
  qmatrix = Qmat,
  method = "BFGS",
  control = list(fnscale = 6e+05, trace = 0, REPORT = 1, maxit = 10000)
)
```

```{r survival-plot, fig.width = 7, fig.height = 4}
surv_p <- survplot(msm_model, km = TRUE, grid = 10)
surv_p
```

The fitted and Kaplan-Meier data tables are attached to the plot as named fields, accessible with the standard `$` operator:

```{r survival-data}
surv_p$fitted[1:6]
surv_p$km[1:6]
```

# Prevalence Plot

`prevplot()` uses the output of `msm::prevalence.msm()` and returns a `ggplot` object.

```{r prevalence-plot, fig.width = 7, fig.height = 4}
prev <- prevalence.msm(
  msm_model,
  covariates = "mean",
  ci = "normal",
  times = seq(
    min(hosp_augmented$augmented_int),
    max(hosp_augmented$augmented_int),
    length.out = 6
  )
)

prev_p <- prevplot(msm_model, prev, ci = TRUE, M = FALSE)
prev_p
```

The long-format prevalence data used to build the plot is attached as `$prevalence`:

```{r prevalence-data}
prev_p$prevalence[1:6]
```

# Notes

The current 2.x series keeps the public API stable while modernizing dependencies, documentation, tests, and CI. Larger internal changes to `augment()` are intentionally deferred until after the maintenance releases.