---
title: "Getting Started with surveycore"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
bibliography: references.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{Getting Started with surveycore}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
library(surveycore)
knitr::opts_chunk$set(
  comment = "#>",
  eval = requireNamespace("surveytidy", quietly = TRUE)
)
```

`surveycore` is a tidyverse-compatible alternative to `survey` and
`srvyr`, covering the full workflow for survey data analysis. The variance
estimation code for probability samples is vendored from `survey` and tested 
against it, so estimates using `as_survey()`, `as_survey_replicate()`, and 
`as_survey_twophase()` match the reference implementation. This vignette 
covers two things:

1. Creating survey objects

2. Conducting analysis

---

## Creating the survey object

The first step when conducting survey analysis is creating the right survey object
where we specify the sampling design, weights, and whatever other information is
needed. Without this information, point estimates may be biased and standard errors 
are almost certainly wrong [@lumley2010; @lohr2022].

Fortunately, we don't have to worry about that — that's what survey objects are
for. They give the analysis functions everything they need to correctly account
for variance and bias from the sampling design.

`surveycore` has four different survey object constructors:

1. `as_survey()`

2. `as_survey_replicate()`

3. `as_survey_nonprob()`

4. `as_survey_twophase()`

Rather than going into detail on each constructor, here is a quick overview of
each. For more information on the different constructors visit 
`vignette("creating-survey-objects")`.

### `as_survey()` 

Use `as_survey()` for two types of designs:

1. A stratified or clustered sample

2. A simple random sample (SRS)

If you know your data is a stratified/clustered sample or your data
comes with variables identifying the cluster IDs or strata, use this function.
All datasets used in this vignette are bundled with `surveycore`. In this first
example, we'll use the General Social Survey, which has variables for clustering,
strata, and design weights.

``` {r as_survey}
gss_svy <- as_survey(
  gss_2024,
  # the cluster ids
  ids = vpsu,
  # the strata
  strata = vstrat,
  # the weights
  weights = wtssps,
  nest = TRUE
)

gss_svy
```

Each survey object has a print method that shows the first 10 rows of the
data, similar to a tibble, but also includes a brief description of the
survey design.

If your data doesn't have strata or clusters, but each respondent had equal 
probability of being sampled (a simple random sample), then you still want
to use this function. However, unlike before, you leave `strata` and `ids`
NULL since you don't have any. A good example of this is the 2000 
California API survey.

```{r srs}
ca_api_2000_svy <- as_survey(
  ca_api_2000,
  weights = pw,
  fpc = fpc # reduces SEs
)

ca_api_2000_svy
```

### `as_survey_replicate()`

Use this when the data you have is from a probability sample and has replicate 
weight columns like `repwt_1`, `repwt_2`. For example, Pew's Jewish American 
study from 2020 uses replicate weights.


```{r replicate}
pew_jewish_svy <- as_survey_replicate(
  pew_jewish_2020,
  weights = extweight,
  repweights = extweight1:extweight100,
  type = "JK2"
)

pew_jewish_svy
```


### `as_survey_nonprob()`

Use this if your data comes from a non-probability sample (e.g., via an opt-in
panel like Qualtrics Panels, Cint/Lucid, Dynata, etc.) and
has weights (e.g., calibration weights, inverse-probability weights, etc.).
To illustrate we'll use Wave 1 from the Nationscape dataset.

```{r calibrated}
ns_wave1_svy <- as_survey_nonprob(ns_wave1, weights = weight)

ns_wave1_svy
```

### `as_survey_twophase()`

Two-phase sampling involves collecting a large initial sample, then sampling a
subset of those respondents as a follow-up. This is not a very common survey
method, but common examples include case-cohort studies, medical validation
studies, or surveys with a screening phase. If your data is a two-phase sample,
use `as_survey_twophase()`.

We will use the nwtco data from the `survival` package.

```{r nwtco, eval=requireNamespace("survival", quietly=TRUE)}
nwtco <- survival::nwtco

# in.subcohort is stored as 0/1 — must be logical for as_survey_twophase()
nwtco$in.subcohort <- as.logical(nwtco$in.subcohort)

# Phase 1: all 4,028 enrolled patients (each patient is their own unit)
phase1 <- as_survey(nwtco, ids = seqno)

# Phase 2: subcohort, with Phase 2 sampling stratified by relapse status
nwtco_svy <- as_survey_twophase(
  phase1,
  strata2 = rel, # Phase 2 strata: cases (rel=1) vs. non-cases (rel=0)
  subset = in.subcohort, # Logical column: TRUE = selected into Phase 2
  method = "full"
)

nwtco_svy
```


---

## Analysis functions

In addition to creating survey objects, `surveycore` has several 
functions designed to make analysis easier: 

- `get_freqs()`

- `get_means()`

- `get_totals()`

- `get_corr()`

- `get_ratios()`

- `get_quantiles()`

- `get_diffs()`

- `get_t_test()`

- `get_pairwise()`

- `get_variance()`

### Frequency tables — `get_freqs()`

`get_freqs()` calculates weighted frequencies (aka proportions). The 
first argument is the survey design, the second is the variable you want 
to get the frequencies for. Here's a simple example where we calculate 
whether people are willing to consider voting for Trump.

```{r freqs-basic}
get_freqs(ns_wave1_svy, consider_trump)
```


#### Analyzing multiple variables at once

A key piece of survey research involves select-all-that-apply style questions. 
For example, the Nationscape data asked people: "We're interested in where you
might have heard news about politics in the last week. Please indicate which of 
the following sources you used." Rather than looking at each one individually,
`get_freqs()` accepts `tidy-select` expressions, which allows you to pass in
multiple variables. Let's look at an example:

```{r freqs-multi}
get_freqs(ns_wave1_svy, c(news_sources_facebook:news_sources_other))
```

The `name` column identifies which variable each row belongs to; `value`
holds the response code. You can also change the name of the columns if
you want. For example:

```{r freqs-rename}
ns_wave1_svy |>
  get_freqs(
    c(news_sources_facebook:news_sources_other),
    names_to = "news_source",
    values_to = "choice"
  )
```


### Weighted means — `get_means()`

`get_means()` estimates the survey-weighted mean of a continuous variable.

```{r means-basic}
# Average favorability towards Biden
ns_wave1_svy |>
  # remove those who said "Not sure" (coded as 999)
  surveytidy::filter_out(cand_favorability_biden == 999) |>
  get_means(cand_favorability_biden)
```


### Population totals — `get_totals()`

`get_totals()` estimates the weighted total for the target population. When
called without `x`, it simply provides a sum of the weights. The meaning of
the result depends on how the weights are scaled. Pew's Jewish-American 
study scales the weights so it gives the estimated size of the Jewish-
American population:

```{r totals-pew}
pew_jewish_svy |>
  # only include jews by religion and jews of no religion to match Pew's report
  surveytidy::filter(jewishcat %in% c(1:2)) |>
  get_totals()
```

Compare that to the GSS data from earlier, where the weights are scaled to
the sample size (N = 3,309):

```{r totals-ns}
get_totals(gss_svy)
```

Specifying a variable in `x` computes the weighted total for that variable. 
To show this, we'll use the `ca_api_2000_svy` object from before to 
determine how many students are enrolled in the California API system.

```{r totals-x}
get_totals(ca_api_2000_svy, x = enroll)
```

To see the weighted total within each level of a categorical variable, 
use the `group` argument. To show this, we'll look at how how many
Jewish-Americans fall in each age category:

```{r totals-group}
pew_jewish_svy |>
  # only include jews by religion and jews of no religion to match Pew's report
  surveytidy::filter(jewishcat %in% c(1:2)) |>
  get_totals(group = age4cat)
```

---

### Weighted correlations — `get_corr()`

`get_corr()` estimates survey-weighted Pearson correlations between two or
more continuous variables. Confidence intervals use the Fisher Z
transformation, guaranteeing bounds in (−1, 1).

Let's look at favorability for Trump and Biden. First we clean the underlying
data frame using the `surveytidy` package by dropping rows with missing 
values and removing "Not sure" responses (coded `999`).

```{r corr-basic}
ns_wave1_clean_svy <- ns_wave1_svy |>
  surveytidy::drop_na(
    cand_favorability_trump,
    cand_favorability_biden
  ) |>
  surveytidy::filter_out(
    cand_favorability_trump == 999,
    cand_favorability_biden == 999
  )

get_corr(
  ns_wave1_clean_svy,
  c(cand_favorability_trump, cand_favorability_biden)
)
```

Next, let's look at favorability across multiple variables.

```{r corr-multi}
fav_vars <- c(
  "cand_favorability_trump",
  "cand_favorability_biden",
  "cand_favorability_harris",
  "cand_favorability_sanders",
  "cand_favorability_warren",
  "cand_favorability_buttigieg",
  "cand_favorability_pence"
)

ns_wave1_multi <- ns_wave1_clean_svy |>
  # remove NAs from all variables of interest
  surveytidy::drop_na(tidyselect::all_of(fav_vars)) |>
  # remove those who said "not sure" to any variable of interest
  surveytidy::filter_out(
    dplyr::if_any(
      tidyselect::all_of(fav_vars),
      \(x) x == 999
    )
  )

get_corr(
  ns_wave1_multi,
  c(cand_favorability_trump:cand_favorability_pence)
)
```

The output defaults to a long version where each row is a unique
variable pair. It shows the correlation in `r`, the confidence
intervals, p-values, and other relevant information.

Switch to wide format for a more familiar correlation-matrix layout:

```{r corr-wide}
get_corr(
  ns_wave1_multi,
  c(cand_favorability_trump:cand_favorability_pence),
  format = "wide"
)
```


---

### Ratio estimation — `get_ratios()`

`get_ratios()` estimates the ratio of two weighted totals. This is useful 
when you want an estimate that doesn't change relative to the scale of 
the weights, like wages per hour, spending per household member, or
disease prevalence ratios. 

We'll illustrate this with a less conventional example, comparing 
Trump's favorability to Biden's favorability. In this example, a score 
below 1 would mean that Trump is viewed more favorably, and a score above 
1 would mean Biden is viewed more favorably. We'll also use the 
`ns_wave1_multi` object from the `get_corr()` section since it already has
missing values and "Not sure" responses (999) removed.

```{r ratios-basic}
get_ratios(
  ns_wave1_multi,
  numerator = cand_favorability_trump,
  denominator = cand_favorability_biden
)
```

---

### Weighted quantiles — `get_quantiles()`

`get_quantiles()` estimates survey-weighted quantiles using the Woodruff (1952)
confidence interval method. Confidence intervals are derived by inverting the
weighted CDF rather than assuming normality, so they are generally **asymmetric**
around the estimate and always respect the range of the data. By default, it 
calculates the quantiles at the 25th, 50th, and 75th percentile.

```{r quantiles-basic}
# Quartiles and median of age (default probs = c(0.25, 0.5, 0.75))
get_quantiles(ns_wave1_svy, age)
```


---

### Treatment effects — `get_diffs()`

`get_diffs()` estimates the difference in means between each group and a
reference group using survey-weighted regression. Use it when you have a
categorical treatment variable with two or more levels and want to compare
each group against a baseline.

Here we estimate how Biden favorability differs by party identification.
The first factor level is used as the reference group by default; use
`ref_level` to change it.

```{r diffs-basic}
ns_wave1_svy |>
  surveytidy::filter_out(cand_favorability_biden == 999) |>
  get_diffs(cand_favorability_biden, treats = pid3)
```

Use `show_pct_change = TRUE` to add a column showing how much each group
differs from the reference mean in percentage terms:

```{r diffs-pct}
ns_wave1_svy |>
  surveytidy::filter_out(cand_favorability_biden == 999) |>
  get_diffs(
    cand_favorability_biden,
    treats = pid3,
    show_pct_change = TRUE
  )
```

---

### Two-sample t-test — `get_t_test()`

`get_t_test()` compares weighted means between exactly two groups using a
design-based t-test. The `by` variable must have exactly two levels.

```{r t-test-basic}
get_t_test(gss_svy, hrs1, by = sex)
```

The output includes the estimated difference, the mean for each group, 
standard error/confidence interval, t-statistic, degrees of freedom,
and p-value.

---

### All-pairs comparisons — `get_pairwise()`

When your grouping variable has more than two levels, `get_pairwise()`
runs all k(k−1)/2 pairwise t-tests in one call. P-values are adjusted
for multiple comparisons using the Holm method by default.

```{r pairwise-basic}
get_pairwise(ns_wave1_svy, age, by = pid3)
```

Each row is one pair of groups. Use `pval_adj` to change the correction
method: `"bonferroni"`, `"BH"`, `"none"`, etc.

---

### Population variance — `get_variance()`

`get_variance()` estimates the finite-population variance of a variable —
how spread out the variable is in the population, not the uncertainty of
the estimate. It accepts the same `group`, `variance`, and `n_weighted`
arguments as the other functions.

```{r variance-basic}
get_variance(ns_wave1_svy, age)
```


### Subgroup analysis — the `group` argument

Every analysis function accepts a `group` argument for computing estimates
separately within levels of a categorical variable. Pass a bare column name
or multiple using `c()`. For example, we'll look at Trump consideration
by party identification.

```{r group-means}
get_freqs(ns_wave1_svy, consider_trump, group = pid3)
```

Rows where the grouping variable is `NA` are excluded from all groups and do
not appear in the output. Responses within each group sum to 100% for
`get_freqs()`.

---

### Controlling uncertainty output

All analysis functions share a common `variance` argument. You can request
any combination of:

| Code   | What it returns                                     |
|--------|-----------------------------------------------------|
| `"se"` | Standard error                                      |
| `"ci"` | Confidence interval: `ci_low`, `ci_high`            |
| `"var"`| Variance (square of the SE)                         |
| `"cv"` | Coefficient of variation (SE / estimate)            |
| `"moe"`| Margin of error at `conf_level`                     |
| `"deff"` | Design effect (complex design variance / SRS variance) |

The `conf_level` argument controls the confidence level for `"ci"` and
`"moe"`. The default is `0.95`; for a 90% CI:

```{r variance-options}
get_means(
  ns_wave1_svy,
  age,
  variance = c("se", "ci", "moe"),
  conf_level = 0.9
)
```

Set `variance = NULL` to suppress all uncertainty columns and return point
estimates and sample counts only.

Add `n_weighted = TRUE` to include the estimated population count, the sum of
weights, alongside the unweighted sample count `n`. Using `get_freqs()` on
Pew's Jewish-Americans data, we can see both the proportion and the estimated
population size for each age category:

```{r n-weighted}
get_freqs(pew_jewish_svy, age4cat, n_weighted = TRUE)
```


---

## Regression

surveycore supports design-based regression via `survey_glm()`. It fits a
weighted generalized linear model with support for Gaussian (OLS), 
logistic, Poisson, and other methods, and returns a `survey_glm_fit` 
object.

```{r glm-fit}
fit <- gss_svy |>
  # convert race to a factor so one variable is a factor
  surveytidy::mutate(
    race_f = surveytidy::make_factor(race)
  ) |>
  survey_glm(hrs1 ~ sex + degree + age + race_f)

fit
```

Use `clean()` to tidy the output into a one-row-per-coefficient tibble with
estimates, standard errors, confidence intervals, and p-values:

```{r glm-clean}
clean(fit)
```

For logistic or Poisson models, pass `exponentiate = TRUE` to `clean()` to
report odds ratios or rate ratios instead of log-scale coefficients.

---

## Summary

| Function | Use for |
|---|---|
| `get_freqs()` | Categorical variables — weighted distributions, percentages |
| `get_means()` | Continuous variables — weighted means |
| `get_totals()` | Population counts or aggregates — weighted sums |
| `get_corr()` | Pairwise Pearson correlations |
| `get_ratios()` | Ratios of two weighted totals |
| `get_quantiles()` | Weighted quantiles and median — Woodruff CIs |
| `get_diffs()` | Group comparisons — treatment effects vs. a reference group |
| `get_t_test()` | Two-group mean comparison — design-based t-test |
| `get_pairwise()` | All-pairs t-tests with multiple-comparison adjustment |
| `get_variance()` | Finite-population variance of a continuous variable |
| `survey_glm()` + `clean()` | Design-based regression — OLS, logistic, Poisson |

All functions:
- Return a tibble subclass ready for further analysis or display
- Accept a `group` argument for subgroup estimates
- Accept a `variance` argument to control which uncertainty columns appear
- Handle all survey design classes: `survey_taylor`, `survey_replicate`,
  `survey_twophase`, and `survey_nonprob`