--- title: "Local estimates for ecological inference" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Local estimates for ecological inference} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The main estimation routine in **seine**, `ei_est()`, produces _global_ estimates, $\mathbb{E}[Y\mid X]$ for the entire population under study. Sometimes, however, estimates for each individual aggregation unit (such as precincts, counties, or other geographies) are of interest. **seine** provides two approaches for this. `ei_bounds()` computes guaranteed-valid partial identification (Duncan-Davis) bounds for each unit, relying only on the accounting identity and known limits on the outcome. `ei_est_local()` produces point estimates and confidence intervals under the same coarsening at random (CAR) assumption that `ei_est()` relies on, plus (for now) an additional homoskedasticity assumption. # Setting up the analysis We begin by loading the 1968 election data, and defining an `ei_spec` object that records the outcome, predictor, covariate, and total-count columns, following the setup from `vignette("seine")`. We now want to estimate the individual-level association between race and presidential vote choice _within_ each county. ```{r setup} library(seine) data(elec_1968) ``` For the bounds, only the data are needed. For `ei_est_local()`, we also need a fitted regression model, so we set up an `ei_spec` object and call `ei_ridge()`. ```{r} spec = ei_spec( elec_1968, predictors = vap_white:vap_other, outcome = pres_dem_hum:pres_abs, total = pres_total, covariates = c(state, pop_city:pop_rural, farm:educ_coll, inc_00_03k:inc_25_99k), preproc = function(x) { x = model.matrix(~ 0 + ., x) # convert factors to dummies bases::b_bart(x, trees = 200) } ) m = ei_ridge(spec) ``` See the main vignette (`vignette("seine")`) for a full walkthrough of the estimation workflow. # Partial identification bounds The simplest local quantities to compute are partial identification bounds: for each county and each predictor-outcome combination, the range of values for the local estimand $\beta_{gj} = \Pr(Y = j \mid X = x_g, G = g)$ that is consistent with the observed aggregate data. These bounds require no modeling assumptions beyond the accounting identity and the specified bounds on the outcome. `ei_bounds()` takes an `ei_spec` object (or a formula) and returns a data frame with one row per unit per predictor-outcome combination. The `bounds` argument specifies the known limits on the outcome, which are typically [0, 1] for proportions. Without known bounds on the outcome or other statistical assumptions, the local estimates are completely unidentified. ```{r} bounds = ei_bounds(spec, bounds = c(0, 1)) head(bounds) ``` The `min` and `max` columns give the sharp bounds for each unit. The `weight` column contains the number of observations in that unit and predictor group. This column can be used to aggregate estimates across units. Since aggregating bounds is a common operation, `ei_bounds()` provides the `global = TRUE` argument to automatically compute the global bounds by taking a weighted average of the local bounds. ```{r} ei_bounds(spec, bounds = c(0, 1), global = TRUE) ``` As is common with partial identification, the bounds can be quite wide. For example, without further assumptions, nothing can be said about the Black preference for Wallace except that it is between 0.8% and 94.6%. The `as.array()` method reformats the output as a four-dimensional array with dimensions units by predictors by outcomes by (min, max), which may be convenient for further analysis. ```{r} dim(as.array(bounds)) ``` Bounds on contrasts, such as the difference in vote share between White and Black voters, can be computed directly by passing a `contrast` argument. Computing these bounds is more involved and requires solving a linear program for each unit, but the output is the same format as above. Here, we calculate bounds on White-Black racially polarized voting for each candidate. ```{r} ei_bounds(spec, bounds = c(0, 1), contrast = list(predictor = c(1, -1, 0))) ``` Note that because contrasts _across_ predictor groups involve fundamentally different numbers of people in each unit, these bounds cannot be aggregated across units in a meaningful way. For a bound on the global contrast, you can aggregate non-contrasted bounds and then take the contrast of the global bounds. # Estimating the local covariance Point estimates and confidence intervals from `ei_est_local()` require a model for how the local estimands vary around their conditional mean within each unit. This variation is captured by the `b_cov` argument, a covariance matrix for the local estimands $\beta_{gj}$ around the regression prediction. `ei_local_cov()` estimates this covariance from the data, under the CAR assumption and an additional homoskedasticity condition. A small amount of shrinkage is applied, controlled by the `prior_obs` argument. Details are provided in McCartan & Kuriwaki (2025+). ```{r} b_cov = ei_local_cov(m, spec) round(sqrt(diag(b_cov)), 3) round(b_cov[1:3, 1:3], 3) ``` The rows and columns are ordered by predictor within outcome, i.e. (Y1|X1, Y1|X2, ..., Y2|X1, Y2|X2, ...). One can visualize the estimated correlation structure with `heatmap()` or the [`corrplot`](https://cran.r-project.org/package=corrplot) package. ```{r fig.height=7.5, fig.alt = "Heatmap of the estimated correlation structure for local estimands"} heatmap( cov2cor(b_cov), Rowv = NA, col = hcl.colors(n = 100, palette = "Spectral"), symm = TRUE, mar = c(12, 12) ) ``` We strongly recommend examining the estimated covariance structure and evaluating if the estimates are plausible. For example, preferences for Black and Other voters are correlated (blue) and somewhat anticorrelated with preferences for White voters (orange/red). Support for Humphrey is generally anticorrelated with support for Nixon and Wallace. These two patterns make sense in context. On the other hand, the estimated standard deviation for Black support for different candidates is quite large, around 0.3, especially compared to low variation in White preference for e.g. Humphrey, which has standard deviation just 0.086. As a reminder, these estimates are of the _residual_ variation, after controlling for covariates. Still, we might be expect residual variation in Black support to be smaller across the board, especially for the segregationist Wallace. # Local point estimates With the regression model and a covariance structure in hand, `ei_est_local()` projects the regression predictions onto the accounting constraint to produce local estimates that exactly satisfy the ecological identity within each county. The function also computes asymptotically valid confidence intervals. Estimates and intervals will be truncated to the possible values obtained from `ei_bounds()`. The `b_cov` argument controls the assumed covariance structure. There are three common choices: - `b_cov = ei_local_cov(m, spec)`: the **data-estimated covariance**, which uses the structure estimated above. *This is the recommended choice*. - `b_cov = 0`: the **orthogonal model**, which assumes no correlation in the local estimands across predictors within a unit. When the estimated `b_cov` is implausible, this can be an acceptable choice. - `b_cov = 0.95` (or another value near 1): the **neighborhood model**, which assumes nearly perfect positive correlation across predictors. This is appropriate when counties are relatively homogeneous and the local estimands are expected to vary together. However, it leads to quite narrow confidence intervals, which may not cover the truth in practice. Here, we fit all three and compare the average width of the resulting confidence intervals. ```{r} e_rcov = ei_est_local(m, spec, b_cov = b_cov, bounds = c(0, 1), sum_one = TRUE) e_orth = ei_est_local(m, spec, b_cov = 0, bounds = c(0, 1), sum_one = TRUE) e_nbhd = ei_est_local(m, spec, b_cov = 0.95, bounds = c(0, 1), sum_one = TRUE) c( estimated = mean(e_rcov$conf.high - e_rcov$conf.low), orthogonal = mean(e_orth$conf.high - e_orth$conf.low), neighborhood = mean(e_nbhd$conf.high - e_nbhd$conf.low) ) ``` Setting `sum_one = TRUE` enforces the constraint that the local vote shares across candidates sum to one within each racial group, which is the correct restriction for these data. The `bounds = c(0, 1)` argument truncates estimates to the unit interval and caps the standard errors implied by the width of the bounds. We can visualize the distribution of local estimates for a particular predictor-outcome combination with a histogram. ```{r fig.alt="Histogram of local estimates for White support for Humphrey"} hist( subset(e_rcov, predictor == "vap_white" & outcome == "pres_dem_hum")$estimate, breaks = 50, main = "White support for Humphrey", xlab = "% Humphrey" ) ``` To examine results for a specific county, we can filter the output. ```{r} subset(e_rcov, .row == 1) ``` The `as.array()` method provides a convenient view of the point estimates as a three-dimensional array. ```{r} head(as.array(e_rcov)[, , "pres_rep_nix"]) ``` # References McCartan, C., & Kuriwaki, S. (2025+). Identification and semiparametric estimation of conditional means from aggregate data. Working paper [arXiv:2509.20194](https://arxiv.org/abs/2509.20194). Chernozhukov, V., Cinelli, C., Newey, W., Sharma, A., & Syrgkanis, V. (2024). Long story short: Omitted variable bias in causal machine learning (No. w30302). *National Bureau of Economic Research.* This vignette was originally produced by a large language model, and then reviewed and edited by the package authors.