---
title: "Data Codebook"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Codebook}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(assemblykor)

# Helper: summarise a single variable as a one-row data frame
var_summary <- function(x, varname) {
  n_total <- length(x)
  n_miss <- sum(is.na(x))
  pct_miss <- sprintf("%.1f%%", 100 * n_miss / n_total)

  if (is.logical(x)) {
    type_str <- "logical"
    vals <- paste0("TRUE: ", sum(x, na.rm = TRUE),
                   ", FALSE: ", sum(!x, na.rm = TRUE))
  } else if (is.numeric(x)) {
    type_str <- "numeric"
    q <- quantile(x, c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
    vals <- sprintf("min=%.0f, Q1=%.0f, median=%.0f, Q3=%.0f, max=%.0f",
                    q[1], q[2], q[3], q[4], q[5])
  } else if (inherits(x, "Date")) {
    type_str <- "Date"
    vals <- paste(range(x, na.rm = TRUE), collapse = " to ")
  } else {
    type_str <- "character"
    u <- length(unique(x[!is.na(x)]))
    # seq_len() avoids the 1:0 trap when a column has no non-missing values
    top <- names(sort(table(x), decreasing = TRUE))[seq_len(min(3, u))]
    vals <- paste0(u, " unique; top: ", paste(top, collapse = ", "))
  }

  data.frame(
    Variable = varname,
    Type = type_str,
    Missing = pct_miss,
    Distribution = vals,
    stringsAsFactors = FALSE
  )
}

# Helper: build a summary table for a dataset, one row per variable
codebook_table <- function(df) {
  rows <- lapply(names(df), function(v) var_summary(df[[v]], v))
  do.call(rbind, rows)
}
```

This codebook documents all seven built-in datasets in `assemblykor`. For each dataset, we list every variable with its type, missing rate, and value distribution. The datasets link to one another through `member_id` (`proposer_id` in `bills`), `bill_id`, and, for data spanning several terms, `assembly`.

---

## legislators

**947 rows, 15 variables.** MP metadata for the 20th-22nd Korean National Assembly.

- **Unit of observation**: legislator-assembly
- **Key**: `member_id` + `assembly` (unique)
- **Source**: Open National Assembly API

```{r legislators-codebook, echo = FALSE}
data(legislators)
knitr::kable(codebook_table(legislators), row.names = FALSE)
```

---

## bills

**60,925 rows, 9 variables.** Legislative bill metadata (20th-22nd assembly).

- **Unit of observation**: bill
- **Key**: `bill_id` (unique)
- **Join**: `proposer_id` links to `legislators$member_id`
- **Source**: Open National Assembly API

```{r bills-codebook, echo = FALSE}
data(bills)
knitr::kable(codebook_table(bills), row.names = FALSE)
```

---

## wealth

**2,928 rows, 14 variables.** Legislator asset declaration panel (2015-2025, 13 disclosure periods).

- **Unit of observation**: legislator-year
- **Key**: `member_id` + `year` (unique)
- **Units**: all monetary values in thousands of KRW (1 unit = 1,000 won)
- **Source**: OpenWatch (CC BY-SA 4.0)

```{r wealth-codebook, echo = FALSE}
data(wealth)
knitr::kable(codebook_table(wealth), row.names = FALSE)
```

---

## seminars

**5,962 rows, 18 variables.** Legislator-year policy seminar activity (17th-22nd assembly, 2000-2025).

- **Unit of observation**: legislator-year
- **Key**: `member_id` + `year` (note: ~5% of `member_id` are `NA`)
- **Source**: National Assembly Seminar Database

```{r seminars-codebook, echo = FALSE}
data(seminars)
knitr::kable(codebook_table(seminars), row.names = FALSE)
```

---

## speeches

**15,843 rows, 9 variables.** Committee speech records from the Science and ICT Committee (22nd assembly, 2024).

- **Unit of observation**: speech turn
- **Key**: `date` + `speech_order` (unique within a meeting)
- **Source**: National Assembly committee minutes

```{r speeches-codebook, echo = FALSE}
data(speeches)
knitr::kable(codebook_table(speeches), row.names = FALSE)
```

---

## votes

**7,997 rows, 13 variables.** Plenary vote tallies (20th-22nd assembly).

- **Unit of observation**: bill vote
- **Key**: `bill_id` (unique)
- **Join**: `bill_id` links to `bills$bill_id` (~40% match rate; `votes` includes committee alternatives and budget bills not in `bills`)
- **Source**: Open National Assembly API

```{r votes-codebook, echo = FALSE}
data(votes)
knitr::kable(codebook_table(votes), row.names = FALSE)
```

---

## roll_calls

**368,210 rows, 8 variables.** Member-level roll call votes (22nd assembly, 1,233 bills).

- **Unit of observation**: legislator-bill vote
- **Key**: `member_id` + `bill_id` (unique)
- **Join**: `member_id` links to `legislators$member_id`; `bill_id` links to `votes$bill_id`
- **Source**: Open National Assembly API

```{r roll-calls-codebook, echo = FALSE}
data(roll_calls)
knitr::kable(codebook_table(roll_calls), row.names = FALSE)
```

---

## Dataset relationship diagram

```
             legislators (member_id + assembly)
            /        |        \
           /         |         \
      wealth     seminars     bills
   (member_id)  (member_id)  (proposer_id)
                                |
                             votes (bill_id)
                                |
                           roll_calls (bill_id + member_id)
                                |
                           legislators (member_id)

speeches --- legislators (member_id, 22nd assembly only)
```

`member_id` is the primary join key for the legislator-level datasets (`proposer_id` plays that role in `bills`), while `bill_id` links `bills`, `votes`, and `roll_calls`. Use `assembly` as a secondary key when joining datasets that span multiple assembly terms.
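
The join keys described above can be tried out on toy data before touching the real tables. The sketch below is illustrative only: the `*_toy` data frames, their values, and the `party` and `yes` columns are hypothetical, and it assumes `bills` carries an `assembly` column alongside `proposer_id`. It shows why `assembly` matters as a secondary key and how a `bill_id` match rate between `votes` and `bills` might be computed.

```r
# Toy stand-ins for the packaged datasets; all values are hypothetical.
legislators_toy <- data.frame(
  member_id = c("M001", "M001", "M002"),
  assembly  = c(21L, 22L, 22L),
  party     = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

bills_toy <- data.frame(
  bill_id     = c("B1", "B2", "B3"),
  proposer_id = c("M001", "M002", "M001"),
  assembly    = c(21L, 22L, 22L),  # assumed column, see lead-in
  stringsAsFactors = FALSE
)

votes_toy <- data.frame(
  bill_id = c("B2", "B3", "X9"),   # X9 has no counterpart in bills_toy
  yes     = c(150L, 180L, 200L),
  stringsAsFactors = FALSE
)

# Joining on the legislator ID alone duplicates bills for M001,
# who appears in two assembly terms.
naive <- merge(
  bills_toy, legislators_toy,
  by.x = "proposer_id", by.y = "member_id"
)

# Adding assembly as a secondary key keeps one row per bill.
bills_with_party <- merge(
  bills_toy, legislators_toy,
  by.x = c("proposer_id", "assembly"),
  by.y = c("member_id", "assembly"),
  all.x = TRUE
)

# Share of vote records whose bill_id also appears in the bills table.
match_rate <- mean(votes_toy$bill_id %in% bills_toy$bill_id)
```

Comparing `nrow(naive)` with `nrow(bills_with_party)` is a quick way to spot an accidental one-to-many join; the same check applies when merging `wealth` or `seminars` onto `legislators`.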