---
title: "Getting Started with immunogenetr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with immunogenetr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

immunogenetr is a comprehensive toolkit for clinical HLA informatics, built on
tidyverse principles. It uses the genotype list string
(GL string, <https://glstring.org/>) as its core data structure for storing and
computing HLA genotype data.

This vignette walks through the main workflows:

1. Converting tabular HLA data to GL strings
2. Splitting GL strings back into individual loci
3. Calculating mismatches between recipient and donor
4. Summarizing HLA matching for transplantation
5. Working with HLA allele names (truncation, prefixes, regex)
6. Converting column names between WHO and tidyverse-friendly formats
7. Reading HML files

## Setup

```{r message=FALSE}
library(immunogenetr)
library(dplyr)
```

## Converting tabular HLA data to GL strings

Clinical HLA data is typically stored in a tabular format, with each allele in
its own column. immunogenetr includes the `HLA_typing_1` dataset as an example:

```{r}
# HLA_typing_1 contains typing for 10 individuals across all classical HLA loci.
head(HLA_typing_1, 3)
```

The `HLA_columns_to_GLstring()` function converts these columns into a single
GL string per individual. When used inside `mutate()`, pass `.` as the first
argument to reference the working data frame:

```{r}
HLA_typing_GL <- HLA_typing_1 %>%
  # Convert all typing columns (A1 through DPB1_2) into a GL string.
  mutate(
    GL_string = HLA_columns_to_GLstring(., HLA_typing_columns = A1:DPB1_2),
    .after = patient
  ) %>%
  # Keep only patient ID and the new GL string column.
  select(patient, GL_string)

# View the GL strings.
HLA_typing_GL
```

Each GL string encodes the full genotype: alleles within a gene copy are
separated by `/` (ambiguity), gene copies by `+`, and loci by `^`.
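
To make the delimiter hierarchy concrete, the sketch below assembles a small GL string by hand in base R. The `typing` list and its allele values are made up for illustration, and this is not how `HLA_columns_to_GLstring()` is implemented.

```r
# Hypothetical typing for two loci. Each list element is one gene copy;
# a gene copy holding more than one allele is ambiguous.
typing <- list(
  "HLA-A" = list("HLA-A*02:01", c("HLA-A*03:01", "HLA-A*03:02")),
  "HLA-B" = list("HLA-B*07:02", "HLA-B*44:02")
)

# Join ambiguous alleles with "/", gene copies with "+", and loci with "^".
locus_strings <- vapply(typing, function(copies) {
  copy_strings <- vapply(copies, paste, character(1), collapse = "/")
  paste(copy_strings, collapse = "+")
}, character(1))

gl_manual <- paste(locus_strings, collapse = "^")
gl_manual
#> [1] "HLA-A*02:01+HLA-A*03:01/HLA-A*03:02^HLA-B*07:02+HLA-B*44:02"
```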

## Splitting GL strings into loci

To go the other direction, `GLstring_genes()` splits a GL string back into
separate columns by locus:

```{r}
# Take the first patient's GL string and split it into locus columns.
# Note: GLstring_genes and GLstring_genes_expanded use pivot_longer on all
# columns, so only pass the GL string column (no other data types).
single_patient <- HLA_typing_GL[1, "GL_string", drop = FALSE]
GLstring_genes(single_patient, "GL_string")
```

For a fully expanded view with one allele per row, use
`GLstring_genes_expanded()`:

```{r}
GLstring_genes_expanded(single_patient, "GL_string")
```
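
For intuition about what a one-allele-per-row expansion involves, here is a rough base R sketch that splits a GL string on `^`, then `+`, then `/`. The helper name `gl_to_long()` and the example string are hypothetical; the sketch handles only these three delimiters and is not the package's implementation.

```r
# Hypothetical helper: expand a GL string into one row per allele.
# Only the "^", "+", and "/" delimiters are handled.
gl_to_long <- function(gl) {
  loci <- strsplit(gl, "^", fixed = TRUE)[[1]]
  rows <- lapply(loci, function(locus) {
    copies <- strsplit(locus, "+", fixed = TRUE)[[1]]
    alleles <- strsplit(copies, "/", fixed = TRUE)
    data.frame(
      # The locus name is everything before the asterisk.
      locus = sub("\\*.*$", "", unlist(alleles)),
      # Index of the gene copy each allele belongs to.
      gene_copy = rep(seq_along(copies), lengths(alleles)),
      allele = unlist(alleles),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

long <- gl_to_long("HLA-A*02:01+HLA-A*03:01/HLA-A*03:02^HLA-B*07:02+HLA-B*44:02")
long
```

Ambiguous alleles share a `gene_copy` index, so the ambiguity in the original string is still recoverable from the long table.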

## Calculating HLA mismatches

The mismatch functions are the core of immunogenetr. They all take a recipient
GL string, a donor GL string, one or more loci, and a direction.

Let's set up a recipient/donor pair:

```{r}
# Patient 7 is the recipient, patient 9 is the donor.
recip_gl <- HLA_typing_GL %>% filter(patient == 7) %>% pull(GL_string)
donor_gl <- HLA_typing_GL %>% filter(patient == 9) %>% pull(GL_string)
```

### Is there a mismatch? (`HLA_mismatch_logical`)

```{r}
# Check if there is an HLA-A mismatch in the graft-vs-host direction.
HLA_mismatch_logical(recip_gl, donor_gl, "HLA-A", direction = "GvH")

# Check host-vs-graft direction.
HLA_mismatch_logical(recip_gl, donor_gl, "HLA-A", direction = "HvG")
```

### How many mismatches? (`HLA_mismatch_number`)

```{r}
# Count bidirectional mismatches across several loci at once.
HLA_mismatch_number(
  recip_gl, donor_gl,
  c("HLA-A", "HLA-B", "HLA-C", "HLA-DRB1"),
  direction = "bidirectional"
)
```

### Which alleles are mismatched? (`HLA_mismatched_alleles`)

```{r}
# Identify the specific mismatched alleles in the HvG direction.
HLA_mismatched_alleles(recip_gl, donor_gl, "HLA-A", direction = "HvG")
```
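
The `direction` argument determines whose alleles count as foreign. In the host-vs-graft (HvG) direction, the recipient's immune system can respond to donor alleles that the recipient lacks; in the graft-vs-host (GvH) direction, donor cells can respond to recipient alleles that the donor lacks. The base R sketch below illustrates that logic for a single locus with made-up alleles. It ignores ambiguity, homozygosity, and null alleles, and is not the package's implementation.

```r
# Hypothetical single-locus genotypes (two gene copies each).
recip_A <- c("HLA-A*01:01", "HLA-A*02:01")
donor_A <- c("HLA-A*02:01", "HLA-A*03:01")

# HvG: donor alleles that the recipient does not carry.
hvg <- setdiff(donor_A, recip_A)
# GvH: recipient alleles that the donor does not carry.
gvh <- setdiff(recip_A, donor_A)

hvg
#> [1] "HLA-A*03:01"
gvh
#> [1] "HLA-A*01:01"
```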

### Match count (`HLA_match_number`)

```{r}
# Count the number of matches (complement of mismatches).
HLA_match_number(
  recip_gl, donor_gl,
  c("HLA-A", "HLA-B", "HLA-C", "HLA-DRB1"),
  direction = "bidirectional"
)
```

## HLA match summaries for transplantation

The `HLA_match_summary_HCT()` function provides standard match grades used in
hematopoietic cell transplantation:

```{r}
# X-of-8 matching (A, B, C, DRB1 bidirectional).
HLA_match_summary_HCT(recip_gl, donor_gl,
  direction = "bidirectional",
  match_grade = "Xof8"
)

# X-of-10 matching (adds DQB1).
HLA_match_summary_HCT(recip_gl, donor_gl,
  direction = "bidirectional",
  match_grade = "Xof10"
)
```
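
An X-of-8 grade can be read as eight gene copies (two each at HLA-A, -B, -C, and -DRB1) minus the mismatched ones. The sketch below shows that arithmetic with made-up per-locus mismatch counts; it only illustrates the idea and does not reflect the internals of `HLA_match_summary_HCT()`.

```r
# Hypothetical bidirectional mismatch counts per locus (0, 1, or 2 each).
mismatches <- c("HLA-A" = 1, "HLA-B" = 0, "HLA-C" = 0, "HLA-DRB1" = 1)

# Two gene copies per locus, so four loci give eight possible matches.
matches_of_8 <- 2 * length(mismatches) - sum(mismatches)
matches_of_8
#> [1] 6
```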

### Finding the best donor

A common workflow is comparing one recipient against multiple potential donors:

```{r}
# Patient 3 is the recipient; all 10 individuals in the dataset, including
# patient 3, are treated as potential donors.
recipient <- HLA_typing_GL %>%
  filter(patient == 3) %>%
  select(GL_string) %>%
  rename(GL_string_recip = GL_string)

donors <- HLA_typing_GL %>%
  rename(GL_string_donor = GL_string, donor = patient) %>%
  # Cross-join to pair recipient with each donor.
  cross_join(recipient) %>%
  # Calculate the X-of-8 match grade for each pair.
  mutate(
    match_8of8 = HLA_match_summary_HCT(
      GL_string_recip, GL_string_donor,
      direction = "bidirectional",
      match_grade = "Xof8"
    ),
    .after = donor
  ) %>%
  # Sort best matches first.
  arrange(desc(match_8of8))

donors %>% select(donor, match_8of8)
```

## Working with HLA allele names

### Truncation

`HLA_truncate()` reduces allele resolution to a specified number of fields:

```{r}
# Truncate a four-field allele to two fields.
HLA_truncate("HLA-A*02:01:01:01", fields = 2)

# Works on full GL strings too.
HLA_truncate("HLA-A*02:01:01:01+HLA-A*03:01:01:02^HLA-B*07:02:01:01+HLA-B*44:02:01:01",
  fields = 2
)
```
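
Conceptually, truncation keeps the first few colon-separated fields of each allele name. A minimal base R sketch of that idea follows; `truncate_fields()` is a hypothetical helper that works on individual allele names only (not full GL strings), and any expression suffix on a dropped field is lost.

```r
# Hypothetical helper: keep the first `fields` colon-separated fields of
# each allele name. Works on single alleles, not on full GL strings.
truncate_fields <- function(alleles, fields = 2) {
  vapply(strsplit(alleles, ":", fixed = TRUE), function(parts) {
    paste(head(parts, fields), collapse = ":")
  }, character(1))
}

truncated <- truncate_fields(c("HLA-A*02:01:01:01", "HLA-B*07:02:01"), fields = 2)
truncated
#> [1] "HLA-A*02:01" "HLA-B*07:02"
```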

### Prefix management

`HLA_prefix_remove()` and `HLA_prefix_add()` manage the `HLA-` and locus
prefixes:

```{r}
# Remove all prefixes to get just the allele fields.
HLA_prefix_remove("HLA-A*02:01")

# Keep the locus designation but remove "HLA-".
HLA_prefix_remove("HLA-A*02:01", keep_locus = TRUE)

# Add the full prefix back.
HLA_prefix_add("02:01", "HLA-A*")

# "HLA-" is added by default.
HLA_prefix_add("A*02:01")
```
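
For comparison, the same prefix handling can be approximated on a single allele with base R string functions. This is only a sketch of the string manipulation involved, not the package's implementation, and it does not handle full GL strings.

```r
allele <- "HLA-A*02:01"

# Drop only the "HLA-" prefix, keeping the locus.
no_hla <- sub("^HLA-", "", allele)
no_hla
#> [1] "A*02:01"

# Drop everything up to and including the asterisk.
fields_only <- sub("^[^*]*\\*", "", allele)
fields_only
#> [1] "02:01"

# Put the full prefix back.
restored <- paste0("HLA-A*", fields_only)
restored
#> [1] "HLA-A*02:01"
```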

### Regex for GL string searching

`GLstring_regex()` creates regex patterns that accurately search within GL
strings, preventing partial matches across field boundaries:

```{r}
gl <- "HLA-A*02:01:01+HLA-A*68:01^HLA-B*07:01+HLA-B*15:01"

# A two-field search correctly matches the three-field allele.
pattern <- GLstring_regex("HLA-A*02:01")
stringr::str_detect(gl, pattern)

# But it won't falsely match a longer second field (14 vs. 149).
stringr::str_detect("HLA-A*02:149:01", GLstring_regex("HLA-A*02:14"))
```
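
The underlying problem is that a plain substring search cannot tell where a field ends. One way to guard against that, sketched here in base R, is to require that the allele be followed by a field separator, a GL string delimiter, or the end of the string. `allele_pattern()` is a hypothetical helper that guards only the right-hand boundary; it is not necessarily how `GLstring_regex()` builds its patterns.

```r
# Hypothetical helper: match an allele only when it is followed by a field
# separator, a GL string delimiter, or the end of the string.
allele_pattern <- function(allele) {
  # \Q...\E quotes the allele literally (PCRE), so "*" is not a quantifier.
  paste0("\\Q", allele, "\\E(?=[+^/|~:]|$)")
}

# A naive fixed-string search matches inside the longer field "149".
naive <- grepl("HLA-A*02:14", "HLA-A*02:149:01", fixed = TRUE)
naive
#> [1] TRUE

# The boundary-aware pattern does not.
guarded <- grepl(allele_pattern("HLA-A*02:14"), "HLA-A*02:149:01", perl = TRUE)
guarded
#> [1] FALSE

# A two-field search still finds the three-field allele.
found <- grepl(allele_pattern("HLA-A*02:01"), "HLA-A*02:01:01+HLA-A*68:01", perl = TRUE)
found
#> [1] TRUE
```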

## Column name repair

When working in the tidyverse, column names with dashes and asterisks are
inconvenient. `HLA_column_repair()` converts between WHO-standard (`HLA-A*`)
and tidyverse-friendly (`HLA_A`) formats:

```{r}
# GLstring_genes returns tidyverse-friendly names by default.
repaired <- GLstring_genes(single_patient, "GL_string")
names(repaired)

# Convert back to WHO format with asterisks.
who_names <- HLA_column_repair(repaired, format = "WHO", asterisk = TRUE)
names(who_names)
```

## Reading HML files

The `read_HML()` function extracts GL strings from HML (Histoimmunogenetics
Markup Language) files, a standard format for reporting HLA typing results
from next-generation sequencing:

```{r}
# immunogenetr ships with two example HML files.
hml_path <- system.file("extdata", "HML_1.hml", package = "immunogenetr")
hml_result <- read_HML(hml_path)
hml_result
```

## Disclaimer

This package is intended for research use. Any application that uses it in a
clinical setting must be independently validated according to local
regulations.