---
title: "Creating data packages"
author: "Thierry Onkelinx"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Creating data packages}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

The `data_package()` function creates a `datapackage.json` file for a directory containing CSV files that were created by `git2rdata`. This makes your data compatible with the [Frictionless Data](https://frictionlessdata.io/) specification, allowing other tools and platforms to discover and use your data.

A data package is a simple container format for describing a collection of data files. The `datapackage.json` file provides metadata about the package and its resources (data files).

## Basic usage

```{r setup}
library(git2rdata)
root <- tempfile("git2rdata-package")
dir.create(root)
```

First, create some data files in non-optimized format (CSV):

```{r create-data}
# Write several datasets in non-optimized (CSV) format
write_vc(
  iris,
  file = "iris",
  root = root,
  sorting = c("Species", "Sepal.Length"),
  optimize = FALSE # Use CSV format instead of optimized TSV
)
write_vc(
  mtcars,
  file = "mtcars",
  root = root,
  sorting = "mpg",
  optimize = FALSE
)
# Check what files were created
list.files(root, recursive = TRUE)
```

Now create the data package:

```{r create-package}
# Create the datapackage.json file
package_file <- data_package(root)
cat("Created:", package_file, "\n")
```

## Package contents

The `datapackage.json` file contains metadata for each CSV file:

```{r show-package}
# Read and display the package file
package_data <- jsonlite::read_json(package_file)
# Show the structure
str(package_data, max.level = 2)
```

Each resource in the package includes:

- **name**: The name of the dataset
- **path**: The relative path to the CSV file
- **profile**: The profile type (tabular-data-resource)
- **schema**: The schema describing the data structure

## Schema information

The schema for each resource describes the fields (columns) in the data:

```{r show-schema}
# Show the schema for the iris dataset
iris_resource <- package_data$resources[[1]]
cat("Resource name:", iris_resource$name, "\n")
cat("Number of fields:", length(iris_resource$schema$fields), "\n\n")
# Show first few fields
for (i in seq_len(min(3, length(iris_resource$schema$fields)))) {
  field <- iris_resource$schema$fields[[i]]
  cat(sprintf(
    "Field %d: %s (type: %s)\n",
    i, field$name, field$type
  ))
}
```

## Important notes

### CSV format required

`data_package()` only works with non-optimized git2rdata objects (CSV files). This is because the Frictionless Data specification expects CSV format.

```{r csv-required, error=TRUE}
# This will fail because optimized files use TSV format
optimized_root <- tempfile("git2rdata-optimized")
dir.create(optimized_root)
write_vc(
  iris,
  file = "iris",
  root = optimized_root,
  sorting = "Species",
  optimize = TRUE # This creates TSV files
)
# This will fail with an error
try(data_package(optimized_root))
unlink(optimized_root, recursive = TRUE)
```

### Metadata integration

The function reads the git2rdata metadata (`.yml` files) to extract field information, including:

- Field names
- Field types (mapped to Frictionless Data types)
- Factor levels (for categorical data)
- Description (if available through `update_metadata()`)

### Recursive search

The function searches recursively in the specified directory, so you can organize your data files in subdirectories:

```{r subdirectories}
# Create a subdirectory
subdir <- file.path(root, "subset")
dir.create(subdir)
# Write data in subdirectory
write_vc(
  head(iris, 50),
  file = file.path("subset", "iris_subset"),
  root = root,
  sorting = "Species",
  optimize = FALSE
)
# Recreate the package - it will include the subdirectory file
data_package(root)
# Check the package contents
package_data <- jsonlite::read_json(package_file)
cat("Number of resources:", length(package_data$resources), "\n")
```

## Use cases

### Data sharing

Create a data package to share your datasets with others:

```{r sharing, eval=FALSE}
# After creating your data files
write_vc(my_data, "my_data", root = "data", optimize = FALSE)
# Create the package
data_package("data")
# Share the entire 'data' directory
# Others can now use Frictionless Data tools to read your data
```

### Data validation

The Frictionless Data ecosystem provides tools to validate data packages:

```{r validation, eval=FALSE}
# After creating the package, use frictionless-py or other tools
# to validate your data package
system("frictionless validate datapackage.json")
```

### Data catalogs

Data packages can be published to data catalogs and portals that support the Frictionless Data specification, making your data discoverable.

## See also

- [Frictionless Data documentation](https://frictionlessdata.io/)
- [Data Package specification](https://specs.frictionlessdata.io/data-package/)
- The `metadata` vignette for adding descriptions to your data

```{r cleanup, include=FALSE}
unlink(root, recursive = TRUE)
```
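
## Reading a package back into R

Since the resulting `datapackage.json` follows the Frictionless Data specification, it can also be consumed from R. As a sketch (assuming the rOpenSci `frictionless` package is installed; it is not a dependency of `git2rdata`, and the paths refer to a directory prepared as in the examples above), reading a package back might look like:

```{r read-back, eval=FALSE}
# Read a data package created earlier with data_package()
library(frictionless)
package <- read_package(file.path("data", "datapackage.json"))
# List the names of the available resources
resources(package)
# Read one resource back as a data frame
iris_again <- read_resource(package, "iris")
```

Note that this reads the plain CSV files through the package schema, not through `read_vc()`, so git2rdata-specific features such as stable sorting are not applied.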