--- title: "Get started with rsgl" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Get started with rsgl} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This tutorial introduces the SGL language as well as usage of the rsgl package. ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Setup For use with the examples in this tutorial, we will create an in-memory [DuckDB](https://duckdb.org) database and load it with two tables, `cars` and `trees`. ```{r setup, warning = FALSE} library(rsgl) library(duckdb) con <- dbConnect(duckdb()) dbWriteTable(con, "cars", cars) dbWriteTable(con, "trees", trees) ``` Let's query each to view a sample of data: ```{r cars_sample_query} dbGetQuery(con, " select * from cars limit 5 ") ``` ```{r trees_sample_query} dbGetQuery(con, " select * from trees limit 5 ") ``` ## dbGetPlot The primary interface to rsgl is the `dbGetPlot` function, which takes a [DBI](https://dbi.r-dbi.org) database connection and a SGL statement and returns the corresponding plot. Although the examples in this tutorial use a connection to a DuckDB database, `dbGetPlot` will accept any DBI connection. ## The SGL Language ### The From Clause The `from` keyword precedes a data source specification, which is often the name of a table in the database. Here, we specify the `cars` table as the data source. ```{r from_cars} dbGetPlot(con, " visualize horsepower as x, miles_per_gallon as y from cars using points ") ``` This is similar in usage to the `from` keyword in SQL, except that only a single data source is allowed (i.e., a comma-separated list of table names is not valid). If data from multiple sources or pre-processing of data is necessary, then a SQL subquery can be provided: ```{r, from_subquery} dbGetPlot(con, " visualize horsepower as x, miles_per_gallon as y from ( select * from cars where origin = 'Japan' ) using points ") ``` ### The Using Clause The `using` keyword precedes the name of the geometric object(s) that will represent the data. Following ggplot2 terminology, these geometric objects are referred to as geoms. Our previous examples demonstrated representing data with point geoms. ### The Visualize Clause The `visualize` keyword precedes the aesthetic-to-column mapping, which maps perceivable traits of the geoms to data source columns. Our prior examples mapped the `x` and `y` positions of the point geoms to data source columns. However, aesthetics may be non-positional, as shown below. The `visualize` keyword most closely resembles the `select` keyword within SQL. ```{r, visualize} dbGetPlot(con, " visualize horsepower as x, miles_per_gallon as y, origin as color from cars using points ") ``` In addition to mapping aesthetics to columns, aesthetics can be mapped to expressions that include transformations and aggregations. ### Column-Level Transformations and Aggregations SGL supports column-level transformations and aggregations, as shown below where a binning transformation is combined with a count aggregation to produce a histogram on `miles_per_gallon`. ```{r, histogram} dbGetPlot(con, " visualize bin(miles_per_gallon) as x, count(*) as y from cars group by bin(miles_per_gallon) using bars ") ``` Here we see that, similarly to SQL, SGL has a `group by` clause where aggregation groupings are specified. Although SQL itself supports column-level transformation, grouping, and aggregation, it is desirable for SGL to provide additional support for these operations. Statistical graphics often require operations such as binning that are not natively supported by SQL. Additionally, SGL's column-level transformations and aggregations are performed after scaling, which cannot easily be replicated using SQL. Below is an example of this feature, where binning and counting are applied after log scaling, resulting in a log-scaled histogram. ```{r} dbGetPlot(con, " visualize bin(miles_per_gallon) as x, count(*) as y from cars group by bin(miles_per_gallon) using bars scale by log(x) ") ``` ### The Collect By Clause In SGL, a geom is classified as individual if it represents each record (after transformation and aggregation) by a distinct geometric object. Alternatively, a geom is classified as collective if it represents multiple records by one geometric object. For example, points and lines are individual and collective geoms, respectively, as shown below where the same data is represented using each. ```{r} dbGetPlot(con, " visualize year as x, avg(miles_per_gallon) as y from cars group by year using points ") ``` ```{r} dbGetPlot(con, " visualize year as x, avg(miles_per_gallon) as y from cars group by year using line ") ``` For collective geoms, the collection of records to represent by each object is determined automatically using reasonable defaults. This behavior can be overridden by providing explicit collections in the `collect by` clause. The `collect by` clause is similar to the `group by` clause, except that rather than defining groups to aggregate by, the `collect by` clause defines collections of records to be represented by one object. Below we see an example where the default collection is not ideal, followed by a preferrable explicit collection. ```{r} dbGetPlot(con, " visualize age as x, circumference as y from trees using line ") ``` ```{r} dbGetPlot(con, " visualize age as x, circumference as y from trees collect by tree_id using lines ") ``` ### Geom Qualifiers Geom qualifiers modify how geoms positionally represent data, and are specified as keywords that precede geom names within the `using` clause. Geom qualifiers can largely be classified into two groups, statistical qualifiers and collision qualifiers. Statistical qualifiers modify the positional representation via statistical transformation, such as linear regression: ```{r} dbGetPlot(con, " visualize age as x, circumference as y from trees using regression line ") ``` Collision qualifiers specify positional adjustments that are relevant to overlapping objects. Below we see an example of using the `jittered` qualifier to add a small amount of random variation so that overlapping points are discernible. Without the `jittered` qualifier: ```{r} dbGetPlot(con, " visualize origin as x, miles_per_gallon as y from cars using points ") ``` With the `jittered` qualifier: ```{r} dbGetPlot(con, " visualize origin as x, miles_per_gallon as y from cars using jittered points ") ``` ### The Layer Operator The graphics in previous sections contain a single layer of geometric objects. However, it is common to use multiple layers of objects to represent data in a single graphic. In SGL, multiple layers can be combined using the `layer` operator. For example, we can layer a regression line on a scatterplot: ```{r} dbGetPlot(con, " visualize horsepower as x, miles_per_gallon as y from cars using points layer visualize horsepower as x, miles_per_gallon as y from cars using regression line ") ``` Layers often share a data source and aesthetic mapping. To reduce verbosity in these cases, the `layer` operator can be applied directly to geom expressions: ```{r} dbGetPlot(con, " visualize horsepower as x, miles_per_gallon as y from cars using ( points layer regression line ) ") ``` Layers may have different data sources and aesthetic mappings. However, a graphic has a single scale for each aesthetic, regardless of the number of layers. As a result, a given aesthetic must be mapped to consistent type across all layers where it is present, e.g. an aesthetic cannot be mapped to a numerical type in one layer and a categorical type in another. ### The Scale By Clause Each mapped aesthetic has a scale that determines how data values are mapped to the corresponding visual property. Scales are determined implicitly by default, but can be explicitly specified within the `scale by` clause. Below we specify a `log` scale for the `x` and `y` aesthetics, overriding the default linear scaling for numerical mappings: ```{r} dbGetPlot(con, " visualize horsepower as x, miles_per_gallon as y from cars using ( points layer regression line ) scale by log(x), log(y) ") ``` Scaling is performed prior to column-level transformations and aggregations. Additionally, scaling is performed prior to any positional modifications specified by geom qualifiers, e.g., the regression calculation above is performed after log-scaling the x and y aesthetics. Scaling functions such as `log` are applied to aesthetic names rather than column names as these are modifications to aesthetic scales rather than actual data values. This distinction is demonstrated below, where a `log` function is instead applied to actual data values in a SQL subquery: ```{r} dbGetPlot(con, " visualize log_hp as x, log_mpg as y from ( select log(horsepower) as log_hp, log(miles_per_gallon) as log_mpg from cars ) using ( points layer regression line ) ") ``` ### Coordinate Systems The graphics in prior examples use Cartesian coordinates, but alternative coordinate systems are valid. In SGL, the coordinate system is inferred from the positional aesthetics in the `visualize` clause. For example, `x` and `y` aesthetics imply Cartesian coordinates, whereas `theta` and `r` imply polar coordinates. The examples below display the same information in Cartesian and polar coordinates. In the grammar of graphics (which SGL is based on), pie charts are stacked bar charts in a polar coordinate system. In SGL, the bar geom is stacked by default. ```{r} dbGetPlot(con, " visualize count(*) as y, origin as color from cars group by origin using bars ") ``` ```{r} dbGetPlot(con, " visualize count(*) as theta, origin as color from cars group by origin using bars ") ``` Since a SGL statement may have multiple layers, it may also have multiple `visualize` clauses, each with positional aesthetic mappings. Since a graphic has a single coordinate system, the positional aesthetics referenced across layers must be consistent, e.g. one layer cannot reference `x` and `y` aesthetics while another references `theta` and `r`. ### The Facet By Clause Faceting generates small multiples where each panel represents a different partition of the source data. Partitioning is determined by the unique values for expressions specified in the `facet by` clause. Faceting by a single expression generates horizontal panels by default, but this can be modified with an orientation keyword: ```{r} dbGetPlot(con, " visualize horsepower as x, miles_per_gallon as y from cars using points facet by origin ") ``` ```{r} dbGetPlot(con, " visualize horsepower as x, miles_per_gallon as y from cars using points facet by origin vertically ") ``` Two facet expressions may be specified, in which case one expression is represented horizontally while the other is represented vertically: ```{r} dbGetPlot(con, " visualize horsepower as x, miles_per_gallon as y from ( select *, case when year < 1977 then '< 1977' else '>= 1977' end as 'era' from cars ) using points facet by era, origin ") ``` In SGL, facets are a graphic-level property, meaning that each SGL statement has at most one `facet by` clause, and that each layer is faceted accordingly. ### The Title Clause Titles for aesthetic scales are automatically determined from aesthetic mappings. However, this can be overridden by providing explicit titles in the `title` clause: ```{r} dbGetPlot(con, " visualize horsepower as x, miles_per_gallon as y from cars using points title x as 'Horsepower', y as 'Miles Per Gallon' ") ``` ## Next steps - [Reference](https://sgl-projects.github.io/rsgl/reference/index.html) — details on the specific geoms, aesthetics, qualifiers, transformations, aggregations, and scales available in rsgl. - [Example gallery](https://sgl-projects.github.io/rsgl/articles/example-gallery.html) — a collection of plots generated with rsgl. - [SGL Paper](https://arxiv.org/pdf/2505.14690) — covers the language in greater depth, including SGL's underlying grammar of graphics.