## 16.1 Graphical queries

This section focuses on a particular approach to linking views known as graphical (database) queries using the R package plotly. With plotly, one can write R code to pose graphical queries that operate entirely client-side in a web browser (i.e., no special web server or callback to R is required). In addition to teaching you how to pose queries with the highlight_key() function, this section shows you how to control how queries are triggered and visually rendered via the highlight() function.

Figure 16.1 shows a scatterplot of the relationship between weight and miles per gallon of 32 cars. It also uses highlight_key() to assign the number of cylinders to each point so that when a particular point is ‘queried’ all points with the same number of cylinders are highlighted (the number of cylinders is displayed with text just for demonstration purposes). By default, a mouse click triggers a query, and a double-click clears the query, but both of these events can be customized through the highlight() function. By typing help(highlight) in your R console, you can learn more about what events are supported for turning graphical queries on and off.

library(plotly)
mtcars %>%
  highlight_key(~cyl) %>%
  plot_ly(
    x = ~wt, y = ~mpg, text = ~cyl, mode = "markers+text",
    textposition = "top", hoverinfo = "x+y"
  ) %>%
  highlight(on = "plotly_hover", off = "plotly_doubleclick")

FIGURE 16.1: A visual depiction of how highlight_key() attaches metadata to graphical elements to enable graphical database queries. Each point represents a different car and the number of cylinders (cyl) is assigned as metadata so that when a particular point is queried all points with the same number of cylinders are highlighted. For the interactive, see https://plotly-r.com/interactives/link-intro.html

Generally speaking, highlight_key() assigns data values to graphical marks so that when graphical mark(s) are directly manipulated through the on event, it uses the corresponding data values (call them $SELECTION_VALUE) to perform an SQL query of the following form:

SELECT * FROM mtcars WHERE cyl IN $SELECTION_VALUE
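To make the query analogy concrete, here is a minimal base R sketch of the same subsetting logic. The name `selection_value` and the choice of 4 cylinders are assumptions made for illustration; plotly performs the equivalent lookup in the browser rather than in R.

```r
# Hypothetical selection: the user queried a point whose cyl value is 4.
selection_value <- c(4)

# Base R equivalent of: SELECT * FROM mtcars WHERE cyl IN (4)
queried <- mtcars[mtcars$cyl %in% selection_value, ]

# Every returned row shares the queried number of cylinders.
unique(queried$cyl)
nrow(queried)
```

The rows in `queried` correspond to the points that would be highlighted in Figure 16.1 after querying any 4-cylinder car.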

For a more useful example, let's use graphical querying to pose interactive queries of the txhousing dataset. This dataset contains monthly housing sales in Texan cities, acquired from the TAMU real estate center and made available via the ggplot2 package. Figure 16.2 shows the median house price in each city over time, which produces a rather busy (spaghetti) plot. To help combat the overplotting, we could add the ability to click a given point on a line to highlight that particular city. This interactive ability is enabled by simply using highlight_key() to declare that the city variable be used as the querying criterion within the graphical querying framework.

One subtlety behind Figure 16.2 is that every point along a line may have a different data value assigned to it. In this case, since the city column is used as both the visual grouping and querying variable, we effectively get the ability to highlight a group by clicking on any point along that line. Section 16.4.1 has examples of using different grouping and querying variables to query multiple related groups of visual geometries at once, which can be a powerful technique.

# load the txhousing dataset
data(txhousing, package = "ggplot2")

# declare city as the SQL 'query by' column
tx <- highlight_key(txhousing, ~city)

# initiate a plotly object
base <- plot_ly(tx, color = I("black")) %>%
  group_by(city)

# create a time series of median house price
base %>%
  group_by(city) %>%
  add_lines(x = ~date, y = ~median)

FIGURE 16.2: Graphical query of housing prices in various Texas cities. The query in this particular example must be triggered through clicking directly on a time series. For the interactive, see https://plotly-r.com/interactives/txmissing.html

Querying a city via direct manipulation is somewhat helpful for focusing on a particular time series, but it’s not so helpful for querying a city by name and/or comparing multiple cities at once. As it turns out, plotly makes it easy to add a selectize.js powered dropdown widget for querying by name (aka indirect manipulation) by setting selectize = TRUE. When it comes to comparing multiple cities, we want to be able to both retain previous selections (persistent = TRUE) and control the highlighting color (dynamic = TRUE). The video linked in the caption of Figure 16.3 explains how to use these features to compare pricing across different cities.

time_series <- base %>% add_lines(x = ~date, y = ~median)  # base is already grouped by city
highlight(time_series, on = "plotly_click", selectize = TRUE, dynamic = TRUE, persistent = TRUE)

FIGURE 16.3: Using a selectize dropdown widget to search for cities by name and comparing multiple cities through persistent selection with a dynamic highlighting color. For a visual and audio explanation, see https://bit.ly/txmissing-modes.

By querying a few different cities in Figure 16.3, one obvious thing we can learn is that not every city has complete pricing information (e.g., South Padre Island, San Marcos, etc.). To learn more about which cities are missing information, as well as how that missingness is structured, Figure 16.4 links a view of the raw time series to a dot-plot of the corresponding number of missing values per city. In addition to making it easy to see how cities rank in terms of missing house prices, it also provides a way to query the corresponding time series (i.e., reveal the structure of those missing values) by brushing cities in the dot-plot. This general pattern of linking aggregated views of the data to more detailed views fits the famous and practical information visualization advice from Shneiderman (1996): “Overview first, zoom and filter, then details on demand”.

# remember, base is a plotly object, but we can use dplyr verbs to
# manipulate the input data
# (txhousing with city as a grouping and querying variable)
dot_plot <- base %>%
  summarise(miss = sum(is.na(median))) %>%
  filter(miss > 0) %>%
  add_markers(x = ~miss, y = ~forcats::fct_reorder(city, miss), hoverinfo = "x+y") %>%
  layout(
    xaxis = list(title = "Number of months missing"),
    yaxis = list(title = "")
  )

subplot(dot_plot, time_series, widths = c(0.2, 0.8), titleX = TRUE) %>%
  layout(showlegend = FALSE) %>%
  highlight(on = "plotly_selected", dynamic = TRUE, selectize = TRUE)

FIGURE 16.4: Linking a dot-plot of the number of missing housing prices with the raw time series. By brushing markers on the dot-plot, their raw time series are highlighted on the right-hand side.

How does plotly know to highlight the time series when markers in the dot-plot are selected? The answer lies in what data values are embedded in the graphical markers via highlight_key(). When ‘South Padre Island’ is selected, like in Figure 16.5, it seems as though the logic says to simply change the color of any graphical elements that match that value, but the logic behind plotly’s graphical queries is a bit more subtle and powerful. Another, more accurate, framing of the logic is to first imagine a linked database query being performed behind the scenes (as in Figure 16.5). When ‘South Padre Island’ is selected, it first filters the aggregated dot-plot data down to just that one row, then it filters the raw time-series data down to every row with ‘South Padre Island’ as a city. The drawing logic will then call Plotly.addTraces() with the newly filtered data, which adds a new graphical layer representing the selection and gives us finely-tuned control over the visual encoding of the data query.
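The linked query described above can be sketched in base R. This is only an illustration of the filtering logic, not plotly's actual implementation (which runs in JavaScript); the data frames `raw` and `agg` and all values in them are made up for the example.

```r
# Toy stand-in for the raw time-series data (values are made up).
raw <- data.frame(
  city   = c("Abilene", "Abilene", "South Padre Island", "South Padre Island"),
  date   = c(2000, 2001, 2000, 2001),
  median = c(71400, 73500, NA, 125000)
)

# Aggregated view: number of missing medians per city (the dot-plot data).
miss <- tapply(is.na(raw$median), raw$city, sum)
agg <- data.frame(city = names(miss), miss = as.vector(miss))

# A selection is just a set of key values shared by both views.
selection <- "South Padre Island"

# The same key filters the aggregated data and the raw data.
agg_selected <- agg[agg$city %in% selection, ]
raw_selected <- raw[raw$city %in% selection, ]

agg_selected
raw_selected
```

In this framing, `raw_selected` plays the role of the filtered data that gets drawn as a new layer on top of the dimmed original marks.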

The biggest advantage of drawing an entirely new graphical layer with the filtered data is that it becomes easy to leverage statistical trace types for producing summaries that are conditional on the query. Figure 16.6 leverages this functionality to dynamically produce probability densities of house price in response to query events. Section 16.4.2 has more examples of leveraging statistical trace types with graphical queries.

hist <- base %>% add_histogram(x = ~median, histnorm = "probability density")
subplot(time_series, hist, nrows = 2) %>%
  layout(barmode = "overlay", showlegend = FALSE) %>%
  highlight(dynamic = TRUE, selectize = TRUE, selected = attrs_selected(opacity = 0.3))

FIGURE 16.6: Linking house prices as a function of time with their probability density estimates.

Another neat consequence of drawing a completely new layer is that we can control the plotly.js attributes in that layer through the selected argument of the highlight() function. In Figure 16.6 we use it to ensure the new highlighting layer has some transparency, which makes it easier to compare the city-specific distribution to the overall distribution.

This section is designed to help give you a foundation for leveraging graphical queries in your own work. Hopefully by now you have a rough idea what graphical queries are, how they can be useful, and how to create them with highlight_key() and highlight(). Understanding the basic idea is one thing, but applying it effectively to new problems is another thing entirely. To help spark your imagination and demonstrate what’s possible, Section 16.4 has numerous subsections each with numerous examples of graphical queries in action.

## 16.2 Highlight versus filter events

Section 16.1 provides an overview of plotly’s framework for highlight events, but it also supports filter events. These events trigger slightly different logic:

• A highlight event dims the opacity of existing marks, then adds an additional graphical layer representing the selection.
• A filter event completely removes existing marks and rescales axes to the remaining data.

Figure 16.7 provides a quick visual depiction of the difference between filter and highlight events. At least currently, filter events must be fired from filter widgets from the crosstalk package, and these widgets expect an object of class SharedData as input. As it turns out, the highlight_key() function, introduced in Section 16.1, creates a SharedData instance and is essentially a wrapper for crosstalk::SharedData$new().

class(highlight_key(mtcars))
#> [1] "SharedData" "R6"

Figure 16.7 demonstrates the main difference in logic between filter and highlight events. Notice how, in the code implementation, the ‘querying variable’ definition for filter events is part of the filter widget. That is, city is defined as the variable of interest in filter_select(), not in the creation of tx. That is (intentionally) different from the approach for highlight events, where the ‘querying variable’ is a property of the dataset behind the graphical elements.

library(crosstalk)

# generally speaking, use a "unique" key for filter,
# especially when you have multiple filters!
tx <- highlight_key(txhousing)
gg <- ggplot(tx) + geom_line(aes(date, median, group = city))
filter <- bscols(
  filter_select("id", "Select a city", tx, ~city),
  ggplotly(gg, dynamicTicks = TRUE),
  widths = c(12, 12)
)

tx2 <- highlight_key(txhousing, ~city, "Select a city")
gg <- ggplot(tx2) + geom_line(aes(date, median, group = city))
select <- highlight(
  ggplotly(gg, tooltip = "city"),
  selectize = TRUE, persistent = TRUE
)

bscols(filter, select)

FIGURE 16.7: Comparing filter to highlight events. Filter events completely remove existing marks and rescale axes to the remaining data. For the interactive, see https://plotly-r.com/interactives/filter-highlight.html

When using multiple filter widgets to filter the same dataset, as done in Figure 16.8, you should avoid referencing a non-unique querying variable (i.e., key-column) in the SharedData object used to populate the filter widgets. Remember that the default behavior of highlight_key() and SharedData$new() is to use the row-index (which is unique). This ensures the intersection of multiple filtering widgets queries the correct subset of data.

library(crosstalk)
tx <- highlight_key(txhousing)
widgets <- bscols(
  widths = c(12, 12, 12),
  filter_select("city", "Cities", tx, ~city),
  filter_slider("sales", "Sales", tx, ~sales),
  filter_checkbox("year", "Years", tx, ~year, inline = TRUE)
)
bscols(
  widths = c(4, 8), widgets,
  plot_ly(tx, x = ~date, y = ~median, showlegend = FALSE) %>%
    add_lines(color = ~city, colors = "black")
)

FIGURE 16.8: Filtering on multiple variables. For the interactive, see https://plotly-r.com/interactives/multiple-filter-widgets.html
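To see why a unique key matters when several filters are combined, consider the following base R sketch. It is a simplified model that assumes each filter reports the set of keys passing it and that those sets are intersected; the toy data frame `d` and the filter conditions are made up for illustration.

```r
# Toy data (made up): two cities observed in two years.
d <- data.frame(
  city  = c("Austin", "Austin", "Dallas", "Dallas"),
  year  = c(2000, 2001, 2000, 2001),
  sales = c(100, 300, 250, 150)
)

# Two hypothetical filters, each flagging the rows that pass it.
passes_year  <- d$year == 2001
passes_sales <- d$sales >= 200

# Unique key (row index): the intersection is exactly the rows passing both.
row_key <- seq_len(nrow(d))
keep_unique <- intersect(row_key[passes_year], row_key[passes_sales])
d[row_key %in% keep_unique, ]

# Non-unique key (city): each filter reports city names, so the
# intersection keeps rows that fail one of the two filters.
city_key <- d$city
keep_city <- intersect(city_key[passes_year], city_key[passes_sales])
d[city_key %in% keep_city, ]
```

With the row index as the key, only the one row satisfying both conditions survives. With city as the key, every row survives, because each city has at least one row passing each filter.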

As Figure 16.9 demonstrates, filter and highlight events can work in conjunction with various htmlwidgets. In fact, since the semantics of filter are better defined than those of highlight, linking filter events across htmlwidgets via crosstalk should generally be better supported.

library(leaflet)

eqs <- highlight_key(quakes)
stations <- filter_slider("station", "Number of Stations", eqs, ~stations)

p <- plot_ly(eqs, x = ~depth, y = ~mag) %>%
  highlight("plotly_selected")

map <- leaflet(eqs) %>%
  addTiles() %>%
  addCircles()

bscols(
  widths = c(6, 6, 3),
  p, map, stations
)

FIGURE 16.9: Linking plotly and leaflet through both filter and highlight events. For the interactive, see https://plotly-r.com/interactives/plotly-leaflet-filter.html

When combining filter and highlight events, one (current) limitation to be aware of is that the highlighting variable has to be nested inside the filter variable(s). For example, in Figure 16.10, we can filter by continent and highlight by country, but there is currently no way to highlight by continent and filter by country.

library(gapminder)
g <- highlight_key(gapminder, ~country)
continent_filter <- filter_select("filter", "Select a continent", g, ~continent)

p <- plot_ly(g) %>%
  group_by(country) %>%
  add_lines(x = ~year, y = ~lifeExp, color = ~continent) %>%
  layout(xaxis = list(title = "")) %>%
  highlight(selected = attrs_selected(showlegend = FALSE))

bscols(continent_filter, p, widths = 12)

FIGURE 16.10: Combining filtering and highlighting with non-unique querying variables. For the interactive, see https://plotly-r.com/interactives/gapminder-filter-highlight.html
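The nesting requirement can be checked before building a linked view. The sketch below uses a hypothetical helper, `is_nested()`, which is not part of plotly or crosstalk, together with made-up toy data that mirrors the country and continent relationship in gapminder.

```r
# Hypothetical helper: TRUE when every value of `inner` maps to
# exactly one value of `outer`.
is_nested <- function(inner, outer) {
  pairs <- unique(data.frame(inner = inner, outer = outer))
  !any(duplicated(pairs$inner))
}

# Toy data (made up) mirroring the country/continent relationship.
d <- data.frame(
  country   = c("Chile", "Chile", "Peru", "Kenya", "Japan"),
  continent = c("Americas", "Americas", "Americas", "Africa", "Asia")
)

# country is nested within continent, so filtering by continent
# and highlighting by country is a valid combination.
country_in_continent <- is_nested(d$country, d$continent)

# continent is not nested within country, which corresponds to the
# unsupported combination.
continent_in_country <- is_nested(d$continent, d$country)

country_in_continent
continent_in_country
```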

The graphical querying framework (Section 16.1) works in tandem with key-frame animations (Section 14). Figure 16.11 extends Figure 14.1 by layering on linear models specific to each frame and specifying continent as a key variable. As a result, one may interactively highlight any continent they wish and track the relationship through the animation. In the animated version of Figure 16.11, the user highlights the Americas, which makes it much easier to see that the relationship between GDP per capita and life expectancy was very strong starting in the 1950s, but progressively weakened throughout the years.

g <- highlight_key(gapminder, ~continent)
gg <- ggplot(g, aes(gdpPercap, lifeExp, color = continent, frame = year)) +
  geom_point(aes(size = pop, ids = country)) +
  geom_smooth(se = FALSE, method = "lm") +
  scale_x_log10()
highlight(ggplotly(gg), "plotly_hover")

FIGURE 16.11: Highlighting the relationship between GDP per capita and life expectancy in the Americas and tracking that relationship through several decades. For the interactive, see https://plotly-r.com/interactives/gapminder-highlight-animation.html

In addition to highlighting objects within an animation, objects may also be linked between animations. Figure 16.12 links two animated views: on the left-hand side is population density by country and on the right-hand side is GDP per capita versus life expectancy. By default, all of the years are shown in black and the current year is shown in red. By pressing play to animate through the years, we can see that all three of these variables have increased (on average) fairly consistently over time. By linking the animated layers, we may condition on an interesting region of this data space to make comparisons in the overall relationship over time.

For example, in Figure 16.12, countries below the 50th percentile in terms of population density are highlighted in blue, then the animation is played again to reveal a fairly interesting difference between these groups. From 1952 to 1977, countries with a low population density seem to enjoy large increases in GDP per capita and moderate increases in life expectancy; then, in the early 1980s, their GDP seems to decrease while their life expectancy greatly increases. In comparison, the high-density countries seem to enjoy a more consistent and steady increase in both GDP and life expectancy. Of course, there are a handful of exceptions to the overall trend, such as the noticeable drop in life expectancy for a handful of countries during the nineties, which are mostly African countries feeling the effects of war.
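In the figure this split is made interactively by brushing, but the same grouping can be sketched numerically. The example below is a minimal illustration with made-up countries and densities; the column name `popDen` mirrors the population density column described for the gap dataset, and everything else is an assumption.

```r
# Toy data (made up): one population density value per country.
d <- data.frame(
  country = c("A", "B", "C", "D"),
  popDen  = c(3, 40, 120, 15)
)

# The 50th percentile (median) of population density.
cutoff <- median(d$popDen)

# Countries below the cutoff form the 'low density' group that
# would be brushed in the left-hand panel.
low_density <- d$country[d$popDen < cutoff]
low_density
```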

The gapminder data does not include a measure of population density, but the gap dataset (included with the plotlyBook R package) adds a column containing the population per square kilometer (popDen), which helps implement Figure 16.12. In order to link the animated layers (i.e., red points), we need another version of gap that marks the country variable as the link between the plots (gapKey).

data(gap, package = "plotlyBook")

gapKey <- highlight_key(gap, ~country)

p1 <- plot_ly(gap, y = ~country, x = ~popDen, hoverinfo = "x") %>%
  add_markers(alpha = 0.1, color = I("black")) %>%
  add_markers(data = gapKey, frame = ~year, ids = ~country, color = I("red")) %>%
  layout(xaxis = list(type = "log"))

p2 <- plot_ly(gap, x = ~gdpPercap, y = ~lifeExp, size = ~popDen,
              text = ~country, hoverinfo = "text") %>%
  add_markers(color = I("black"), alpha = 0.1) %>%
  add_markers(data = gapKey, frame = ~year, ids = ~country, color = I("red")) %>%
  layout(xaxis = list(type = "log"))

subplot(p1, p2, nrows = 1, widths = c(0.3, 0.7), titleX = TRUE) %>%
hide_legend() %>%
animation_opts(1000, redraw = FALSE) %>%
layout(hovermode = "y", margin = list(l = 100)) %>%
highlight("plotly_selected", color = "blue"