7 2D frequencies

7.1 Rectangular binning in plotly.js

The plotly package provides two functions for displaying rectangular bins: add_heatmap() and add_histogram2d(). For numeric data, the add_heatmap() function is a 2D analog of add_bars() (bins must be pre-computed), and the add_histogram2d() function is a 2D analog of add_histogram() (bins can be computed in the browser). Thus, I recommend add_histogram2d() for exploratory purposes, since you don’t have to think about how to perform binning. It also provides a useful zsmooth attribute for effectively increasing the number of bins (currently, “best” performs a bi-linear interpolation, which blends the values of the nearest neighboring bins), and nbinsx/nbinsy attributes to set the number of bins in the x and/or y directions. Figure 7.1 compares three different uses of add_histogram2d(): (1) plotly.js’ default binning algorithm, (2) the default plus smoothing, (3) setting the number of bins in the x and y directions. It’s also worth noting that filled contours, instead of bins, can be used in any of these cases by using add_histogram2dcontour() instead of add_histogram2d().

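library(plotly)  # also attaches ggplot2, which provides the diamonds data

p <- plot_ly(diamonds, x = ~log(carat), y = ~log(price))
subplot(
  # (1) plotly.js' default binning algorithm
  add_histogram2d(p) %>%
    colorbar(title = "default") %>%
    layout(xaxis = list(title = "default")),
  # (2) the default binning plus smoothing
  add_histogram2d(p, zsmooth = "best") %>%
    colorbar(title = "zsmooth") %>%
    layout(xaxis = list(title = "zsmooth")),
  # (3) an explicit number of bins in each direction
  add_histogram2d(p, nbinsx = 60, nbinsy = 60) %>%
    colorbar(title = "nbins") %>%
    layout(xaxis = list(title = "nbins")),
  shareY = TRUE, titleX = TRUE
)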
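
To make the contrast with add_heatmap() concrete, here is a minimal sketch of pre-computing rectangular bins in R before handing them to plotly. The simulated data, the choice of 20 equal-width bins per axis, and all variable names are assumptions made for illustration; the plotting step is skipped if plotly is not installed.

```r
# Simulated data standing in for any two numeric variables (an assumption)
set.seed(42)
x <- rnorm(1000)
y <- x + rnorm(1000)

# Equal-width bin edges; 20 bins per axis is an arbitrary choice
n_bins <- 20
x_breaks <- seq(min(x), max(x), length.out = n_bins + 1)
y_breaks <- seq(min(y), max(y), length.out = n_bins + 1)

# Index of the bin each observation falls into (always between 1 and n_bins)
x_idx <- findInterval(x, x_breaks, all.inside = TRUE)
y_idx <- findInterval(y, y_breaks, all.inside = TRUE)

# Cross-tabulate, with y bins as rows and x bins as columns
counts <- table(
  factor(y_idx, levels = seq_len(n_bins)),
  factor(x_idx, levels = seq_len(n_bins))
)
z <- matrix(counts, nrow = n_bins)

# Bin midpoints, used to position the heatmap cells
x_mid <- head(x_breaks, -1) + diff(x_breaks) / 2
y_mid <- head(y_breaks, -1) + diff(y_breaks) / 2

# Rows of z are mapped to y and columns to x
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(x = x_mid, y = y_mid, z = z)
  p <- plotly::add_heatmap(p)
  p
}
```

The binning here is deliberately naive. The point is only that with add_heatmap() the counts in z must exist before plotting, whereas add_histogram2d() derives them from the raw x and y values.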

7.2 Rectangular binning in R

In Bars & histograms, we leveraged a number of algorithms in R for computing the “optimal” number of bins for a histogram, via hist(), and routed those results to add_bars(). There is a surprising lack of research and computational tools for the 2D analog, and among the research that does exist, solutions usually depend on characteristics of the unknown underlying distribution, so the typical approach is to assume a Gaussian form (Scott 1992). Practically speaking, that assumption is not very useful, but 2D kernel density estimation provides an alternative that tends to be more robust to changes in distributional form. Although kernel density estimation requires choosing a kernel and a bandwidth parameter, the kde2d() function from the MASS package provides a well-supported rule-of-thumb for estimating the bandwidth of a Gaussian kernel density (Venables and Ripley 2002). Figure 7.2 uses kde2d() to estimate a 2D density, scales the relative frequency to an absolute frequency, then uses the add_heatmap() function to display the results as a heatmap.

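kde_count <- function(x, y, ...) {
  # Number of observations, captured here because inside with(kde, ...)
  # the name 'x' refers to the grid returned by kde2d(), not the data
  n_obs <- length(x)
  kde <- MASS::kde2d(x, y, ...)
  df <- with(kde, setNames(expand.grid(x, y), c("x", "y")))
  # The 'z' returned by kde2d() is a density; multiplying by the area of a
  # grid cell gives a proportion, which we can then scale to a count
  df$count <- with(kde, c(z) * n_obs * diff(x)[1] * diff(y)[1])
  data.frame(df)
}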

kd <- with(diamonds, kde_count(log(carat), log(price), n = 30))
plot_ly(kd, x = ~x, y = ~y, z = ~count) %>%
  add_heatmap() %>%
  colorbar(title = "Number of diamonds")
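
A quick way to check the density-to-count scaling is to confirm that the estimated counts add up to roughly the number of observations. The sketch below does this on simulated data (an assumption, as are the variable names) and pads the grid limits by one bandwidth.nrd() bandwidth on each side so that very little kernel mass falls outside the grid.

```r
library(MASS)  # a recommended package that ships with R

# Simulated data standing in for any two numeric variables (an assumption)
set.seed(1)
x <- rnorm(500)
y <- x + rnorm(500)

# kde2d() uses bandwidth.nrd() as its default rule-of-thumb bandwidth;
# here it is only used to pad the grid limits beyond the data range
hx <- bandwidth.nrd(x)
hy <- bandwidth.nrd(y)
lims <- c(range(x) + c(-1, 1) * hx, range(y) + c(-1, 1) * hy)

kde <- kde2d(x, y, n = 50, lims = lims)

# Area of one grid cell (the grid is equally spaced in both directions)
cell_area <- diff(kde$x)[1] * diff(kde$y)[1]

# Density times cell area is a proportion; summed over the grid it should
# be close to 1
total_mass <- sum(kde$z) * cell_area

# Scaling by the number of observations turns proportions into counts
est_counts <- kde$z * length(x) * cell_area
sum(est_counts)
```

With kde2d()'s default limits, which span only the range of the data, some kernel mass near the edges is cut off, so the counts can add up to somewhat less than the number of observations.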

7.3 Categorical axes

The functions add_histogram2d(), add_histogram2dcontour(), and add_heatmap() all support categorical axes. Thus, add_histogram2d() can be used to easily display 2-way contingency tables, but since it’s easier to compare values along a common scale than to compare colors (Cleveland and McGill 1984), I recommend creating grouped bar charts instead. The add_heatmap() function can still be useful for categorical axes, however, as it allows us to display whatever quantity we want along the z axis (color).
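
As a minimal sketch of the grouped bar chart alternative, the code below tabulates two categorical variables and maps one of them to color. The built-in mtcars data and its cyl and gear columns are stand-ins chosen for illustration (an assumption, not the data used elsewhere in this chapter); the plotting step is skipped if plotly is not installed.

```r
# 2-way contingency table of two categorical variables
# (mtcars ships with R; cyl and gear are illustrative stand-ins)
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Long format: one row per cyl/gear combination, with the count in 'n'
counts <- as.data.frame(tab, responseName = "n")

# One group of bars per cyl value, one colored bar per gear value
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(counts, x = ~cyl, y = ~n, color = ~gear)
  p <- plotly::add_bars(p)
  p <- plotly::layout(p, barmode = "group")
  p
}
```

Because every count is drawn as a bar height on a shared y axis, combinations can be compared by position rather than by color.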

Figure 7.3 uses add_heatmap() to display a correlation matrix. Notice how the limits argument in the colorbar() function can be used to expand the limits of the color scale to reflect the range of possible correlations (something that is not easily done in plotly.js).

corr <- cor(dplyr::select_if(diamonds, is.numeric))
plot_ly(colors = "RdBu") %>%
  add_heatmap(x = rownames(corr), y = colnames(corr), z = corr) %>%
  colorbar(limits = c(-1, 1))

References

Cleveland, William S, and Robert McGill. 1984. “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Journal of the American Statistical Association 79 (September): 531–54.

Scott, David W. 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley & Sons.

Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth edition. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.