# 5 Bars & histograms

The `add_bars()` and `add_histogram()` functions wrap the bar and histogram plotly.js trace types. The main difference between them is that bar traces require bar heights (both `x` and `y`), whereas histogram traces require just a single variable, and plotly.js handles binning in the browser.^{15} And perhaps confusingly, both of these functions can be used to visualize the distribution of either a numeric or a discrete variable. So, essentially, the only difference between them is where the binning occurs.

Figure 5.1 compares the default binning algorithm in plotly.js to a few different algorithms available in R via the `hist()` function. Although plotly.js has the ability to customize histogram bins via `xbins`/`ybins`, R has diverse facilities for estimating the optimal number of bins in a histogram that we can easily leverage.^{16} The `hist()` function alone allows us to reference three famous algorithms by name (Sturges 1926; Freedman and Diaconis 1981; Scott 1979), but there are also packages (e.g. the **histogram** package) which extend this interface to incorporate more methodology (Mildenberger, Rozenholc, and Zasada 2009). The `price_hist()` function below wraps the `hist()` function to obtain the binning results, and maps those bins to a plotly version of the histogram using `add_bars()`.

```
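library(plotly)  # also attaches ggplot2, which provides the diamonds data

# binning performed in the browser by plotly.js
p1 <- plot_ly(diamonds, x = ~price) %>%
  add_histogram(name = "plotly.js")

# binning performed in R by hist(), then drawn as bars
price_hist <- function(method = "FD") {
  h <- hist(diamonds$price, breaks = method, plot = FALSE)
  plot_ly(x = h$mids, y = h$counts) %>% add_bars(name = method)
}

subplot(
  p1, price_hist(), price_hist("Sturges"), price_hist("Scott"),
  nrows = 4, shareX = TRUE
)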
```
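The text above notes that plotly.js bins can also be customized through `xbins`. As a minimal sketch, assuming the textbook Freedman-Diaconis bin width (2 × IQR / n^(1/3)) is an acceptable stand-in for what `hist()` computes internally, the width can be calculated in R and handed to plotly.js so that the binning itself still happens in the browser. The names `fd_width` and `p_fd` are illustrative and not part of the original example.

```r
library(plotly)  # attaches ggplot2, which provides the diamonds data

x <- diamonds$price

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(x) / length(x)^(1 / 3)

# Pass the width to plotly.js; the raw values are still binned in the browser
p_fd <- plot_ly(x = x) %>%
  add_histogram(
    name = "FD width via xbins",
    xbins = list(start = min(x), end = max(x), size = fd_width)
  )
p_fd
```

Because `hist()` rounds its break points to "pretty" values, the bars here should not be expected to match `price_hist("FD")` exactly.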

Figure 5.2 demonstrates two ways of creating a basic bar chart. Although the visual results are the same, it's worth noting the difference in implementation. The `add_histogram()` function sends all of the observed values to the browser and lets plotly.js perform the binning. It takes more human effort to perform the binning in R, but doing so has the benefit of sending less data and requiring less computational work from the web browser. In this case, we have only about 50,000 records, so there is not much of a difference in page load times or page size. However, with 1 million records, page load time more than doubles and page size nearly doubles.^{17}

```
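library(dplyr)

# binning in the browser: every value of cut is sent
p1 <- plot_ly(diamonds, x = ~cut) %>%
  add_histogram()

# binning in R: only the counts per cut are sent
p2 <- diamonds %>%
  count(cut) %>%
  plot_ly(x = ~cut, y = ~n) %>%
  add_bars()

subplot(p1, p2) %>% hide_legend()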
```
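To get a feel for why pre-binning sends less data, the following sketch compares how many values each approach has to serialize. It uses the length of a JSON string as a rough proxy only; the real page size also includes the plotly.js library, layout, and configuration, so these numbers are an assumption-laden illustration rather than a reproduction of the page tests cited in the text.

```r
library(dplyr)
library(ggplot2)   # only needed here for the diamonds data
library(jsonlite)

# add_histogram(): every observed value is serialized and sent
raw_json <- toJSON(as.character(diamonds$cut))

# add_bars(): only the pre-computed counts are sent
binned <- count(diamonds, cut)
binned_json <- toJSON(binned)

c(
  raw_values   = nrow(diamonds),
  binned_rows  = nrow(binned),
  raw_chars    = nchar(raw_json),
  binned_chars = nchar(binned_json)
)
```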

## 5.1 Multiple numeric distributions

It is often useful to see how the numeric distribution changes with respect to a discrete variable. When using bars to visualize multiple numeric distributions, I recommend plotting each distribution on its own axis using a small multiples display, rather than trying to overlay them on a single axis.^{18} Chapter 13, and specifically Section 13.1.2.3, discuss small multiples in more detail, but Figure 13.9 demonstrates how it can be done with `plot_ly()` and `subplot()`. Note how the `one_plot()` function defines what to display on each panel, then a split-apply-recombine (i.e., `split()`, `lapply()`, `subplot()`) strategy is employed to generate the trellis display.

```
one_plot <- function(d) {
  plot_ly(d, x = ~price) %>%
    add_annotations(
      ~unique(clarity), x = 0.5, y = 1,
      xref = "paper", yref = "paper", showarrow = FALSE
    )
}

diamonds %>%
  split(.$clarity) %>%
  lapply(one_plot) %>%
  subplot(nrows = 2, shareX = TRUE, titleX = FALSE) %>%
  hide_legend()
```
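For comparison with the small multiples display, here is a hedged sketch of overlaying several numeric distributions on a single axis with lines instead of bars. It assumes a default kernel density estimate from `density()` is an acceptable summary of each distribution; the names `dens`, `d`, and `p_lines` are illustrative.

```r
library(plotly)  # attaches ggplot2, which provides the diamonds data

# One kernel density estimate of price per clarity level
dens <- lapply(split(diamonds$price, diamonds$clarity), density)

# Stack the estimates into one long data frame, keeping the clarity ordering
d <- do.call(rbind, lapply(names(dens), function(nm) {
  data.frame(
    clarity = factor(nm, levels = levels(diamonds$clarity)),
    price = dens[[nm]]$x,
    density = dens[[nm]]$y
  )
}))

# One line per clarity level, all on the same axis
p_lines <- plot_ly(d, x = ~price, y = ~density, color = ~clarity) %>%
  add_lines()
p_lines
```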

## 5.2 Multiple discrete distributions

Visualizing multiple discrete distributions is difficult. The subtle complexity is due to the fact that both counts and proportions are important for understanding multi-variate discrete distributions. Figure 5.4 presents diamond counts, divided by both their cut and clarity, using a grouped bar chart.
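A grouped bar chart of this kind can be sketched as follows. This is one plausible way to build such a figure, not necessarily the code behind Figure 5.4: mapping `clarity` to `color` yields one trace per clarity level, and plotly.js places the bars side by side within each cut by default. The `table()` call tabulates the counts behind the bars for a single cut; the names `p_grouped` and `ideal` are illustrative.

```r
library(plotly)  # attaches ggplot2, which provides the diamonds data

# One histogram trace per clarity level; bars are grouped within each cut
p_grouped <- plot_ly(diamonds, x = ~cut, color = ~clarity) %>%
  add_histogram()
p_grouped

# Counts behind the bars for "Ideal" diamonds, largest first
ideal <- sort(
  table(diamonds$clarity[diamonds$cut == "Ideal"]),
  decreasing = TRUE
)
ideal
```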

Figure 5.4 is useful for comparing the number of diamonds by clarity, given a type of cut. For instance, within “Ideal” diamonds, a clarity of “VS2” is most popular, “SI1” is second most popular, and “I1” is the least popular. The distribution of clarity within “Ideal” diamonds seems to be fairly similar to other diamonds, but it’s hard to make this comparison using raw counts. Figure 5.5 makes this comparison easier by showing the relative frequency of diamonds by clarity, given a cut.

```
# number of diamonds by cut and clarity (n)
cc <- count(diamonds, cut, clarity)
# number of diamonds by cut (nn)
cc2 <- left_join(cc, count(cc, cut, wt = n, name = "nn"), by = "cut")
cc2 %>%
  mutate(prop = n / nn) %>%
  plot_ly(x = ~cut, y = ~prop, color = ~clarity) %>%
  add_bars() %>%
  layout(barmode = "stack")
```
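The same conditional proportions can also be computed with a grouped `mutate()` instead of a join. The sketch below is an alternative formulation under that assumption, with a sanity check that the proportions within each cut sum to 1, which is what makes every stacked bar the same height. The names `cc_prop` and `check` are illustrative.

```r
library(dplyr)
library(ggplot2)  # only needed here for the diamonds data

# Proportion of each clarity within a cut, computed with a grouped mutate
cc_prop <- count(diamonds, cut, clarity) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Each stacked bar should reach a total of 1
check <- summarise(group_by(cc_prop, cut), total = sum(prop))
check
```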

This type of plot, also known as a spine plot, is a special case of a mosaic plot. In a mosaic plot, you can scale both bar widths and heights according to discrete distributions. For mosaic plots, I recommend using the **ggmosaic** package (Jeppson, Hofmann, and Cook 2016), which implements a custom **ggplot2** geom designed for mosaic plots that we can convert to plotly via `ggplotly()`. Figure 5.6 shows a mosaic plot of cut by clarity. Notice how the bar widths are scaled proportionally to the cut frequency.

```
library(ggmosaic)
p <- ggplot(data = cc) +
  geom_mosaic(aes(weight = n, x = product(cut), fill = clarity))
ggplotly(p)
```
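The two quantities a mosaic plot encodes can be tabulated directly. The sketch below assumes the standard mosaic construction, in which bar widths follow the marginal distribution of `cut` and segment heights follow the distribution of `clarity` given `cut`; it does not inspect what **ggmosaic** computes internally. The names `tab`, `widths`, and `heights` are illustrative.

```r
library(ggplot2)  # only needed here for the diamonds data

tab <- table(cut = diamonds$cut, clarity = diamonds$clarity)

# Bar widths: marginal distribution of cut
widths <- prop.table(margin.table(tab, 1))

# Segment heights within each bar: distribution of clarity given cut
heights <- prop.table(tab, margin = 1)

round(widths, 3)
round(heights, 3)
```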

### References

Freedman, D., and P. Diaconis. 1981. “On the Histogram as a Density Estimator: L2 Theory.” *Zeitschrift Für Wahrscheinlichkeitstheorie Und Verwandte Gebiete* 57: 453–76.

Jeppson, Haley, Heike Hofmann, and Di Cook. 2016. *Ggmosaic: Mosaic Plots in the ’Ggplot2’ Framework*. http://github.com/haleyjeppson/ggmosaic.

Mildenberger, Thoralf, Yves Rozenholc, and David Zasada. 2009. *Histogram: Construction of Regular and Irregular Histograms with Different Options for Automatic Choice of Bins*. https://CRAN.R-project.org/package=histogram.

Scott, David W. 1979. “On Optimal and Data-Based Histograms.” *Biometrika* 66: 605–10.

Sturges, Herbert A. 1926. “The Choice of a Class Interval.” *Journal of the American Statistical Association* 21 (153): 65–66. https://doi.org/10.1080/01621459.1926.10502161.

^{15} As we’ll see in Section 16.1, and specifically Figure 16.6, using a ‘statistical’ trace type like `add_histogram()` enables statistical graphical queries.

^{16} Optimal in this context is the number of bins which minimizes the distance between the empirical histogram and the underlying density.

^{17} These tests were run on Google Chrome and loaded a page with a single bar chart. See https://www.webpagetest.org/result/160924_DP_JBX for `add_histogram()` and https://www.webpagetest.org/result/160924_QG_JA1 for `add_bars()`.

^{18} It’s much easier to visualize multiple numeric distributions on a single axis using lines.