24 Improving performance
Recall from Figure 2.5 that when you print a plotly object (or really any plot), there are two classes of performance to be aware of: print time (i.e., build time) and run time (i.e., render time). Build time is the time it takes for the object to be serialized as JSON/HTML, whereas run time is the time it takes for the browser to render that HTML into a webpage. In the case of plotly, there are two quick and easy things you can do to improve run-time performance in any context:
toWebGL(): This function attempts to render the chart using WebGL (via a Canvas element) instead of Scalable Vector Graphics (SVG). The difference between these contexts is somewhat analogous to the difference between saving a static chart to png/jpeg (pixel-based) versus pdf (vector-based). Vector-based graphics have the desirable property of producing sharp visuals that scale well to any display size, but they don't scale well in the number of vectors (e.g., points, lines, polygons, etc.) that they need to render.
partial_bundle(): This function attempts to reduce the size of the plotly.js bundle used to render the plotly graphs. The size of the default (i.e., main) plotly.js bundle is about 3MB, which can take a considerable amount of time to download over a slow internet connection, potentially leading to noticeable lag in initial page load for consumers of the graph. As it turns out, the main bundle is not always necessary to render every graph on a given website, so plotly.js provides partial bundles that can render certain subsets of the graphing library. For instance, if you only need scatter, bar, or pie trace types, you can get away with the basic bundle, which is currently under 1MB in size. This function is always safe to use when rendering a single plotly graph in a web page, but when rendering multiple graphs, you should take care not to include multiple bundles in the same page.
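As a minimal sketch of both helpers in action (the dataset here is just an illustrative stand-in; any plot with many points benefits similarly), chain them onto an existing plotly object:

```r
library(plotly)

# Scatterplot with ~54,000 points: render via WebGL instead of SVG,
# and ship a smaller plotly.js bundle (partial_bundle() picks the
# smallest bundle that supports the trace types used, by default)
plot_ly(ggplot2::diamonds, x = ~carat, y = ~price) %>%
  add_markers(alpha = 0.1) %>%
  toWebGL() %>%
  partial_bundle()
```

Since both functions take a plotly object and return a modified plotly object, they compose naturally at the end of any `%>%` pipeline.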
These two options may improve run-time performance without much of any thinking, but sometimes it's worth being more thoughtful about your visualization strategy by leveraging summaries (e.g., Section 13.3, Figure 17.27, Figure 22.1) as well as being more explicit about how a graph responds to changes in the underlying data (e.g., Section 17.3.1). Mastering these broader and more complex subjects is critical for scaling interactive visualizations to truly large data38, especially in the case of linking multiple views, where computational 'tricks' such as intelligently pre-aggregating distributive (e.g., min, max, sum, count) and algebraic (e.g., mean, var, etc.) statistics are a hallmark of systems that enable real-time graphical queries of massive datasets (Liu, Jiang, and Heer 2013; Lins, Klosowski, and Scheidegger 2013; Moritz, Howe, and Heer 2019). As Wickham (2013) points out, it's also important to consider the uncertainty in these computationally efficient statistics, as they aren't nearly as statistically robust as their holistic counterparts (e.g., mean vs median) that are more computationally intensive.
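As a hypothetical sketch of the pre-aggregation idea (using dplyr, which this book assumes elsewhere), summarising with a distributive statistic in R before plotting means the browser receives only the aggregates, not every raw observation:

```r
library(plotly)
library(dplyr)

# Pre-aggregate in R (cheap, scales to many rows), then plot the summary:
# the browser receives one bar per cut, not ~54,000 individual points
ggplot2::diamonds %>%
  group_by(cut) %>%
  summarise(avg_price = mean(price)) %>%
  plot_ly(x = ~cut, y = ~avg_price) %>%
  add_bars()
```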
Since latency in interactive graphics is known to make exploratory data analysis a more challenging task (Liu and Heer 2014), systems that optimize run-time over build-time performance are typically preferable. This is especially true for visualizations that others are consuming, but in a typical EDA context, where the person creating the visualization is the main consumer, build-time performance is also an important factor because it presents a hurdle to the analytical thought process. It's hard to give general advice on improving build-time performance, but a great first step is to profile the speed of your R code with something like the profvis package. This will at least let you know whether the slowness you're experiencing is due to your own R code.
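As a sketch of that first step, wrap your plot-building code in `profvis::profvis()` to see where build time is spent (here `plotly_build()` triggers the build step explicitly, so it shows up in the profile rather than being deferred to print time):

```r
library(profvis)
library(plotly)

profvis({
  p <- plot_ly(ggplot2::diamonds, x = ~carat, y = ~price) %>%
    add_markers(alpha = 0.1)
  # force the (potentially slow) JSON build step so it is profiled
  b <- plotly_build(p)
})
```

The interactive flame graph that profvis produces attributes elapsed time to individual R calls, which makes it clear whether the bottleneck is your own data manipulation or the plotly build itself.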
Liu, Zhicheng, Biye Jiang, and Jeffrey Heer. 2013. "imMens: Real-Time Visual Querying of Big Data." Computer Graphics Forum (Proc. EuroVis) 32 (3). http://idl.cs.washington.edu/papers/immens.
Lins, Lauro, James T. Klosowski, and Carlos Scheidegger. 2013. "Nanocubes for Real-Time Exploration of Spatiotemporal Datasets." IEEE Transactions on Visualization and Computer Graphics.
Moritz, Dominik, Bill Howe, and Jeffrey Heer. 2019. "Falcon: Balancing Interactive Latency and Resolution Sensitivity for Scalable Linked Visualizations." In ACM Human Factors in Computing Systems (CHI). http://idl.cs.washington.edu/papers/falcon.
Wickham, Hadley. 2013. "Bin-Summarise-Smooth: A Framework for Visualising Large Data." had.co.nz.
Liu, Zhicheng, and Jeffrey Heer. 2014. "The Effects of Interactive Latency on Exploratory Visual Analysis." IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis). http://idl.cs.washington.edu/papers/latency.
Large data means different things to different people at different times. At the time of writing, I'd consider hundreds of millions of observations with at least a handful of variables to be large data.↩