Visualizing System Latency with Heatmaps
An interesting article in ACM Queue is “Visualizing System Latency” by Brendan Gregg. Analyzing system latency has traditionally been hard to do, but it is getting easier thanks to better hardware and instrumented software systems. However, dealing with the mountains of data that come from these systems can be overwhelming. He proposes a heat map in which the two axes indicate time and latency, with pixel intensity mapped to I/O counts (for analyzing latency from I/O operations). His description of the image above:
In this screenshot, a panel is displayed to the left of the heat map to show average IOPS counts. Above and below the panel the “Range average:” and “8494 ops per second” show the average NFS I/O per second for the visible time range (x-axis). Within the panel are averages for latency ranges, the first showing an average of 2,006 NFS IOPS between 0 and 333 µs. Each of these latency ranges corresponds to a row of pixels on the heat map.
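To make the row-and-column idea concrete, here is a minimal sketch of the binning step behind such a heat map. It is not Gregg's implementation: the function name `bin_latency`, the one-second column width, and the sample events are all assumptions for illustration. Only the 333 µs row height comes from the description above.

```python
from collections import Counter


def bin_latency(events, time_step_s, latency_step_us):
    """Count I/O events per (time column, latency row) bucket.

    events: iterable of (timestamp_seconds, latency_microseconds) pairs.
    Returns a Counter keyed by (column, row); each count is the value a
    heat map would render as pixel intensity.
    """
    counts = Counter()
    for timestamp_s, latency_us in events:
        column = int(timestamp_s // time_step_s)   # x-axis: time
        row = int(latency_us // latency_step_us)   # y-axis: latency range
        counts[(column, row)] += 1
    return counts


# Hypothetical sample data: (seconds since start, latency in microseconds).
events = [(0.2, 120), (0.7, 310), (0.9, 2400), (1.1, 90), (1.5, 2600), (1.8, 150)]

# Assumed bucket sizes: one column per second, one row per 333 µs.
grid = bin_latency(events, time_step_s=1.0, latency_step_us=333)

for (column, row), count in sorted(grid.items()):
    print(f"second {column}, latency row {row}: {count} I/O")
```

A real tool would then map each count to a color and draw one pixel (or block of pixels) per bucket. The binning is the part that keeps the data volume manageable.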
The left half of the image shows I/O operations when using NFS with nothing but DRAM and disk. You see several I/O operations at the very bottom (the dark line), which are served from data stored in memory, and several more starting at about the 2 ms mark, which reflect the disk and NFS latency. The right half of the map shows the same test with an extra layer of SSD cache added. The result is obvious: latency dropped significantly, and it’s easy to see that from the graph.
This heat map shows that a flash-memory-based cache had reduced latency for I/O that would otherwise be served from disk. All three system components were visualized, with their latency ranges and the distribution of latency within that range. It also shows that disk I/O still occurs, although at a reduced rate. This is all useful information provided by the heat-map visualization. Imagine presenting this data as a line graph of average latency instead: the only information visible would be a small reduction in average latency when the cache was enabled (small since the average would be dominated by the high number of DRAM hits).
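The point about averages can be illustrated with a small sketch. The latencies below are made up for illustration and are not measurements from the article. They assume a workload dominated by DRAM hits, with a few slow disk I/Os that mostly move to a faster cache tier once it is enabled.

```python
from collections import Counter
from statistics import mean

BUCKET_US = 333  # latency bucket height in microseconds (assumed)


def summarize(latencies_us):
    """Return (mean latency, Counter of I/O counts per latency bucket)."""
    buckets = Counter(int(latency // BUCKET_US) for latency in latencies_us)
    return mean(latencies_us), buckets


# Made-up latencies in microseconds: mostly DRAM hits at 100 µs,
# a few disk I/Os at 3000 µs, and (with the cache) some hits at 500 µs.
without_cache = [100] * 990 + [3000] * 10
with_cache = [100] * 990 + [500] * 8 + [3000] * 2

mean_before, buckets_before = summarize(without_cache)
mean_after, buckets_after = summarize(with_cache)

print("mean without cache:", mean_before, "buckets:", dict(buckets_before))
print("mean with cache:   ", mean_after, "buckets:", dict(buckets_after))
```

With these numbers the mean moves only a little, because the DRAM hits dominate it. The per-bucket counts show what actually changed: most of the slow I/O moved from the highest bucket into a much lower one, while a small amount of disk I/O remains.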
His paper proposes some alternative tests and visualization techniques for I/O latency, including his “Icy Lake” visualization. Some great work here.