Extreme Scaling of Production Visualization Software on Diverse Architectures
This month’s issue of the IEEE Computer Graphics & Applications journal is dedicated to “Ultrascale Visualization”, all about visualizing massive datasets on some of the largest computers in the world. In particular, this issue contains an article about the massive “Trillion Zone” run of VisIt that we discussed a while back.
A series of experiments studied how visualization software scales to massive data sets. Although several paradigms exist for processing large data, the experiments focused on pure parallelism, the dominant approach for production software. The experiments used multiple visualization algorithms and ran on multiple architectures. They focused on massive-scale processing (16,000 or more cores and one trillion or more cells) and weak scaling. These experiments employed the largest data set sizes published to date in the visualization literature. The findings on scaling characteristics and bottlenecks will help researchers understand how pure parallelism performs at high levels of concurrency with very large data sets.
The paper is available from the IEEE Computer Society for $19, but I was lucky enough to get a review copy. Read on to see some details and my thoughts.
My first thought when hearing about this was “Why VisIt?” Being primarily a ParaView user myself, I wondered why it was left out. Something like this seems an ideal case to compare different tools, but that wasn’t the focus of the paper. From one of the opening paragraphs:
We performed our experiments using only a single visualization tool, VisIt, although we don’t believe this limits the impact of our results. We aimed to understand whether pure parallelism will work at extreme scale, not to compare tools. When a program succeeds, it validates the underlying technique. When a program fails, it might indicate a failing in the technique or a poor program implementation. Our principal findings here were that pure parallelism at an extreme scale worked, that algorithms such as contouring and rendering performed well, and that I/O times were very long. Therefore, the only issue requiring further study was I/O performance. We could have addressed this issue by studying other production visualization tools, but they would ultimately employ the same (or similar) low-level I/O calls, such as fread, that are themselves the key problem. So, rather than varying visualization tools, each of which follows the same I/O pattern, we varied the I/O patterns (that is, we used collective and noncollective I/O) and compared them across architectures and file systems.
Why VisIt specifically was chosen becomes clear when you look at the author line, which contains most of VisIt’s development staff.
So, the paper is primarily about the techniques, trials, and successes of “Ultrascale Visualization” across different architectures, and it does that job fantastically well. Their test setup is simple: extract and render an isosurface (via Marching Cubes) of a very large dataset, which winds up being the Supernova Simulation. They had originally hoped to use volume rendering (the source of many of the snazzy images shown regarding the work), but the opportunistic nature of time on these machines meant that by the time their volume-rendering algorithm was ready to go, their window of opportunity on some of the HPCs had passed.
On several different machines (listed in this article) they ran their test and broke it down into several instrumented times:
- Total I/O time (to load data)
- Contour Time (to extract/compute the isosurface)
- Rendering Time (to create the final 1024×1024 image)
- Total Pipeline Execution Time (all of the above)
When possible, they used multiple dataset sizes (ranging from 0.5 TeraCells to 2 TeraCells) and core counts (ranging from 8,000 to 32,000). Overall, they found several interesting things worthy of note for other application developers working in this space.
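Those ranges make sense once you remember these are weak-scaling runs: dataset size grows with core count, so the per-core workload stays roughly constant. A quick back-of-the-envelope check (the exact dataset-size/core-count pairings here are my own illustrative assumption, not taken from the paper):

```python
# Weak scaling: the dataset grows with the core count, so the
# per-core workload should stay roughly constant.
# (These pairings are illustrative, not quoted from the paper.)
configs = [
    (0.5e12, 8_000),   # 0.5 TeraCells on 8,000 cores
    (1.0e12, 16_000),  # 1 TeraCell on 16,000 cores
    (2.0e12, 32_000),  # 2 TeraCells on 32,000 cores
]

for cells, cores in configs:
    per_core = cells / cores
    print(f"{cells:.1e} cells / {cores:,} cores = {per_core:.2e} cells per core")
```

Each configuration works out to the same 62.5 million cells per core, which is exactly what makes the timing comparisons across machine sizes meaningful.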
First off (and perhaps most important) is the design of the machine’s I/O system. The I/O numbers vary widely across machines, and even within a machine at different file sizes. Most of the systems under consideration use Lustre, but the stripe counts varied on each machine. Systems with a small stripe count (2) performed better than ones with a higher count (4) because VisIt reads 10 separate files per core; higher stripe counts increased contention in the I/O subsystem, reducing overall performance.
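For context, Lustre striping is a per-file or per-directory setting controlled with the `lfs` tool. A sketch of inspecting and lowering the stripe count for a many-files-per-core workload like this one (the path is hypothetical):

```shell
# Inspect the current striping of a directory (path is hypothetical)
lfs getstripe /lustre/scratch/viz_data

# Files created under this directory from now on will be striped
# across 2 OSTs instead of the file system default
lfs setstripe -c 2 /lustre/scratch/viz_data
```

When each core already reads its own files, the parallelism comes from the file count, so spreading each small file across more storage targets mostly just adds contention.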
Second, also worthy of note, was the use of gzipped data files. The files are smaller on disk, and you can shave off some I/O time by trading it for CPU time (to decompress the data in memory), but compression ratios vary from file to file, which unbalances the load. The result is that file-read time scales nonlinearly as you add processors, and at very high core counts the compression can become a net loss.
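The tradeoff itself is easy to demonstrate with Python’s standard library: identical in-memory chunks can have wildly different on-disk sizes once gzipped, and it is exactly that variance that skews per-core read times. A minimal sketch (the “datasets” here are synthetic stand-ins):

```python
import gzip
import os
import tempfile

# Synthetic "simulation data": one chunk compresses well, one barely at all.
smooth = bytes(i % 16 for i in range(1 << 20))  # highly compressible, 1 MiB
noisy = os.urandom(1 << 20)                     # incompressible, 1 MiB

sizes = {}
with tempfile.TemporaryDirectory() as d:
    for name, data in (("smooth", smooth), ("noisy", noisy)):
        path = os.path.join(d, name + ".gz")
        with gzip.open(path, "wb") as f:
            f.write(data)
        sizes[name] = os.path.getsize(path)
        # Reading back trades disk bytes for CPU decompression time.
        with gzip.open(path, "rb") as f:
            assert f.read() == data

# Same in-memory size, very different on-disk size: the core that
# drew the noisy chunk spends far longer in I/O than its neighbors.
print(sizes["smooth"], sizes["noisy"])
```

With a few cores the imbalance averages out; with tens of thousands, the slowest reader sets the pace for everyone.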
One other item of interest comes from a file-system issue encountered on “Dawn”. The test was run several times on Dawn over the course of several months, and a noticeable drop in performance occurred during the summer. This wound up being due to the back-end file system becoming unbalanced and switching from round-robin file placement to a statistical (Poisson-distribution) placement policy, leaving some file servers responsible for 3x-5x as many files as others. While this levels out the storage imbalance, it hurts end-user access times by overloading some servers.
They also ran tests of various data structures. While the original dataset was fixed in size, how you grow it from 0.5 TeraCells to 2 TeraCells can affect the result. They tried both simply resampling the dataset and “tiling” it within the space, and found that there was a difference in performance, but a surprisingly small one. I/O time and contour time didn’t change much, but the rendering time doubled (which makes sense, as the number of triangles doubles as well).
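The two ways of growing a fixed dataset can be sketched with NumPy (the array sizes here are tiny stand-ins for the paper’s teracell meshes, and nearest-neighbor upsampling stands in for whatever interpolation they actually used):

```python
import numpy as np

# A small stand-in for the original fixed-size dataset.
base = np.random.rand(8, 8, 8)

# "Tiling": repeat the dataset in space, so its features (and hence
# the isosurface triangles they generate) repeat along with it.
tiled = np.tile(base, (2, 2, 2))

# "Resampling": put the same features onto a finer grid.
# Nearest-neighbor upsampling via repeat() is the simplest sketch.
resampled = base.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

# Both grow the cell count 8x, but tiling multiplies the feature
# count while resampling keeps a single copy of each feature.
print(base.shape, tiled.shape, resampled.shape)
```

That difference in feature count is why the rendering stage is the one that notices which method you picked, while the I/O and contour stages mostly just see "more cells".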
Later in the paper they discuss some of the specific VisIt issues they encountered. Perhaps the most important thing for future application developers to note is their findings regarding dynamically loaded libraries on high-concurrency systems. I’ll let their paper do the talking again:
During our first runs on Dawn, using only 4,096 cores, we observed lags in start-up time that worsened as the core count increased. Each core was reading plug-in information from the file system, creating contention for I/O resources. We addressed this problem by modifying VisIt’s plug-in infrastructure so that plug-in information could be loaded on MPI task 0 and broadcast to other cores. This change made plug-in loading nine times faster.
That said, start-up time was still slow, taking as long as five minutes. VisIt uses shared libraries in many instances to let new plug-ins access symbols not used by current VisIt routines; compiling statically would remove these symbols. The likely path forward is to compile static versions of VisIt for the high-concurrency case. This approach will likely be palatable because new plug-ins are frequently developed at lower levels of concurrency.
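The fix they describe is the classic read-on-rank-0-and-broadcast pattern. A sketch of that pattern in Python; the call shape mirrors mpi4py’s `comm.bcast`, but the communicator here is a minimal fake so the sketch runs without MPI, and the plug-in metadata is invented:

```python
def load_plugin_info(comm, read_from_disk):
    """Only rank 0 touches the file system; every other rank gets the
    result via broadcast instead of hammering the metadata servers."""
    info = read_from_disk() if comm.rank == 0 else None
    return comm.bcast(info, root=0)

# Minimal stand-in communicator so the sketch runs without MPI.
# With mpi4py you would pass MPI.COMM_WORLD instead.
class FakeComm:
    def __init__(self, rank, shared):
        self.rank = rank
        self._shared = shared  # stands in for the interconnect

    def bcast(self, obj, root=0):
        if self.rank == root:
            self._shared["payload"] = obj
        return self._shared["payload"]

shared = {}
disk_reads = []

def read_from_disk():
    disk_reads.append(1)  # count how often the file system is touched
    return {"plugins": ["Contour", "Volume"]}  # invented metadata

# "Run" four ranks: rank 0 reads once, the rest receive the broadcast.
results = [load_plugin_info(FakeComm(r, shared), read_from_disk)
           for r in range(4)]
print(len(disk_reads), results[0])
```

With 4,096 real cores the difference is 4,096 simultaneous reads of the same tiny files versus one read plus a broadcast over the interconnect, which is exactly where their 9x speedup came from.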
The push for static libraries has been brewing for a few years now. If you’ve worked in the HPC space recently, you undoubtedly have heard of the “Cray Catamount” debacle: a special mini-OS running on the back-end of the machine that required you to use statically linked applications. It wound up being a headache for users, and eventually forced Cray to switch to the newer CNL (Compute Node Linux) environment. However, if the end result is that you spend 10% of your operational allocation simply trying to load your application into memory, perhaps they were headed the right way in the first place.
The conclusion of the paper does a great job of summing it up. I won’t steal all of their thunder, but this one paragraph is worth sharing:
Our results demonstrate that pure parallelism does scale but is only as good as its supporting I/O infrastructure. We successfully visualized up to four trillion cells on diverse architectures with production visualization software. The supercomputers we used were “underpowered,” in that the current simulation codes on these machines produce meshes far smaller than a trillion cells. They were appropriately sized, however, when considering the rule of thumb that the visualization task should get 10 percent of the simulation task’s resources and assuming our trillion-cell mesh represents the simulation of a hypothetical 160,000-core machine.
Anyone working in visualization knows that I/O is the problem. This paper could become the seminal work on exactly how bad of a problem it is.
Modern supercomputers are designed with I/O systems that meet simulation researchers’ needs. In that respect, they are designed for a simulation that runs for a very long time and occasionally writes files to disk. Those writes are few and far between, but each one can be very large.
Visualization requirements are exactly the opposite. We need to read very large data from disk, and read it fast. Not only that, but we do it a lot; some would say constantly. If you want to, as they do in this paper, extract an isosurface of a teracell dataset, then the computation is the trivial part. The slow part is reading the data.
I’d like to thank Hank Childs, David Pugmire, Sean Ahern, and Wes Bethel (in the interest of disclosure, I know all four of these great guys), as well as the other researchers involved, for the incredible amount of work put into this project. The information they present has been known in the visualization community for several years (we need more I/O bandwidth!) but poorly communicated to people outside that community. This paper is an excellent first step in bringing those issues to the forefront for the next generation of HPC resources.
This paper is great, but it’s only one of several great papers in this issue of IEEE CG&A. You can access a bit of it for free via their website, and there are some other great papers in there regarding large-scale in-situ visualization and interactive collaboration. If you work in scientific visualization, you’re doing yourself (and your researchers) a disservice by not picking this issue up.