What’s the Big Deal with CUDA and GPGPU anyway?

CUDA, GPGPU, Tesla, OpenCL... Every day another vendor or software company touts the benefits of GPU-accelerated computing. But why? Video cards have been around for years, so what has happened in the last few years to make GPU computing so lucrative and desirable, not just for consumers but for the big players in High Performance Computing?

Come on inside for VizWorld’s feature article on why everyone is buzzing about GPGPU.

Why GPGPU Is Winning

The most powerful way to see the impact of GPGPU (General-Purpose computing on Graphics Processing Units) is to start big. We’ll start at the top of the “Computer Food Chain”: the supercomputer guys. Large research institutions (colleges, companies, governments) need large-scale supercomputing power to solve today’s toughest problems. The bigger the problem, the bigger the machine. These machines work on problems like biomolecular protein folding for medical companies, they simulate explosions and nuclear weapons for the US military, and they drive the financial transactions that keep money flowing around the world.

And these machines are big, both physically and computationally. Take, for example, the TACC Ranger:

Number of Nodes: 3,936
Number of Cores: 62,976
Peak Performance: 579.4 TFlops
Number of Racks: 82

So in 82 racks packed full of equipment, they get 580 TeraFlops of computational horsepower, based on AMD Opteron processors. Not too shabby. How many of NVidia’s new Tesla units would it take to reach 580 TeraFlops?

1 Tesla S1070 = 4 TeraFlops
145 Tesla S1070s = 580 TeraFlops
42 Tesla S1070s in 1 Rack
Number of Racks: 3.5

So we can “compress” the TACC system down to 4 cabinets, with space left over, right? Well, not really. These 4 cabinets have immense computing power, sure, but a Tesla is not a standalone machine: each unit still needs a host computer to talk to, albeit not a terribly powerful one. Go with a “worst case” of an extra 1U computer for each 1U Tesla, and we’re looking at 7 racks, plus a bit more for networking and storage. With the extra computers added to the Teslas, we’ve actually got MORE TeraFlops than Ranger in roughly one-tenth the space.
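The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a sketch of the article's arithmetic: the constants are the figures quoted above, and the function name `racks_needed` and the one-host-per-Tesla assumption are illustrative, not a real sizing tool.

```python
import math

# Figures quoted in the article.
RANGER_PEAK_TFLOPS = 579.4
RANGER_RACKS = 82
TESLA_S1070_TFLOPS = 4.0   # per 1U Tesla S1070, as quoted
UNITS_PER_RACK = 42        # 1U slots in one rack, as quoted


def racks_needed(target_tflops, host_per_tesla=True):
    """Return (tesla_units, racks) needed to reach target_tflops.

    host_per_tesla=True models the article's "worst case":
    one extra 1U host computer for every 1U Tesla.
    """
    teslas = math.ceil(target_tflops / TESLA_S1070_TFLOPS)
    slots = teslas * (2 if host_per_tesla else 1)
    return teslas, slots / UNITS_PER_RACK


teslas, racks = racks_needed(RANGER_PEAK_TFLOPS)
print(f"{teslas} Teslas, {racks:.1f} racks, "
      f"{racks / RANGER_RACKS:.1%} of Ranger's rack count")
```

Calling `racks_needed(RANGER_PEAK_TFLOPS, host_per_tesla=False)` gives the Tesla-only figure of about 3.5 racks; networking and storage are left out of both estimates.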

Flip the coin to the other side: Personal Computers. Gamers, housewives, PowerPoint jockeys, they all use computers every day, and every one of those computers has a video card to drive its monitor. So what if that existing video card could accelerate regular operations when it wasn’t rendering the latest Movie, Game, or 3D Transition? Most computers these days still run on ancient Intel integrated graphics chips, but a new breed of chips is coming out that could change this. Intel’s Larrabee, and NVidia & AMD’s competing products, have the potential to put a GPGPU-capable chipset in every computer in the world.

So GPGPU can shrink an 82-cabinet supercomputer to 8 or 9 cabinets, about 10% of the current floor space, with similar savings in power and cooling. On the desktop, GPGPU could re-use hardware already inside your computer to boost performance without taxing the rest of your system. Who wouldn’t want this? Less hardware to maintain, fewer points of failure, less power and cooling to pay for: it’s a win-win all around, right? Not exactly.

The Problems with GPGPU

As much as computer scientists have tried to make it so, GPGPU is not magic fairy dust that can make anything and everything faster. GPUs are designed for pushing pixels, not crunching numbers. The fact that they can crunch numbers is almost an accidental by-product of their main purpose, discovered several years ago and only recently becoming something worth investigating.

Consider the following: In an OpenGL scene, you give 3-dimensional triangle coordinates (the 3 points of the triangle) to the GPU. The GPU then translates these 3D coordinates into 2D screen coordinates, transforming for the viewing angle, and then determines which interior pixels of the triangle to activate. Add in extra math to handle lighting & textures, and you can see that a lot of somewhat trivial math is happening: matrix transforms for the projections, interpolations, basic lighting calculations, etc. If you could massage your desired math into looking somewhat like this math, you could repurpose the hardware for it. That’s exactly what the first GPGPU programs did.
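To give a feel for the kind of "somewhat trivial math" involved, here is a minimal Python sketch of the 3D-to-2D step for one triangle. It is a simplified perspective projection, not what OpenGL or any driver literally runs; the function name `project`, the focal length, and the 640x480 screen size are all assumptions made for illustration.

```python
def project(vertex, focal_length=1.0, width=640, height=480):
    """Perspective-project a camera-space 3D point to 2D pixel coordinates.

    Simplified sketch: assumes the point is in front of the camera (z > 0)
    and ignores clipping and the full 4x4 matrix pipeline.
    """
    x, y, z = vertex
    # Perspective divide: points farther away (larger z) move toward the center.
    ndc_x = focal_length * x / z
    ndc_y = focal_length * y / z
    # Map the [-1, 1] range onto the pixel grid (pixel y grows downward).
    px = (ndc_x + 1.0) * 0.5 * width
    py = (1.0 - ndc_y) * 0.5 * height
    return px, py


# The 3 points of one triangle, in camera space.
triangle = [(-1.0, -1.0, 2.0), (1.0, -1.0, 2.0), (0.0, 1.0, 4.0)]

# The same small calculation runs on every vertex independently, which is
# why graphics hardware can process huge numbers of them at once.
screen = [project(v) for v in triangle]
print(screen)
```

Any computation that looks like "apply the same few multiplies and adds to a large batch of independent values" has the same shape, which is what made the repurposing possible.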

Anyone looking at GPGPU code 5 years ago would have thought it was plain graphics code. Code would load bizarre scrambled textures into the GPU, draw full-screen triangles and move them around, read the framebuffer back into main memory, and repeat. It didn’t make much sense to watch, but to people familiar with the internals it was a wonder of modern computing. You could massage your input data into 2D textures and render them on the screen, render other textures on top, play with the GPU’s texture-blending and lighting operations, and then read the results back. Tada, you’ve just run programs on the GPU. There are, however, problems:

The slowest part of the system is loading your data to the GPU, and reading it back from the GPU.
No support for Branching Operations or Conditionals, or very limited support at best.
The language itself is incomprehensible.

The third point was the main roadblock for most programmers: the people who needed the benefits simply weren’t used to thinking in terms of textures and full-screen triangles. NVidia saw this problem, and has been working for the last several years on alternatives like Cg and CUDA, and most recently on support for OpenCL. These languages and libraries, combined with changes to NVidia’s hardware, make writing GPU programs not much different from writing traditional programs.

The first two points are still a problem, though. Transferring data between the GPU and Main Memory is the biggest hindrance to GPU programs. This means that your algorithms need to be structured so that all of your data can be thrown into the GPU, left to swirl around as long as possible, and then read back as an answer. Algorithms that compute a number and then use it in the next computation, in a recursive fashion, simply don’t work well on a GPU. This hindrance has led to a huge refactoring of classic algorithms into new parallel-friendly forms. One fun indication of how many people have been working on this is a quick search for “parallel FFT” (Fast Fourier Transform, the foundation of many audio and image processing routines), which turns up 140,000 papers in Google Scholar.
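As a small illustration of what "refactoring into a parallel-friendly form" can mean, consider a running total. The plain-Python sketch below runs on the CPU and only mimics the structure of the two approaches; the function names are made up for this example, and the second version follows the well-known Hillis-Steele scan pattern rather than any particular GPU library.

```python
def running_total_sequential(values):
    """Classic loop: each result needs the one before it, so nothing overlaps."""
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out


def running_total_parallel_friendly(values):
    """Hillis-Steele style scan: about log2(n) passes over the data.

    Within one pass, every element is computed only from the previous
    pass's values, so on a GPU each element could get its own thread.
    """
    out = list(values)
    step = 1
    while step < len(out):
        out = [out[i] + out[i - step] if i >= step else out[i]
               for i in range(len(out))]
        step *= 2
    return out


data = [1, 2, 3, 4, 5]
print(running_total_sequential(data))
print(running_total_parallel_friendly(data))
```

The parallel-friendly version does more total additions than the simple loop, but it trades them for passes in which all elements are independent, and the data can stay on the GPU between passes instead of bouncing back to main memory.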

Such a problem means that mainstream GPU acceleration will be slow in coming. While it can be seen at the consumer level already, its use is limited to the few areas where research has shown its value, primarily Video Encoding/Decoding. Adobe Premiere, Nero Move-it, and others are using CUDA and other GPGPU technologies to vastly improve video encoding and decoding times. Some image editing tools like Photoshop are also using it for filters and such, which boil down to large numbers of independent multiply-and-add operations (convolutions and matrix multiplies) and are perfectly suited for GPU programming. Other applications, like Spell Checking or BitTorrent, have been slower to find ways to improve performance. Computers are still throttled by IO limitations when writing to the Hard-Drive, and that’s increasingly becoming the main bottleneck for computational performance, as eventually your results need to be saved to disk.
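To see why image filters map so well onto a GPU, here is a minimal pure-Python sketch of a 3x3 filter on a tiny grayscale image. It is not how Photoshop or any real product implements its filters; `convolve3x3`, the box-blur kernel, and the sample image are illustrative assumptions.

```python
def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to a 2D grayscale image given as a list of rows.

    Border pixels are copied unchanged to keep the sketch short. The kernel
    is applied without flipping, which makes no difference for symmetric
    kernels such as the box blur below.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Each output pixel reads only the ORIGINAL image, so every
            # (y, x) could be computed by a separate GPU thread.
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    acc += kernel[ky][kx] * image[y + ky - 1][x + kx - 1]
            out[y][x] = acc
    return out


BOX_BLUR = [[1 / 9] * 3 for _ in range(3)]

image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]

blurred = convolve3x3(image, BOX_BLUR)
for row in blurred:
    print(row)
```

The two outer loops are exactly the part a GPU would spread across its cores: the work per pixel is tiny, identical everywhere, and has no dependence on neighboring results.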

Where do we go from here?

This is the million dollar question. In fact, Intel has made it a $12 million question with their new Visual Computing Institute in Germany that opened last week. Focusing on hardware and software algorithms to make GPU-accelerated computing more widespread, Intel has placed a huge bet that their new Larrabee chip can give them a door into the upcoming GPU-accelerated computing revolution. NVidia and ATI (now AMD) have had a long-fought duel over who will win, and only now are we beginning to see a serious third player arrive in Intel.

OpenCL is also emerging as a standard library to provide cross-platform support for GPGPU algorithms. Currently most solutions are restricted to only a single GPU-vendor, but OpenCL promises to do for GPU-computing what OpenGL did for GPU-rendering. By providing a standard API that can be implemented and supported by multiple vendors, costs should fall for the consumer and opportunities rise for developers. Also, each vendor will undoubtedly implement custom proprietary extensions to the spec (just as they did with OpenGL) allowing innovations to direct and steer the API where the users and developers want it.

With a cross-vendor API and improved algorithms, the last hurdle is simply widespread adoption. Unknown to many, Intel currently holds the title of #1 graphics chip vendor in the world. How, you ask? The Intel integrated graphics chip is built into so many low-end PC motherboards that it has essentially become “free” to manufacturers, and millions of them have been deployed worldwide. If all you plan to do is surf the net or use Microsoft Word, a graphics card is not a priority item. If the “big 3” graphics vendors (AMD, NVidia, & Intel) release a comparable chip that has OpenCL compatibility and that can unseat this ancient chipset, then we could see OpenCL-compatible systems become the norm rather than the preserve of a few specialized users.