Multicore CPUs can Match GPUs for FLOP-heavy Applications?

A research paper from IBM analyzes a FLOP-intensive algorithm (dot products and additions of 2D matrices of single-precision floating-point values) and finds that the CPU versions can match or even beat the GPU version. I’ll let the abstract do the talking (although I’ve reformatted it for readability).

We implement this algorithm on a nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization.
The performance of this algorithm on the nVidia GPU suffers from:
a smaller shared memory,
unaligned device memory access patterns,
expensive atomic operations, and
weaker single-thread performance.
On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version.
The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.
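To put the quoted run times in perspective, here is a minimal sketch that converts them into speed relative to the GPU. The three times are the ones given in the abstract; the names `best_times` and `speedup_over` are only illustrative and do not come from the paper.

```python
# Best run times quoted in the abstract, in seconds.
best_times = {"Power7": 1.02, "Nehalem": 1.82, "GTX 285": 1.75}

def speedup_over(baseline, times):
    """Return each platform's speed relative to the baseline platform."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}

relative = speedup_over("GTX 285", best_times)

# Print the fastest platform first.
for name, factor in sorted(relative.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {factor:.2f}x the speed of the GTX 285")
```

By this measure the Power7 comes out roughly 1.7 times as fast as the GTX 285, while the Nehalem lands within about 4% of the GPU, which is what "match or even beat" amounts to here.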

I'm not really surprised to see the Power7 perform the best (given that it’s IBM’s chip and IBM engineers are at the helm here). If you read the paper, you’ll see that they use a 500×500 image (4MB in size) with matrices that require 250KB of space (page 6). That won’t fit into the GTX 285's small on-chip shared memory, so the GPU version spends a lot of time moving data in and out.
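A rough back-of-the-envelope check of that fit argument can be sketched as follows. The working-set sizes are the ones quoted above; the on-chip capacities are assumptions taken from commonly published spec-sheet figures, not from the paper, and should be verified before being relied on.

```python
KB = 1024
MB = 1024 * KB

# Working-set sizes as quoted in the post (page 6 of the paper).
working_sets = {"matrices": 250 * KB, "image": 4 * MB}

# ASSUMED on-chip capacities (spec-sheet figures, not from the paper).
on_chip = {
    "GTX 285 shared memory (per multiprocessor)": 16 * KB,
    "Nehalem shared L3": 8 * MB,
    "Power7 shared L3": 32 * MB,
}

def fits(size, capacity):
    """True if a working set of `size` bytes fits in `capacity` bytes."""
    return size <= capacity

for chip, capacity in on_chip.items():
    verdicts = ", ".join(
        f"{name}: {'fits' if fits(size, capacity) else 'does not fit'}"
        for name, size in working_sets.items()
    )
    print(f"{chip} ({capacity // KB} KB) -> {verdicts}")
```

Under these assumed capacities, both the matrices and the image fit in either CPU's last-level cache, while neither fits in a single multiprocessor's shared memory on the GPU. This ignores the other cache levels and how the data is actually tiled, so it is only a sanity check on the argument, not a performance model.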

I’d be very curious to see whether the new Fermi-based GTX 480 changes this at all.

via IBM Research | Technical Paper Search | Believe it or Not! Multicore CPUs can Match GPUs for FLOP-intensive Applications!|(Search Reports).

Tags: algorithm, ibm, paper

Randall Hand

Randall Hand is a computer graphics programmer and news junky that's been working in the field for the last 15 years. He's responsible for visualizations generated on some of the most powerful supercomputers in the world, ytnef, mullion support in ParaView, and VizWorld.com.

1 Comment

Stefan says:
July 1, 2010 at 5:05 pm
>Multicore CPUs can Match GPUs for FLOP-heavy Applications?
Well, I never could get CUDA to do a complex job faster than I could on a multi-core CPU. Only SIMD-friendly algorithms are zippier on a GPU, but such algorithms tend to be brute-force algorithms with little room to reduce their time complexity (mostly due to some hardware-related nuances). Smart algorithms that take the locality of data into account are inherently non-SIMD-friendly and can be implemented effectively only on an efficient MIMD machine, such as an i7. The multi-core trend for general-purpose CPUs should, objectively, squeeze the GPU out even from the areas where it has been competitive for a while, unless the GPU becomes a multi-core CPU (once it can run SPECint_rate).
For the record: this is just my opinion, and it is not the view of the majority in the computer industry (thanks to computer games training, I guess, plus NVIDIA propaganda).
Stefan
