
NVidia Announces CUDA 4.0

February 28, 2011

Big news from NVidia today as they announce the latest version of their GPGPU toolkit, CUDA 4.0. This new version has all the usual performance enhancements and bugfixes, but also comes with three new features that I can guarantee all you CUDA developers are going to love.

  • GPUDirect 2.0 – The previous version of GPUDirect worked with clusters built on Mellanox InfiniBand backbones; this new version adds support for multiple GPU cards in a single machine. Where previously the CPU was involved in memory transfers between cards, you can now DMA directly between cards using MPI-style send and receive calls.

  • Unified Virtual Addressing – CPU and GPU memories now appear in a single, uniform address space, which makes moving data between them much easier.
  • Integrated Thrust Support – Thrust is a great C++ template library that sits alongside libraries like CUBLAS and CULA, adding STL-style containers and algorithms for all the popular data types. Thrust has a great following and an active community, and boasts run-time selection of CPU vs. GPU code paths, making the resulting code a bit more portable than hand-written CUDA.

All of this further reinforces NVidia’s commitment to not just building nice graphics cards, but to building and supporting a developer community around the computational capabilities of their hardware. With renewed support in their Tegra line and new ARM cores on the horizon, NVidia knows that having a wide community of developers ready to go on the new hardware is critical to mainstream market success. Microsoft is already pumping up support for a future ARM-based Windows, and ARM already enjoys wide support in embedded applications like set-top boxes, smartphones, and tablets. Tools like Unified Virtual Addressing and GPUDirect 2.0 further NVidia’s efforts to tear down the barriers between the CPU and GPU, making future porting to ARM systems simpler.

Get the full details of the release in the Press Release after the break.

New CUDA 4.0 Release Makes Parallel Programming Easier

Unified Virtual Addressing, GPU-to-GPU Communication and Enhanced C++ Template Libraries Enable More Developers to Take Advantage of GPU Computing

SANTA CLARA, Calif.—Feb. 28, 2011— NVIDIA today announced the latest version of the NVIDIA® CUDA® Toolkit for developing parallel applications using NVIDIA GPUs.

The NVIDIA CUDA 4.0 Toolkit was designed to make parallel programming easier and enable more developers to port their applications to GPUs. The release centers on three main features:

NVIDIA GPUDirect™ 2.0 Technology – Offers support for peer-to-peer communication among GPUs within a single server or workstation. This enables easier and faster multi-GPU programming and application performance.
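To get a feel for what peer-to-peer transfers look like in code, here is a minimal sketch using the runtime calls introduced in CUDA 4.0 (error checking omitted; it assumes two P2P-capable GPUs on the same PCIe root, and the buffer size is arbitrary):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int canAccess = 0;
    // Ask the runtime whether GPU 0 can DMA directly into GPU 1's memory.
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) {
        printf("P2P not supported between GPU 0 and GPU 1\n");
        return 1;
    }

    float *d0, *d1;
    size_t bytes = 1 << 20;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // second argument is a reserved flags field
    cudaMalloc(&d0, bytes);

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // Copy GPU 0 -> GPU 1 without staging through host memory.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}
```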

Unified Virtual Addressing (UVA) – Provides a single merged-memory address space for the main system memory and the GPU memories, enabling quicker and easier parallel programming.
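In practice, UVA means the runtime can work out for itself which side of the PCIe bus a pointer lives on. A minimal sketch (error checking omitted):

```cuda
#include <cuda_runtime.h>

int main(void) {
    float *h, *d;
    size_t bytes = 1 << 20;

    // Pinned host memory participates in the unified address space.
    cudaHostAlloc(&h, bytes, cudaHostAllocDefault);
    cudaMalloc(&d, bytes);

    // With UVA the runtime infers the transfer direction from the pointers,
    // so cudaMemcpyDefault replaces cudaMemcpyHostToDevice and friends.
    cudaMemcpy(d, h, bytes, cudaMemcpyDefault);

    // The runtime can also report which device owns a given pointer.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```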

Thrust C++ Template Performance Primitives Libraries – Provides a collection of powerful open source C++ parallel algorithms and data structures that ease programming for C++ developers. With Thrust, routines such as parallel sorting are 5X to 100X faster than with Standard Template Library (STL) and Threading Building Blocks (TBB).
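A Thrust GPU sort really does read like STL code. A minimal sketch (the vector size is arbitrary):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void) {
    // Fill a host vector with random keys.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

    // Assigning to a device_vector moves the data to the GPU...
    thrust::device_vector<int> d = h;

    // ...and this one line dispatches a parallel sort on the device.
    thrust::sort(d.begin(), d.end());

    // Copy the sorted keys back to the host.
    thrust::copy(d.begin(), d.end(), h.begin());
    return 0;
}
```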

“Unified virtual addressing and faster GPU-to-GPU communication makes it easier for developers to take advantage of the parallel computing capability of GPUs,” said John Stone, senior research programmer, University of Illinois, Urbana-Champaign.

“Having access to GPU computing through the standard template interface greatly increases productivity for a wide range of tasks, from simple cashflow generation to complex computations with Libor market models, variable annuities or CVA adjustments,” said Peter Decrem, director of Rates Products at Quantifi. “The Thrust C++ library has lowered the barrier of entry significantly by taking care of low-level functionality like memory access and allocation, allowing the financial engineer to focus on algorithm development in a GPU-enhanced environment.”

The CUDA 4.0 release includes a number of other key features and capabilities, including:

  • MPI Integration with CUDA Applications – Modified MPI implementations such as Open MPI automatically move data to and from GPU memory over InfiniBand when an application makes an MPI send or receive call.
  • Multi-thread Sharing of GPUs – Multiple CPU host threads can share contexts on a single GPU, making it easier for multi-threaded applications to share a single GPU.
  • Multi-GPU Sharing by Single CPU Thread – A single CPU host thread can access all GPUs in a system. Developers can easily coordinate work across multiple GPUs for tasks such as “halo” exchange in applications.
  • New NPP Image and Computer Vision Library – A rich set of image transformation operations that enable rapid development of imaging and computer vision applications.
  • New and Improved Capabilities
    • Auto performance analysis in the Visual Profiler
    • New features in cuda-gdb and added support for MacOS
    • Added support for C++ features like new/delete and virtual functions
    • New GPU binary disassembler
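The multi-GPU-from-one-thread capability is worth a sketch of its own: before 4.0, each host thread was effectively tied to a single GPU context, whereas now one thread can walk every device with cudaSetDevice. A minimal sketch (error checking omitted; kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    int nDevices = 0, n = 1 << 20;
    cudaGetDeviceCount(&nDevices);

    // One host thread owns a buffer on every GPU in the system.
    float **d = new float*[nDevices];
    for (int dev = 0; dev < nDevices; ++dev) {
        cudaSetDevice(dev);  // switch this thread's current device
        cudaMalloc(&d[dev], n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d[dev], n);  // asynchronous launch
    }

    // Wait for every device to finish, then clean up.
    for (int dev = 0; dev < nDevices; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d[dev]);
    }
    delete[] d;
    return 0;
}
```

Because kernel launches are asynchronous, the first loop queues work on all GPUs before the second loop waits on any of them, which is what makes patterns like halo exchange practical from a single thread.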

A release candidate of CUDA Toolkit 4.0 will be available free of charge beginning March 4, 2011, by enrolling in the CUDA Registered Developer Program at: www.nvidia.com/paralleldeveloper. The CUDA Registered Developer Program provides a wealth of tools, resources, and information for parallel application developers to maximize the potential of CUDA.

For more information on the features and capabilities of the CUDA Toolkit and on GPGPU applications, please visit: www.nvidia.com/cuda.


NVIDIA (NASDAQ:NVDA) awakened the world to the power of computer graphics when it invented the GPU in 1999. Since then, it has consistently set new standards in visual computing with breathtaking, interactive graphics available on devices ranging from tablets and portable media players to notebooks and workstations. NVIDIA’s expertise in programmable GPUs has led to breakthroughs in parallel processing which make supercomputing inexpensive and widely accessible. The Company holds more than 1,700 patents worldwide, including ones covering designs and insights that are essential to modern computing. For more information, see www.nvidia.com.


  • Peter Schmidt

    Well, NVIDIA needed something to cover the Achilles’ heel of its architecture, since most real computation problems in the world exceed the memory of graphics cards. In fact, the longer I think about it, one can compare this GPU ecosystem with the infrastructure of the PS3, which has some small fast memory and some more, a lot slower memory. Also, programming CUDA and afterwards tuning your algorithms to suit a particular GPU is as much of a PITA as programming for the Cell SPEs.

  • Unified Virtual Addressing is a good step in the right direction. NVIDIA has now taken each individual memory space and made them unified. The way that I look at it is that it is NUMA (Non-Uniform Memory Access) memory. You will still need to pay some attention to the fact that local memory has a bandwidth of 148 GB/sec, while accessing the memory on another graphics card will be limited by the PCIe bus at 8 GB/sec.