Stories from September 21st, 2010

NVidia announces CUDA-x86

Breaking News: On stage at the GTC 2010 keynote, NVidia announced that it is working with PGI (The Portland Group) to develop a new compiler, CUDA-x86. If you don’t have a GPU cluster handy, you can run your CUDA code directly on the CPU for testing and debugging, and even deploy it to users who don’t have suitable GPUs.

Of course, you’ll need the GPU for optimum performance, but it makes CUDA not just an NVidia technology, but a technology that runs on any x86 computer.

Update 12:10pm: It was later confirmed that this will not be a free offering, but rather a commercial product marketed and sold by PGI.

Update 9/23 1pm: More information here.

Science

Industrial Light and Magic & NVIDIA Quadro

Richard Kerris, CTO, ILM, discusses how NVIDIA Quadro GPUs enabled the creation of breakthrough visual effects. Hear how the ILM creative team was able to create life-like simulations of fire for blockbuster movies, including Harry Potter and The Last Airbender.

Hardware

 
Stories from September 20th, 2010

UK Provider Peer 1 offers Free GPU Cloud Access

If you’ve wanted to try some heavy-duty GPU compute, but haven’t had the time or money to build your own superpowered cluster, PEER 1 is here with an interesting offering. They’re expanding their traditional datacenter and virtualized hosting services to include GPU cloud support, and for a limited time they are offering free access for folks to try it out.

PEER 1 Hosting in the UK has launched a supercomputing cloud service based on the Nvidia Tesla S1070 and M2050 GPU computing systems. We’re talking serious computing power here; the S1070 is a 1U rack mount that contains 960 processor cores and four teraflops of computing power.

via Need some supercomputer power for your datacenter? Check the cloud. | ZDNet. via InsideHPC

Hardware, Science

 
Stories from September 15th, 2010

Tech-X Corporation Releases GPULib v1.4

Tech-X has just rev’ed their GPULib product to version 1.4 and added CUDA 3.1 support, as well as new IDL v8 and MATLAB features.

GPULib v1.4 supports CUDA streams, enabling concurrent execution of multiple kernels. The product also supports asynchronous data transfer. GPULib now leverages new features of IDL v8.0 enabling more seamless integration between the two products. The product also has an updated MATLAB interface. This release of GPULib also includes a variety of new algorithms, such as functions for sorting and large histogramming. GPULib v1.4 is compatible with CUDA Toolkit 3.1.

Also, they’ll be at GTC next week demonstrating GPULib in medical image processing and computational electromagnetic simulations.

Full press release after the break.


Science

 
Stories from September 14th, 2010

NVidia Releases CUDA3.2, NSight 1.5

NVidia has today released the newest version of their popular CUDA Toolkit, version 3.2, which boasts all-around performance improvements and several new features. The new version includes a new sparse matrix library, ‘CUSPARSE’, to complement the common CUBLAS and CULAPACK libraries that excel at dense matrices. Also, they have a new GPU-accelerated random-number library, ‘CURAND’. GPU-accelerated random numbers may seem a bit pointless at first glance, but random number entropy is a big deal in large-scale crypto, so I’m sure certain government labs will love that feature. But even that’s not all, as they’ve added some nice cluster management features (to allow admins to lock processes to certain GPUs, a necessary feature in queue-driven clusters) as well as support for 64-bit memory addressing, which opens up the 6GB of memory available on the Quadro 6000.

In addition, they’ve just announced the new version of Parallel NSight, v1.5, which includes compatibility with Microsoft Visual Studio 2010. The new version offers a new “Dual GPU” mode that enables the Compute Debugger on a system with two suitable GPUs, previously a feature reserved only for network debugging or the Multi-OS SLI systems. It adds support for the new Fermi hardware (GTS460 and such), and all of the features of CUDA 3.2.

For those of you in the GPU compute space, however, the big news may be the new ‘TCC’ driver. For a while now, Nvidia has offered a special ‘Tesla Compute Cluster’ driver that enables CUDA and GPU support without dragging in the Windows display subsystem. While initially intended to overcome some problems with Windows’ strange requirements for hardware access when using Remote Desktop and in cluster systems like HPC Server, the driver loads the Tesla card (or Quadro card, if you really want to) not as a display device, but as an additional compute card installed in the system. Although it wasn’t the intent, Nvidia found some interesting side-effects in how Windows deals with it. When working through the Windows display system and the WDDM (Windows Display Driver Model), you are required to bundle all of your kernels together before you load them to the card, with each kernel taking approximately 30 microseconds. If you instead go through the Windows Driver Model (WDM), you can load kernels when convenient, and each takes only approximately 2.5 microseconds. That means a complex situation requiring 10 compute kernels:

  • WDDM: 30 microseconds * 10 kernels = 300 microseconds
  • WDM: 2.5 microseconds * 10 kernels = 25 microseconds.
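As a quick check of the arithmetic above, here is a minimal Python sketch. The per-launch costs are the approximate figures quoted in this post; the function and constant names are made up for illustration, and the model assumes overhead simply scales linearly with the number of kernels.

```python
# Approximate per-kernel launch costs quoted in the post, in microseconds.
WDDM_US_PER_KERNEL = 30.0
WDM_US_PER_KERNEL = 2.5


def launch_overhead_us(num_kernels, us_per_kernel):
    """Total launch overhead under a simple linear model (an assumption)."""
    if num_kernels < 0:
        raise ValueError("num_kernels must be non-negative")
    return num_kernels * us_per_kernel


if __name__ == "__main__":
    for n in (10, 100, 1000):
        wddm = launch_overhead_us(n, WDDM_US_PER_KERNEL)
        wdm = launch_overhead_us(n, WDM_US_PER_KERNEL)
        print(f"{n} kernels: WDDM {wddm} us, WDM {wdm} us")
```

Under this simple model the WDM path cuts launch overhead by a factor of 12, whatever the kernel count.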

For people doing very heavy GPU computation, this adds up fast.  However, users found themselves having to make a choice:  Load up the TCC driver and lose all display support, or load up the display driver and deal with the slightly degraded performance.

No more, as the new driver enables a run-time switch that can toggle between Display mode and TCC mode.  Now you can take your dual Quadro system and run in graphics SLI mode for superior performance, then switch one of your Quadros to TCC mode and run your compute codes faster.  Granted, it’s not a situation many people find themselves in but for the few that do: It’s a welcome change.

Parallel NSight will be available next week (at GTC conveniently) on September 22nd.

Full release after the break.


Science

 
Stories from September 1st, 2010

Resource Of The Week 9/1/10: CUDA By Example

This week’s recommended resource is for anyone gearing up for NVidia’s GPU Technology Conference at the end of this month, and comes straight from two senior developers on the CUDA software platform team: the recently released CUDA By Example.

CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. The authors introduce each area of CUDA development through working examples. After a concise introduction to the CUDA platform and architecture, as well as a quick-start guide to CUDA C, the book details the techniques and trade-offs associated with each key CUDA feature. You’ll discover when to use each CUDA C extension and how to write CUDA software that delivers truly outstanding performance.

Major topics covered include

  • Parallel programming
  • Thread cooperation
  • Constant memory and events
  • Texture memory
  • Graphics interoperability
  • Atomics
  • Streams
  • CUDA C on multiple GPUs
  • Advanced atomics
  • Additional CUDA resources

All the CUDA software tools you’ll need are freely available for download from NVIDIA.
http://developer.nvidia.com/object/cuda-by-example.html

This book is actually the foundation (I’m told) of a recent Webinar series on CUDA, sponsored by Nvidia.  This book, and many others, is available in the VizWorld store.

Science

 
Stories from August 30th, 2010

High performance GPU radix sorting in CUDA

Google Code is hosting a project from the University of Virginia that claims to be the fastest-ever sorting algorithm, taking advantage of GPUs.

This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second).

In addition, one of our design goals for this project is flexibility. We’ve designed our implementation to adapt itself and perform well on all generations and configurations of programmable NVIDIA GPUs, and for a wide variety of input types.

They have some great detail on the website, but it looks like their algorithm requires that the entire input deck fit into GPU memory: they specify a maximum input deck size of 272M keys, which at 4 bytes per integer puts it just over 1GB.
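To make that estimate concrete, a small Python sketch follows. It assumes "M" could mean either 10^6 or 2^20 keys, since the convention used on the project page isn't stated here, and it counts only the key array itself; any working buffers the sorter needs are ignored. The function name is hypothetical.

```python
def key_array_bytes(num_keys, bytes_per_key=4):
    """Bytes needed to hold num_keys fixed-width keys (keys only, no scratch space)."""
    if num_keys < 0 or bytes_per_key <= 0:
        raise ValueError("num_keys must be >= 0 and bytes_per_key must be > 0")
    return num_keys * bytes_per_key


# Two readings of "272M" (which one the project uses is an assumption).
decimal_keys = 272 * 10**6
binary_keys = 272 * 2**20

decimal_gb = key_array_bytes(decimal_keys) / 10**9   # decimal gigabytes
binary_gib = key_array_bytes(binary_keys) / 2**30    # binary gibibytes

if __name__ == "__main__":
    print(f"272 * 10^6 keys: {decimal_gb:.3f} GB")
    print(f"272 * 2^20 keys: {binary_gib:.4f} GiB")
```

Either reading lands a little above 1G, which matches the estimate in the text.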

via RadixSorting – back40computing – High performance GPU radix sorting in CUDA – Project Hosting on Google Code.

Science

 
Stories from August 23rd, 2010

NVIDIA Names Georgia Institute of Technology a CUDA Center of Excellence

NVidia has just announced that the Georgia Institute of Technology, not to be confused with that other ‘git’ thing, has just joined the exclusive ranks of the “CUDA Center of Excellence” universities.

“Georgia Tech has a long history of education and research that depends heavily on the parallel processing capabilities that NVIDIA has introduced with its CUDA architecture,” Vetter said. “This award allows us to focus, what is now a large amount of activity across 25 different research groups, under a single center, which will significantly amplify our research capabilities.”

In particular, GIT is responsible for the Ocelot project we brought you back in December. Does this mean a future version of CUDA will run on CPUs as well as GPUs? Only time will tell.

via NVIDIA Names Georgia Institute of Technology a CUDA Center of Excellence – MarketWatch.

Science

 
Stories from July 26th, 2010

High Performance and Scalable GPU Radix Sorting

Over at the NVidia forums, one enterprising researcher has optimized the Radix Sort algorithm to run on NVidia GPUs (GTX480 with CUDA) and put it up against previously optimized and published results from NVidia. The results are staggering.

This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our GPU sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the Giga-keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second). Our results demonstrate a range of 2x-4x speedup over the current Thrust and CUDPP sorting implementations, and we operate on keys of any C/C++ numeric type. Satellite values are optional, and can be any arbitrary payload structure (within reason).
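For readers unfamiliar with radix sorting, here is a minimal sequential sketch in Python of the textbook least-significant-digit variant for 32-bit unsigned keys. This is not the project's code and says nothing about how its GPU kernels are organized; it only illustrates the general technique of repeated histogram, prefix-sum, and stable scatter passes. The function name and the 8-bit digit size are choices made for this example.

```python
def radix_sort_u32(keys, bits_per_pass=8):
    """Return a sorted copy of unsigned 32-bit integer keys (LSD radix sort)."""
    if bits_per_pass <= 0 or 32 % bits_per_pass != 0:
        raise ValueError("bits_per_pass must evenly divide 32")
    src = list(keys)
    for k in src:
        if not 0 <= k < 2**32:
            raise ValueError("keys must fit in 32 unsigned bits")

    radix = 1 << bits_per_pass
    mask = radix - 1

    # Process digits from least to most significant.
    for shift in range(0, 32, bits_per_pass):
        # 1. Histogram: count how many keys have each digit value.
        counts = [0] * radix
        for k in src:
            counts[(k >> shift) & mask] += 1

        # 2. Exclusive prefix sum: first output slot for each digit value.
        offsets = [0] * radix
        running = 0
        for digit in range(radix):
            offsets[digit] = running
            running += counts[digit]

        # 3. Stable scatter: keys with equal digits keep their relative order.
        dst = [0] * len(src)
        for k in src:
            digit = (k >> shift) & mask
            dst[offsets[digit]] = k
            offsets[digit] += 1
        src = dst
    return src


if __name__ == "__main__":
    print(radix_sort_u32([170, 45, 75, 4294967295, 0, 802, 24, 2, 66]))
```

Each pass touches every key a fixed number of times, so the total work grows linearly with the input size for fixed-length keys, which is part of why radix sort suits high-throughput hardware.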

On a quad core i7 from Intel: 240M 32-bit Keys per second.

On a 32-core Knights Ferry MIC (the successor to Larrabee): 560M 32-bit Keys per second.

On the GTX480: 1,005M 32-bit Keys per second.
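The ratios implied by these three figures can be worked out with a short Python sketch. It assumes all three rates are in millions of 32-bit keys per second; the dictionary labels and function name are invented for the example.

```python
# Sorting rates from the post, in millions of 32-bit keys per second.
RATES_MKEYS = {
    "Core i7 (quad core)": 240.0,
    "Knights Ferry (32 cores)": 560.0,
    "GTX480": 1005.0,
}


def speedup_over(baseline, target="GTX480"):
    """How many times faster the target sorts than the baseline."""
    return RATES_MKEYS[target] / RATES_MKEYS[baseline]


if __name__ == "__main__":
    for name in ("Core i7 (quad core)", "Knights Ferry (32 cores)"):
        print(f"GTX480 vs {name}: {speedup_over(name):.2f}x")
```

On these numbers the GTX480 comes out a little over 4x the Core i7 result and a little under 2x the Knights Ferry result.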

What makes this particularly impressive is that one of Intel’s arguments has always been that GPU algorithm performance is achievable via CPU optimization if care is taken. They were proud of those optimized results on the Intel hardware, and the NVidia hardware delivered nearly twice the Knights Ferry throughput and more than four times that of the Core i7.

via SRTS Radix Sort: High Performance and Scalable GPU Radix Sorting – NVIDIA Forums.

Science

 
Stories from July 21st, 2010

NVidia Releases Parallel Nsight To The Masses

NVidia’s Nsight GPU debugging tool has been in beta for a few months, but no longer: NVidia today announced that it is entering the mainstream as a production tool for CUDA debugging. Right now you can head on over to the NSight page and get NSight 1.0 for free for all your CUDA, C/C++, and .NET programming needs within Visual Studio.

Visual Studio developers can now use Parallel Nsight to debug CUDA C/C++, or DirectCompute applications on the GPU using the same familiar tools and techniques as on the CPU. Parallel Nsight also provides the analysis tools that give developers the information required to achieve the highest levels of GPGPU application performance.

In addition, they’ve taken the surprising step of splitting the tool into a Standard (free) and Professional (paid) version, with some extra new features.  If you fork out for the Pro version, you get:

  • The System Analyzer, a tool for timeline inspection and performance analysis
  • The Ability to set Data Breakpoints, in addition to Code Breakpoints
  • OpenCL Support
  • Premium Support

The Professional version isn’t quite ready for release yet, but you can go over right now and get a time-limited trial (30 days) of the Release Candidate to try it out.

Both versions come with tools for graphics debugging and HLSL debugging, enabling such impressive features as pausing your graphic and selecting an individual pixel and digging through all of the various shaders that affected the final result.  Perhaps the most powerful, and my favorite, feature is the flexibility in the debugging hardware environment.  Of course, you can run everything on one machine if you like, but it can go far beyond that.

With a bit of configuration work, you can build a workstation with 2 Quadro cards, and exploit the SLI MultiOS support to run your debugger in the host OS, and monitor an application running on the second GPU in the virtualized environment.  This will be great for those tricky ‘system locker’ problems.

But if you really want to go ‘extreme’, say you have an entire visualization cluster you need to debug or large arrays of QuadroPlex systems, you can use the new Networking feature.  Connect over an ordinary TCP/IP link to the client machine and debug, analyze, and inspect your code from the comfort of your own desk.

Altogether, NVidia has drawn a new line in the sand, defining the new standard for GPU/GPGPU debugging technology, and made NSight a must-have product for Visual Studio developers. The Analyzer tool will be great for those people trying to eke out the last few cycles of performance in their code, and the remote debugging features will be a welcome addition for anyone trying to debug on large-scale GPU clusters or graphics card arrays (NextIO, QuadroPlex, Tesla, etc).

Read the full announcement after the break, and go check NVidia’s Site for the downloads!


Hardware, Science

VizWorld.com is a production of VizWorld, LLC © 2009