AMD has published a “whitepaper” (it irks me that they call these whitepapers when they’re actually PowerPoint presentations) discussing optimizations for image convolution algorithms on both CPU and GPU. They start with an algorithm and add some optimizations for the memory overlap, and then naively port it to a Radeon 5870, where it runs in 1511 ms. Then, with some careful optimizations, they work it down to a mere 182 ms!
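For anyone who hasn't met image convolution before, here is a minimal sketch of the kind of naive baseline that work like this typically starts from. It is not AMD's code: the function name `convolve2d`, the clamp-to-edge border handling, and the choice of plain Python lists are all my own assumptions for illustration. Like most image filters, it applies the kernel without flipping it.

```python
def convolve2d(image, kernel):
    """Naive 2D image filter: every output pixel is a weighted sum of its neighbours.

    `image` and `kernel` are lists of rows of numbers. Border pixels are
    handled by clamping coordinates to the image edge (an assumption made
    for this sketch; other border policies are equally valid).
    """
    height, width = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    r_off, c_off = k_rows // 2, k_cols // 2  # offset of the kernel centre

    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    # Clamp the sample position so it stays inside the image.
                    yy = min(max(y + i - r_off, 0), height - 1)
                    xx = min(max(x + j - c_off, 0), width - 1)
                    acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


if __name__ == "__main__":
    picture = [[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]]
    identity = [[0, 0, 0],
                [0, 1, 0],
                [0, 0, 0]]
    print(convolve2d(picture, identity))
```

Every output pixel re-reads most of the same input pixels as its neighbours, which is why memory access tends to dominate the cost of a straightforward port like this.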

Download the PDF, or view it online.