With the new flexibility of OpenGL 4.0 and the Fermi architecture, Cyril Crassin decided to revisit some of the older A-buffer algorithms and see what kind of improvements he could manage. What is an A-buffer, you ask?

Basically, an A-buffer is a simple list of fragments per pixel [Carpenter 1984]. Previous methods to implement it on DX10-generation hardware required multiple passes to capture a useful number of fragments per pixel. They were essentially based on depth peeling, with enhancements that capture more than one layer per geometry pass, such as the k-buffer and the stencil-routed k-buffer. Bucket sort depth peeling can capture up to 32 fragments per geometry pass, but with only 32 bits per fragment (just a depth) and at the cost of potential collisions. All of these techniques were complex, and they were especially limited by the maximum of 8 render targets that the fragment shader could write to.
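To make the "list of fragments per pixel" idea concrete, here is a minimal CPU-side sketch in Python. It is not Crassin's GPU implementation; it only illustrates the data structure and one common way to resolve it. The function names `build_abuffer` and `resolve_pixel` are made up for this example, and it assumes that a smaller depth means closer to the camera and that fragments carry non-premultiplied RGBA colors.

```python
from collections import defaultdict


def build_abuffer(fragments):
    """Collect every fragment into a per-pixel list (hypothetical helper).

    fragments: iterable of (x, y, depth, (r, g, b, a)) tuples.
    """
    abuffer = defaultdict(list)
    for x, y, depth, rgba in fragments:
        # No fragment is discarded: each pixel keeps all of its layers.
        abuffer[(x, y)].append((depth, rgba))
    return abuffer


def resolve_pixel(frags, background=(0.0, 0.0, 0.0)):
    """Sort one pixel's fragments far-to-near and blend them with 'over'.

    Assumes smaller depth = nearer to the camera.
    """
    color = background
    for _, (r, g, b, a) in sorted(frags, key=lambda f: f[0], reverse=True):
        color = (
            r * a + color[0] * (1.0 - a),
            g * a + color[1] * (1.0 - a),
            b * a + color[2] * (1.0 - a),
        )
    return color


if __name__ == "__main__":
    frags = [
        (0, 0, 0.5, (1.0, 0.0, 0.0, 0.5)),  # half-transparent red, nearer
        (0, 0, 0.9, (0.0, 0.0, 1.0, 1.0)),  # opaque blue, farther
    ]
    abuffer = build_abuffer(frags)
    print(resolve_pixel(abuffer[(0, 0)]))
```

Because every fragment is kept, the order in which geometry is submitted does not matter; the sort at resolve time takes care of it. That is what makes the structure handy for things like order-independent transparency.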

So, he rewrote the algorithms to exploit the new capabilities.  How did it perform?

It worked pretty well: it provides something like a 1.5x speedup over the fastest previous approach (at least the fastest I know of!), with zero artifacts, and it supports an arbitrary number of layers in a single geometry pass.

I would call that a resounding success! Find details and example code at his site.

via Icare3D Blog: Fast and Accurate Single-Pass A-Buffer using OpenGL 4.0+.