NVIDIA’s CUDA is a wonderful way of exploiting their graphics cards (Graphics Processing Units, or GPUs) for general-purpose computing (GPGPU). Effectively, CUDA offers access to the cards via a relatively simple syntax extension to C (one of my favourite languages).

However, I’ve found that the mixture of grids, blocks, warps and threads that one must understand in order to program in CUDA is not especially well defined anywhere. I think some of this problem derives from a muddle between the logical and the physical model. CUDA offers a generic logical model: there is no real enforcement that the hardware will be a GPU, and in fact one can compile for a normal processor. However, there is one key constraint: the threads within a block are assumed to be able to communicate quickly with one another, whereas blocks are expected to be independent of each other (while still running identical code) and able to execute asynchronously relative to other blocks.
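To make that constraint concrete, here is a minimal sketch of a block-level sum. The names `blockSum`, `sumOnes` and `THREADS_PER_BLOCK` are my own inventions for illustration, the block size of 256 is an arbitrary choice, and I have left out error checking to keep it short. I am also assuming a toolkit and card recent enough to support `cudaMallocManaged`. Threads inside a block cooperate through `__shared__` memory and `__syncthreads()`, while the blocks never talk to each other: each one writes a single partial result, and the host adds those up afterwards.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // illustrative choice; must be a power of two for this reduction

// Each block sums its own slice of `in` and writes one partial result.
__global__ void blockSum(const float *in, float *partial, int n)
{
    // Shared memory is visible only to the threads of this block.
    __shared__ float cache[THREADS_PER_BLOCK];

    int tid = threadIdx.x;                             // index within the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // index within the whole grid

    cache[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                                   // barrier for this block only

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    // One thread per block publishes the result; no block reads another's data.
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}

// Hypothetical host helper: sums n ones on the device and returns the total.
float sumOnes(int n)
{
    const int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    float *in = NULL, *partial = NULL;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i)
        in[i] = 1.0f;

    // Grid of `blocks` blocks, each of THREADS_PER_BLOCK threads.
    blockSum<<<blocks, THREADS_PER_BLOCK>>>(in, partial, n);
    cudaDeviceSynchronize();   // wait until every block has finished

    // Blocks are combined on the host, after the kernel has completed.
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b)
        total += partial[b];

    cudaFree(in);
    cudaFree(partial);
    return total;
}

int main(void)
{
    printf("total = %.1f\n", sumOnes(1000));
    return 0;
}
```

Note that nothing in the kernel depends on the order in which blocks run, or on how many run at once. That is the independence the logical model asks for.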

To help myself, I’ve put together the following diagram of what I understand the logical model to be.  I believe it is correct, but please post any corrections you find.

CUDA logical model

It is perhaps a little weak on the different types of memory available, being primarily focussed on thread execution, but it does highlight some points.

To program properly in CUDA you also need to take account of the physical factors relating to the resources attached to each Streaming Multiprocessor, especially the available shared memory, and of the bandwidth-latency product. However, introducing these factors into the logical model tends (I believe) to confuse. The logical model is therefore a first step.
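As a starting point for the physical side, the runtime can report some of these resources. The sketch below is only an illustration: `reportDevice` is a name I made up, and it prints just a handful of the fields in `cudaDeviceProp` that bear on the discussion above (the number of Streaming Multiprocessors, shared memory per block, warp size and the maximum threads per block).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a few physical limits of one device.
// Returns 0 on success, -1 if the device cannot be queried.
int reportDevice(int device)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || device >= count)
        return -1;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;

    printf("device %d: %s\n", device, prop.name);
    printf("  streaming multiprocessors : %d\n", prop.multiProcessorCount);
    printf("  shared memory per block   : %zu bytes\n", prop.sharedMemPerBlock);
    printf("  warp size                 : %d threads\n", prop.warpSize);
    printf("  max threads per block     : %d\n", prop.maxThreadsPerBlock);
    return 0;
}

int main(void)
{
    if (reportDevice(0) != 0)
        printf("no usable CUDA device found\n");
    return 0;
}
```

These numbers are what a block size or a shared memory array has to fit within on a given card, which is why they belong to the physical model rather than the logical one.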