My non-computer-geek friends will probably want to skip this...
I've written this primarily as a note to myself to summarize what I've learned recently about GPGPU computing. However, I expect that some of my number crunching friends and colleagues will be interested in the topic, so I hope that this will be interesting to them as well.
Back when I was an undergrad in the 1980s, I learned about the "graphics pipeline" used in computer graphics. This was a collection of algorithmic processing steps, sometimes implemented in software and sometimes in specialized hardware. At the time, these computations were done in fixed point arithmetic, and the topic was very esoteric: few programmers knew or cared much about it. The hardware implementations were very expensive, so you only found them in high end graphics workstations.
Fast forward through the 90s, when this technology worked its way down to personal computers and into graphics cards costing less than $100. Over time, these graphics cards became more and more powerful, employing lots of specialized functional units and using floating point arithmetic. The graphics processing unit (GPU) was highly specialized for graphics processing. At a certain point in the mid 2000s, it became simpler to replace the specialized functional units in GPUs with general purpose processing units that could do all of the graphics processing tasks. The General Purpose Graphics Processing Unit (GPGPU) was born.
GPGPUs are different from mainstream CPUs in ways that should be of great interest to people who do scientific computing.
First, GPGPUs have lots of processors, and in particular lots of floating point arithmetic units. Each floating point unit is no faster than the floating point unit found on a CPU core, but the GPGPU might have hundreds of processing units instead of the 2, 4, or 8 found in a multicore CPU. It's not uncommon for a computer to have a CPU that can theoretically do a maximum of 20 billion floating point operations per second (20 GFLOPS) and a GPGPU that can do 500 GFLOPS. Until recently GPGPUs could only do single precision floating point arithmetic, but the manufacturers have recently started adding double precision capability.

Second, the GPGPU has its own device memory, made up of the same kind of DRAM chips used in the main system memory. A modern high end card might have as much as 4 gigabytes of DRAM. However, because graphics processing demands very high memory bandwidth, the memory system on a GPGPU typically has a much wider data transfer path. Instead of transferring 8 bytes per memory access, a GPGPU might have a path to memory that is much wider (e.g. 64 bytes). As a result the GPGPU has very high memory bandwidth, although its memory latency is similar to that of the main system memory.
In response to the problem of memory latency, the processing units in GPGPUs are typically multithreaded. Whenever an execution thread is waiting for data to be loaded from DRAM, another thread can quickly take over the processing unit and do some work while the data is brought in from memory. This strategy hasn't proven especially valuable in conventional CPUs, but it can be very useful in GPGPUs, which are often working on many different tasks at the same time and have high bandwidth, high latency memories.
This should be of interest to people who do numerical computing for two reasons.
First, there's an opportunity to use these GPGPUs to do your number crunching. Researchers have attracted a lot of attention to GPGPU computing by publishing papers showing speedups of 100x or more from using a GPGPU. Furthermore, the cost of these cards is being driven down by the huge market of video game playing computer geeks: unlike most hardware for high performance computing, there's a very large market for these devices.
A second important point is that 10 years from now, it's likely that mainstream CPUs will start to look a lot like GPGPUs: we're getting more and more cores with each new generation, and memory bandwidth continues to increase while memory latency has not improved much in recent years. The GPGPU is effectively a preview of future computing technology.
There are two major manufacturers of GPGPUs, Nvidia and AMD/ATI. The microprocessor manufacturer AMD bought up the independent GPU maker ATI a few years ago in a strategic move into GPUs. Intel also makes simple GPUs, but seems to be more interested in developing this kind of technology for general purpose use than in selling it in the form of graphics cards.
Unfortunately, there are significant differences between the architectures used by Nvidia and AMD. A very generic low level programming standard called OpenCL has been developed that can be used on both companies' GPGPUs, but most of the attention has focused on Nvidia's CUDA architecture and their CUDA extensions to the C programming language. Since CUDA depends on specific features of Nvidia's hardware, it simply won't run on AMD/ATI hardware. In the following, I'll focus on CUDA.
The basic idea in CUDA is that a program running on the system's CPU transfers data to the GPGPU, and then a specialized program (called a "kernel") runs on the GPGPU to process the data. The results are then transferred back to the main system memory.
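To make that concrete, here's a minimal sketch of the flow in C with CUDA extensions. The kernel name (scale_by_two) and the array size are invented for illustration, and error checking is omitted:

    // Minimal sketch of the CUDA host-side flow: copy in, run a kernel, copy out.
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // A trivial kernel; each thread handles one array element.
    __global__ void scale_by_two(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's element */
        if (i < n)
            x[i] = 2.0f * x[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_x = (float *)malloc(bytes);            /* host (system) memory */
        for (int i = 0; i < n; i++) h_x[i] = (float)i;

        float *d_x;
        cudaMalloc((void **)&d_x, bytes);               /* device (GPGPU) memory */
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   /* CPU -> GPGPU */

        /* Launch the kernel across many threads at once. */
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale_by_two<<<blocks, threadsPerBlock>>>(d_x, n);

        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   /* GPGPU -> CPU */
        printf("x[10] = %f\n", h_x[10]);                /* should print 20.0 */

        cudaFree(d_x);
        free(h_x);
        return 0;
    }

The <<<blocks, threadsPerBlock>>> syntax is the CUDA language extension that launches the kernel as a grid of many threads.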
The data is split up and processed by many different threads running on the processing units of the GPGPU. Each execution thread runs exactly the same program, and processes only a section of the original data. CUDA is somewhat more powerful than a single instruction multiple data (SIMD) architecture in that each thread has its own set of registers and can have private memory. In CUDA it is also possible for the threads to synchronize and communicate with each other through a small shared memory.
The threads generally run in lock step, except that it is possible to have conditional, data dependent instructions. For example, depending on whether x(thread) is positive or negative, some threads might do nothing (because their x values are positive) while other threads do computations that are only appropriate when x(thread) is negative. Nvidia calls this single instruction multiple thread (SIMT). Kernel programs can be written in C with CUDA extensions, but writing them well is a very tricky process that requires extensive knowledge of the CUDA architecture and tuning for the particular card that you have.
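Here's a sketch of what such a data dependent branch looks like inside a kernel (the function and the computation are invented for illustration):

    // Hypothetical kernel illustrating a data dependent branch (SIMT).
    // Threads whose x value is negative do the extra work; the other
    // threads in the same group sit idle through this section, which is
    // why divergent branches cost throughput.
    __global__ void fix_negatives(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's element */
        if (i < n && x[i] < 0.0f)
            x[i] = sqrtf(-x[i]);   /* only the "negative" threads execute this */
    }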
At a somewhat higher level, there is a CUDA version of the BLAS called CUBLAS that implements dense matrix operations such as matrix-matrix multiplication. To multiply A times B, you'd call CUBLAS routines to copy the matrices A and B from system memory to the GPGPU, call a CUBLAS routine to do the matrix multiplication, and then call another CUBLAS routine to copy the result back to system memory. Note that if your data is too large to fit into the device memory of the GPGPU, you'll have to break it up into chunks before sending it to the GPGPU for processing. Nvidia gives away CUBLAS with the CUDA developer tools.
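In code, the whole round trip looks roughly like this. This is a sketch using the original non-handle CUBLAS interface; matrices are column major as in the BLAS, and error checking is omitted:

    // Sketch of C = A*B for n-by-n matrices through CUBLAS.
    #include <cublas.h>

    void gpu_multiply(int n, const float *A, const float *B, float *C)
    {
        float *dA, *dB, *dC;
        cublasInit();

        /* Allocate device memory and copy A and B to the GPGPU. */
        cublasAlloc(n * n, sizeof(float), (void **)&dA);
        cublasAlloc(n * n, sizeof(float), (void **)&dB);
        cublasAlloc(n * n, sizeof(float), (void **)&dC);
        cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
        cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

        /* C = 1.0*A*B + 0.0*C, computed entirely on the GPGPU. */
        cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

        /* Copy the result back to system memory. */
        cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
    }

Note that dA, dB, and dC live in the GPGPU's device memory; if you have several operations to do, you can leave the data there between calls and pay for the transfers only once.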
At an even higher level, there's a CUDA implementation of LAPACK called CULA that is available as a commercial product. A single call to a CULA routine (e.g. SPOTRF to compute the Cholesky factorization of a matrix) can copy the data to the GPGPU, use CUBLAS to do the computations, and then copy the results back to system memory.
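I haven't used CULA myself, so take this as a rough sketch; the function names (culaInitialize, culaSpotrf, culaShutdown) are my reading of the product documentation and may not be exact:

    // Hedged sketch of a Cholesky factorization through CULA.
    // A single culaSpotrf call copies A to the GPGPU, factors it
    // there, and copies the result back into A in system memory.
    #include <cula.h>

    void gpu_cholesky(int n, float *A)
    {
        culaInitialize();
        culaSpotrf('U', n, A, n);   /* LAPACK-style: upper triangle, in place */
        culaShutdown();
    }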
There are obvious tradeoffs here. From the programmer's point of view, using CULA is the easiest way to get started. However, it may be less efficient than using CUBLAS, because CULA copies the data back and forth between the GPGPU and main system memory on each call. In comparison, with CUBLAS you can move your data to the GPGPU, do many CUBLAS operations, and then copy the final results back into system memory. Writing kernels in C with CUDA extensions is the most flexible approach, but requires the greatest understanding of the CUDA architecture.
After learning all of the above, I've decided to forgo GPGPU computing for now. I hesitate to get involved because (1) the graphics card on my current system does support CUDA, but only in single precision, has only 256 megabytes of RAM, and isn't much faster than my CPU; (2) the basic CUDA development tools are free (as in beer) from Nvidia but not open source; and (3) the higher level CULA stuff isn't even free.
Where will this go in five years? Although Nvidia is the early leader, I'm not convinced that they'll be able to hold onto this lead. A fundamental problem with CUDA is that the GPGPU is not well integrated into the rest of the system: data has to be transferred to and from the GPGPU over a relatively slow PCIe bus. It would be better if the GPGPU and the CPU worked directly with the same memory. Of course, Nvidia is in no position to do this, since they don't make CPUs.
It seems likely that both AMD and Intel will start selling chips that combine conventional CPU cores with GPGPU-like cores for parallel processing. AMD has announced its concept for this, called "Fusion". Intel has been developing a prototype of a new many core architecture called Larrabee that might be turned into a GPGPU, a high performance computing product, or both.
What's really lacking so far is a good high level language for programming these systems. One interesting research project called "Brook" was developed at Stanford and promoted for a while by ATI, but it seems to have fallen by the wayside. Fortran 95 programs using the built-in matrix operations (and perhaps LAPACK library routines) could potentially be compiled directly into code that would run on a GPGPU architecture. Perhaps some new programming language oriented towards data parallel programming will take hold.