CUFFT 1.1 / 2.0 vs FFTW 3.1.2 (x86_64) vs FFTW 3.2 (Cell) comparison

The following plot extends the more complete investigation below, mainly to get an idea of how the Cell and the newer GPU architecture stack up. I chose to time the 3D out-of-place transforms in single precision. The benchmark now includes the new Tesla S1070 results (1 GPU; it reaches about 27 GFlops for 512^3) and the PS3 Cell (first generation, 6 of 8 SPEs enabled).
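To give a feel for the data volume behind a 512^3 single-precision out-of-place transform, here is a small Python sketch. It counts only the input and output arrays, assuming 8 bytes per complex element (two 32-bit floats); any scratch space a library allocates internally is not included, and the function names are made up for illustration.

```python
# Rough memory footprint of a 3D complex-to-complex single-precision FFT.
# Assumption: 8 bytes per element (two 32-bit floats). Library work areas
# are NOT counted, so real usage will be higher than these numbers.

BYTES_PER_COMPLEX64 = 8


def buffer_bytes(nx, ny, nz):
    """Bytes needed for one nx*ny*nz array of single-precision complex values."""
    return nx * ny * nz * BYTES_PER_COMPLEX64


def out_of_place_bytes(nx, ny, nz):
    """An out-of-place transform needs separate input and output arrays."""
    return 2 * buffer_bytes(nx, ny, nz)


def to_mib(n_bytes):
    """Convert bytes to MiB (2**20 bytes)."""
    return n_bytes / float(2 ** 20)


if __name__ == "__main__":
    for n in (64, 128, 256, 512):
        total = out_of_place_bytes(n, n, n)
        print("%4d^3: %10.1f MiB for input + output" % (n, to_mib(total)))
```

Under these assumptions a 512^3 transform needs 1 GiB per array, so 2 GiB for input plus output.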

CUFFT vs FFTW comparison

This benchmark was done in the same fashion as benchFFT, comparing complex-to-complex single-precision FFT speeds between FFTW and the CUDA CUFFT library. Transform sizes were limited by the amount of device memory on the GPU. "mflops" is a derived quantity, not a measured flop count; see the benchFFT page for details. As of Nov. 2007, the peak performance reported by NVIDIA for a GPU (8800 GTX?) doing FFTs on the device only (no memory transfer from host) is 52 GFlops.
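For reference, the benchFFT convention for complex transforms is mflops = 5 N log2(N) / (time for one FFT in microseconds), where N is the total number of points (the product of the dimensions for a multi-dimensional transform). A minimal Python sketch of that conversion follows; the function names are my own, not part of benchFFT or the benchmark code.

```python
import math


def benchfft_mflops(n_points, seconds):
    """benchFFT-style "mflops" for a complex transform.

    n_points is the total number of points (product of all dimensions),
    seconds is the measured time for one transform. This is a scaled
    inverse time, not a count of the operations actually executed.
    """
    microseconds = seconds * 1e6
    return 5.0 * n_points * math.log2(n_points) / microseconds


def seconds_for_rate(n_points, gflops):
    """Invert the formula: time per transform implied by a quoted rate."""
    return 5.0 * n_points * math.log2(n_points) / (gflops * 1e9)


if __name__ == "__main__":
    n = 512 ** 3
    t = seconds_for_rate(n, 27.0)
    print("512^3 at 27 GFlops implies about %.3f s per transform" % t)
    print("round trip: %.1f mflops" % benchfft_mflops(n, t))
```

Because the same nominal operation count is used for every library, the number only ranks speeds at a given size; it says nothing about how many flops a particular implementation really performs.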

The test machine is a dual-socket dual-core Intel(R) Xeon(R) CPU 5150 @ 2.66GHz, with an Intel 5000X Chipset and an NVIDIA Quadro FX 4600 GPU in a PCI Express 1.0 x16 slot.

Tests were done with fftw-3.1.2 compiled with SSE optimizations for both serial and threaded (using 4 threads) transforms, using the FFTW_MEASURE method to generate plans. The CUFFT 1.1 library and CUDA 1.1 toolkit were used on the card. Timings are reported both for the raw device speed and with the overhead of copying memory to and from the GPU included.


The code can be found here.

Overview of results

For small FFTs (<8192 elements), CUFFT is much slower than FFTW, even serial FFTW. This is likely because there is not much work to do and the transform fits in the CPU cache. Batching (a plan that executes multiple transforms) gives a much better speedup on the GPU; I expect this is where the 52 GFlops number comes from (see the forum).

For larger FFTs, CUFFT gives up to a 5x speedup over threaded FFTW (up to a 10x speedup over serial FFTW).

With memory transfers included, CUFFT is less than 2x faster than threaded FFTW, and is actually slower for 3D transforms.
-- This is with "pinned" memory allocated with the cudaMallocHost function; with pageable memory it will be even slower.

For non-power-of-2 FFTs, CUFFT is at most 2x faster than threaded FFTW.
-- For 3D non-power-of-2 transforms, it is at best 50% faster.

It doesn't make much sense to use the GPU purely as an FFT accelerator; the arithmetic intensity of the processing done on the GPU has to increase to make it useful.
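The effect of the host-device copies can be illustrated with a rough back-of-the-envelope model. This is only a sketch: it assumes one copy in and one copy out per transform, 8 bytes per single-precision complex element, no overlap between copies and compute, and the nominal 5 N log2(N) operation count from benchFFT. The device rate and link bandwidth in the example are placeholder values, not measurements from this benchmark.

```python
import math


def nominal_flops(n_points):
    """Nominal operation count used by benchFFT for a complex FFT."""
    return 5.0 * n_points * math.log2(n_points)


def flops_per_byte(n_points, bytes_per_point=8):
    """Arithmetic intensity relative to bytes moved over the bus,
    assuming the data is copied to the device once and back once."""
    return nominal_flops(n_points) / (2.0 * n_points * bytes_per_point)


def effective_gflops(n_points, device_gflops, link_gb_per_s, bytes_per_point=8):
    """Rate seen by the host when copy time is added to device time.

    device_gflops : device-only rate (assumed input)
    link_gb_per_s : sustained host<->device bandwidth in GB/s (assumed input)
    """
    flops = nominal_flops(n_points)
    t_device = flops / (device_gflops * 1e9)
    t_copy = 2.0 * n_points * bytes_per_point / (link_gb_per_s * 1e9)
    return flops / (t_device + t_copy) / 1e9


if __name__ == "__main__":
    n = 2 ** 20  # one large 1D transform
    # Placeholder numbers purely for illustration.
    print("flops per byte transferred: %.2f" % flops_per_byte(n))
    print("effective GFlops: %.2f" % effective_gflops(n, 10.0, 2.0))
```

In this model the FFT performs only 5 log2(N) / 16 nominal flops per byte moved, so the copy time is comparable to the compute time unless more work is done on the data while it sits on the device.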



1 dimensional

2 dimensional

3 dimensional