Publication: Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU

Title	Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU
Authors/Editors*	Scott Rostrup and Hans De Sterck
Where published*	Computer Physics Communications
How published*	Journal
Year*	2010
Volume
Number
Pages
Publisher
Keywords
Link
Abstract	Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.
