Publication: Accelerating Nonlinear System Identification Using the Cell BE Processor

All || By Area || By Year

Title Accelerating Nonlinear System Identification Using the Cell BE Processor
Authors/Editors* R. Rohra, R. Kanagasundaram, J.R. Green
Where published* 2nd Annual Carleton Cell BE Programming Workshop, Carleton University, 13-15 May 2009.
How published* Other
Year* 2009
Intensive scientific computing applications involving nonlinear system identification have traditionally been compute bound. The Cell Broadband Engine (Cell B.E.) has demonstrated its suitability for achieving high performance gains for computationally intensive scientific computing applications. Parallel cascade identification (PCI) is a method for modeling dynamic systems with possibly high order nonlinearities and lengthy memory, given only input/output data for the system to be modeled. This project ports PCI to the Cell B.E. to achieve high speed up and permit novel application of PCI to challenges in biomedical informatics. PCI gets its name from the parallel cascades of dynamic linear and static nonlinear components that are used to model a system. The dynamic linear component is composed of a finite impulse response (FIR) filter. The FIR filter taps can be determined using cross-correlation between the sample input and output data. The filter response is obtained via convolution, and is then used as the input to the static nonlinear component. Fast Orthogonal Search (FOS) polynomial fitting is applied to best-fit the FIR filter response to the desired cascade output, thereby determining the optimal coefficients of the Nth order polynomial that forms the static nonlinear component. Application of the polynomial to the FIR filter output produces the final cascade model output. Thus, the principle algorithms to be accelerated on the Cell BE synergistic processing elements (SPEs) are: 1) cross-correlation, 2) convolution, 3) FOS polynomial fitting, and 4) application of a polynomial to a vector. To achieve a high order of speedup, the basic algorithm has been modified considerably to best utilize the resources offered by the Cell B.E. Various different opportunities for optimizations have been exploited by using optimization techniques such as data parallelism, direct memory access (DMA), double buffering, multiple buffering, single instruction multiple data (SIMD), loop unrolling, and branch elimination. A requirement to utilizing some of these techniques is to vectorize the data structures. The complex data structures are thus decomposed into simple vectors more amenable to SIMD acceleration. The simple vectors are memory aligned to quadword boundaries and are zero padded for facilitating DMA transfers and simplifying the filter response (convolution) algorithm. This decomposition happens on-demand and the simple vectors are persisted until the data structures are modified for efficient execution. The simple vectors are distributed while ensuring contiguous memory and data span boundaries amongst the SPE’s of the Cell B.E. to attain data parallelism. Different algorithms were evaluated and implemented for the dynamic linear and static nonlinear components. The performance of each algorithm was evaluated using both the cycle-accurate Cell BE simulator, and also the Sharcnet Cell BE cluster housed at the University of Western Ontario. The dynamic linear component consists of two algorithmically similar steps; namely cross-correlation and convolution. Different SIMD strategies were used for implementing these two algorithms. The algorithm for computing cross-correlation was found to be faster as the SIMD strategy used the spu_madd intrinsic (multiply-and-add) which can perform eight floating point operations per cycle. Two different algorithms were evaluated for the polynomial fitting step of the static nonlinear component. The two algorithms demonstrated a trade-off between speed versus accuracy. The project goal of achieving high speed up has been achieved with the different components showing high performance gains. For example the first order cross-correlation has achieved an absolute speed up of 33,000% and an effective speed up (inclusive of data structure decomposition) of 500%. It is hoped that by sufficiently accelerating the PCI nonlinear system identification algorithm, novel combinations of genetic algorithms and PCI may find application in biomedical informatics applications – an area that has been compute bound to date.
Go to Bioinformatics
Back to page 37 of list