FPGA Code Acceleration



This tutorial provides an introduction to the use of FPGA hardware as a co-processor accelerator using SHARCNET resources. It covers FPGA background, programming FPGAs with the Mitrionics programming system, and using the BLASTN Mitrion virtual processor.

This tutorial is a work in progress - if you have any suggestions for information to add or find anything to be incorrect or broken please submit a ticket to the problem tracking system.


One should understand the use of bash shell variables and basic shell commands. Familiarity with traditional high-level programming languages like C and Fortran is beneficial for learning Mitrion C.


FPGA Background

FPGA is an acronym for field programmable gate array. It is a programmable hardware device that can be reconfigured to run different algorithms directly in hardware, rather than processing a stream of instructions as a typical microprocessor does. FPGAs see heavy use in areas like electrical engineering for integrated circuit prototyping. However, they have yet to make substantial inroads in the HPC market, since they have traditionally been very expensive and notoriously difficult to program.

In recent years, developments in high-level software tools that target FPGAs have made them much easier to use as HPC accelerators. They are now used as integral components of HPC systems and have been incorporated into SGI Altix and Cray X-series supercomputers.

The use of FPGAs in HPC is targeted at data-intensive applications that spend nearly all of their time in a particular mathematical kernel. That kernel should exhibit fine-grained parallelism, both by providing many independent streams of data and by allowing the operations on each stream to be pipelined. FPGAs will not help complex programs whose run time is spread over many routines. By Amdahl's Law, the part of the run time left unaccelerated caps the overall speedup. FPGAs are best at tasks that use short word length integer or fixed-point data and exhibit a high degree of parallelism.
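
Amdahl's Law makes this limit concrete. Suppose a fraction f of the original run time is spent in the kernel moved to the FPGA, and that kernel runs s times faster. The overall speedup is then given by the formula below. The values f = 0.95 and s = 35 are purely illustrative:

```latex
\[
S_{\text{total}} = \frac{1}{(1 - f) + \dfrac{f}{s}},
\qquad
S_{\text{total}}\big|_{f=0.95,\; s=35} = \frac{1}{0.05 + \dfrac{0.95}{35}} = \frac{35}{2.7} \approx 12.96,
\qquad
\lim_{s \to \infty} S_{\text{total}} = \frac{1}{1 - f} = 20 .
\]
```

In this example, a 35x faster kernel yields under 13x overall because 5% of the time stays on the host. No FPGA design, however fast, could exceed 20x for this program.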



SHARCNET has deployed an SGI Altix RASC (reconfigurable application specific computing) system at Wilfrid Laurier University. The system is an SGI Altix 450 containing an RC100 compute blade with dual Virtex-4 LX200 FPGAs, eight Itanium 2 processors at 1.6 GHz and 16 GB of system RAM.

SHARCNET also has licenses for the Mitrion SDK and for two Mitrion Virtual Processors, so both FPGAs can run Mitrion C code simultaneously. In addition, it has an officially supported, pre-compiled Mitrion BLASTN virtual processor, which can be used to accelerate BLASTN searches on the FPGA without any programming.

Due to software incompatibilities, the FPGA configuration bitstream has to be compiled on a non-Itanium system. SHARCNET provides a separate x86-based system for this purpose. Users should be able to log in and access the Mitrion SDK without any modification to their shell environment.

The RASC User Guide is freely available online for further information about the FPGA system.

Users of school should submit a problem ticket requesting to be added to the fpga group. Membership in this group is necessary to use the BLASTN virtual processor.


FPGA Programming Overview

In the past, users had to program an FPGA in a hardware description language such as VHDL or Verilog. That process required knowledge of circuit timing considerations and other complex hardware issues.

New high-level languages abstract the FPGA system to the point where a user can program in a typical C/Fortran fashion. That code is then automatically translated into a hardware description language. The implementation we use on school is Mitrion-C.

Once the hardware description for the FPGA has been created, one must compile a device-specific bitstream, which is what is actually loaded onto the FPGA (much like firmware for a device in a PC). This compilation step is commonly referred to as "place and route" or "synthesis". The synthesis software is usually provided by the vendor that manufactures the FPGA; for the RC100 FPGA blade this is Xilinx. Their tool, the Xilinx ISE, is integrated into the Mitrion-C SDK, which greatly simplifies this stage of the development process. Unfortunately, the Xilinx software does not run on Itanium. A second, x86-based system (tope) is therefore needed to compile the bitstream.

Once the bitstream has been compiled (in Mitrion parlance it is called a "virtual processor"), it can be loaded onto the FPGA using the RASC devmgr utility. This utility manages bitstreams in a database and loads them when a host program specifies that they should be used.

In addition to the Mitrion virtual processor, one must also adapt the host program to interface with the FPGA. There are a number of APIs for doing so. Mithal, the Mitrion host abstraction layer, is a C and Fortran interface to the virtual processor on the FPGA, provided by Mitrion. An alternative approach is to use RASCAL (the RASC abstraction layer), which is provided by SGI.

General FPGA programming Issues

  • FPGA programming is highly non-portable.
    • Code written for a particular FPGA or with a particular SDK cannot be expected to work elsewhere.
  • Place and route times are substantial; building the FPGA design may take more than a day of computing time.
  • There is only a limited amount of resources on the FPGA.
    • Large kernels, especially ones using double precision, can quickly exhaust the available resources.

Mitrion C

NOTE: The Mitrion-C SDK (and in particular the Xilinx ISE) is not supported on the Itanium architecture, so users should not use the SDK on school. This includes the resource-intensive and time-consuming synthesis stage, which must be done on a separate machine. Code development and simulation can be done on most platforms. To compile a bitstream, however, one has to use tope, which has the supported version of Mitrion-C installed.

Features and aspects of the language


  • At face value similar to many high level languages like C or Fortran
  • implicitly parallel language (user doesn't control threads or actively code parallelism)
    • centers around intrinsic parallelism and data-dependencies while traditional languages are sequential and focused on order-of-execution.
  • Single-Assignment Language (operations can and do occur out of order)
    • Can use C pre-processor
    • No global variables

data types

  • scalar types include int, uint, bool, float and bits, with meanings similar to their C counterparts
    • any precision can be used, e.g. uint:12 is a 12-bit unsigned integer and float:10.9 is a float with a 10-bit mantissa and a 9-bit exponent
  • collections of scalars include:
    • lists: an ordered stream of data with a specific length; no indexing
    • vectors: same as lists, but can be accessed in any order (they are indexed) and use much more buffer space on the die
    • streams: dynamic length; may contain vectors, scalars, tuples or other streams, but cannot contain a list
    • all collections can be multidimensional
    • tuples: group values of mixed types into items that can be stored in other collections
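
Because integer widths are arbitrary, it helps to size each integer to the values it actually holds. The shell sketch below computes the range of a uint:N and the smallest N that covers a given maximum value. It assumes a uint:N spans 0 to 2^N - 1 like any ordinary N-bit unsigned integer. The helper names uint_max and bits_needed are hypothetical and are not part of the Mitrion SDK:

```bash
#!/bin/bash
# Hypothetical helpers for choosing Mitrion-C integer widths.
# Assumption: uint:N holds 0 .. 2^N - 1, like any N-bit unsigned integer.

# Largest value representable by a uint:N
uint_max() {
    echo $(( (1 << $1) - 1 ))
}

# Smallest N such that a uint:N can hold every value in 0 .. max
bits_needed() {
    local max=$1 n=1
    while (( (1 << n) - 1 < max )); do
        n=$(( n + 1 ))
    done
    echo "$n"
}

uint_max 12       # 4095
bits_needed 3     # 2  (e.g. a code with only four values, 0..3)
bits_needed 1000  # 10
```

Narrower types leave more FPGA resources for replicating the computation, which is where the speedup comes from.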

functions

  • similar to C / Fortran
    • all variables have to be passed in / out of functions
  • don't have to be fully typed (can be polymorphic)
  • can be nested
  • ability to call external VHDL IP blocks
  • intrinsics exist for some functions, as well as typical operators for different scalar types

block expressions

  • foreach
    • parallel operation on a collection
  • for
    • sequential (iterative) operation over a collection
    • requires at least one variable with a loop-iteration dependency
  • while
    • sequential operation over a collection of indeterminate length (determined by the data itself)
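
The difference between foreach and for comes down to data dependencies. In the recurrences below (plain math, not Mitrion-C syntax; g stands for any per-element computation), each y_i depends only on x_i, so all of them can be computed at once. Each running sum s_i, by contrast, needs s_{i-1}, which is the kind of loop-iteration dependency a for loop carries:

```latex
\[
\text{foreach:}\quad y_i = g(x_i), \quad i = 1, \dots, n
\qquad\qquad
\text{for:}\quad s_0 = 0,\quad s_i = s_{i-1} + x_i, \quad i = 1, \dots, n
\]
```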

data I/O

  • when the program starts, it expects all input to be in the external RAM banks; when it ends, the results are stored there
    • tokens that reference these banks are passed to main()
  • internal RAM banks can be used for intermediate data; they are created with _memcreate
    • internal RAM banks are accessed via instance tokens, which order the I/O


Program Suitability

One typically starts with a high-level program in C/C++/Fortran/etc. that shows potential for being accelerated. It should be amenable to fine-grained parallelism. The performance gains on FPGAs are obtained by designing the algorithm so that many independent operations can occur simultaneously. The clock frequency of an FPGA is typically lower than that of a conventional CPU, but one can do much more in one clock on the FPGA than on the CPU.
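
A rough way to reason about this trade-off is to compare work per second rather than clock rates. Assume the CPU completes r useful operations per clock at frequency f_CPU, and the FPGA design completes N independent operations per clock at f_FPGA. The calculation below uses the 1.6 GHz Itanium 2 clock of the SHARCNET system. The values r = 1 and a 100 MHz FPGA design clock are hypothetical, chosen only to illustrate the arithmetic:

```latex
\[
S = \frac{N \, f_{\mathrm{FPGA}}}{r \, f_{\mathrm{CPU}}}
  = \frac{N \times 100\ \text{MHz}}{1 \times 1600\ \text{MHz}}
  = \frac{N}{16},
\qquad S > 1 \iff N > 16 .
\]
```

Under these assumed numbers, the FPGA design must complete more than 16 independent operations every clock just to match the CPU. This is why fine-grained parallelism is essential.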

One should have a very good idea of how long the various parts of the program take to run; in other words, the program should be profiled. It is important to establish that nearly all of the execution time is spent in one or more subroutines or functions, all of which must be parallelizable. The degree to which a particular program can be accelerated is governed by Amdahl's Law, which applies to any parallel system.

If the code depends on a language feature that does not exist in Mitrion-C, one can attempt to implement the missing functionality. It may also be that the particular algorithm is poorly suited to the FPGA and/or Mitrion-C.

Interacting with the FPGA

Transferring data to the FPGA and activating the accelerated portion of the code are accomplished with either the SGI RASCAL API or the Mitrion Mithal API. This does not require significant modification of the host CPU code, but it is something one has to take into account when porting a program.

Mitrion SDK PE

The Mitrion SDK PE allows developers to write and simulate programs for the FPGA. This development platform can be obtained for free after registering online. It is available for Windows, Linux and Mac OS X. Users do not need to download and install this software, since the official SDK version is accessible on tope. However, a local installation may provide a better development experience.

For Linux, packages are available for a number of standard distributions. The only requirement for the software is the Sun Java JRE, version 1.4 or later. I've encountered severe memory leaks trying to use the IDE with version 1.6, but 1.5 seems to work fine. After installing, users may want to add the shared library path:

export LD_LIBRARY_PATH="/usr/share/mitrion/mithal/libs/sim/linux-x86_64:$LD_LIBRARY_PATH"
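
If LD_LIBRARY_PATH is initially empty, the export above leaves a trailing colon. The dynamic linker treats that empty entry as the current directory. The more careful variant below inserts the separator only when needed and adds the path only once, so it is safe to place in a shell startup file such as ~/.bashrc. It assumes the simulator libraries live in the same directory as above; the variable name MITHAL_SIM_LIBS is hypothetical:

```bash
# Hypothetical variable holding the Mithal simulator library directory (same path as above)
MITHAL_SIM_LIBS=/usr/share/mitrion/mithal/libs/sim/linux-x86_64

# Prepend the directory unless it is already on the path;
# ${VAR:+...} adds the ":" separator only when LD_LIBRARY_PATH is non-empty.
case ":${LD_LIBRARY_PATH}:" in
    *":${MITHAL_SIM_LIBS}:"*) ;;  # already present, nothing to do
    *) export LD_LIBRARY_PATH="${MITHAL_SIM_LIBS}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" ;;
esac
```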

Keep in mind that the first time you run the mitrion server, you should do so by hand and enter the validation code that you were emailed after registering.

For a quick example that shows how to compile code and run the simulator, see the example in:


One can also use the mitrion-ide GUI. A quick way to try out the examples is to load all of the program demos into the IDE, following the procedure outlined here:


Mitrion Help

A good place to send questions or inquire about Mitrion C is the My Mitrion web forum. One can register for a free account.

Creating Bitstreams (synthesis)

Once users are confident in their code, they must compile the FPGA program (synthesis/place and route) on tope, the x86_64 staging server that has the full Mitrion SDK product installed. This produces a binary file that can then be loaded on school.

Example Walkthrough

This command sequence illustrates the process of compiling an example Mitrion program (Swap) on SHARCNET using tope and school:

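# On tope (x86_64): copy the RC100 examples and build the Swap bitstream
cd /work/$USER
cp -r /opt/sharcnet/mitrion/current/sdk-xl/mitrion/doc/examples/RC100 ./mitrion_examples
cd mitrion_examples/Swap
make bitstream    # synthesis / place and route; this step can take a long time
# Then log in to school (Itanium) and return to the same directory under /work
ssh school
cd /work/$USER/mitrion_examples/Swap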

Now we need to modify the Makefile in the current directory to point to the Mithal installation on school. Note the block of code at the top of the Makefile that sets the variable MITHAL_ROOT. Change the path to correspond to the SDK install on school:

ifeq ($(shell uname -s), Darwin)
   # MITHAL_ROOT=/opt/mitrion/mithal

Now compile the host program on the Itanium, load the bitstream into the FPGA registry, and run the program:

# Build the host program on school (Itanium)
make fpgahost
# Register the bitstream and its configuration files with the RASC device manager
devmgr -a -n swap.$USER.151 \
    -b /work/$USER/mitrion_examples/Swap/mitrion_xst.bin \
    -c /work/$USER/mitrion_examples/Swap/bitstream.cfg \
    -s /work/$USER/mitrion_examples/Swap/core_services.cfg
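
The bitstream is built on tope and registered from school. A missing file at this point usually means the make bitstream step did not finish. The sketch below adds a helper that checks that the three files passed to devmgr above exist and are non-empty before they are registered. The helper, check_swap_outputs, is hypothetical and is not part of the Mitrion or RASC tools:

```bash
# Hypothetical helper: confirm the build outputs used by devmgr are present.
# Usage: check_swap_outputs DIR   (returns non-zero if any file is missing)
check_swap_outputs() {
    local dir=$1 f missing=0
    for f in mitrion_xst.bin bitstream.cfg core_services.cfg; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f (did 'make bitstream' finish on tope?)" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, run check_swap_outputs /work/$USER/mitrion_examples/Swap, and call devmgr only if it succeeds.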


Mitrion C Language: The Mitrion C programming language guide.
RASC User Guide: This is the online SGI RASC user guide, including information on how to use RASCAL (RASC abstraction layer).
Mithal: The Mitrion host abstraction layer API guide. This is a C and Fortran API for interfacing with the Mitrion virtual processor from the host program.
Mitrion on RASC: Details concerning using Mitrion on SGI RASC systems.
Mitrion on the RC100 Compute Blade: Details concerning using Mitrion on the SGI RC100 Compute Blade.
Mitrion SDK: Installation and use of the full Mitrion SDK product (not the same as the PE SDK).
Quick Reference Page: A Mitrion quick reference card.


BLASTN Mitrion Virtual Processor

NOTE: In order to use this software, users must be members of the fpga group. To join this group, please submit a request to the problem ticket system.

The official Mitrion BLASTN Virtual Processor can be found on school in:


This contains the Mitrion-C code for BLAST as well as a modified version of NCBI BLAST 2.2.13 in which the calculation core is replaced with calls to the FPGA.


The following documents are for version 1.1; the documentation for the most recent version can be found in the docs directory of the BLAST installation.

Mitrion BLAST QuickStart Guide
Mitrion BLAST Users Guide

Quick Example

The following example shows how to run the FPGA accelerated blastn program on school using the example ecoli data:

cd /opt/sharcnet/local/mitrion/blast/1.1/mitc-blast-1.1/ncbi/bin/
blastall -fpga -p blastn -i ../../test/sample-query -d ../../test/ecoli.nt

Running blastall without arguments prints a list of options; these are explained further in the Users Guide.
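
For real searches, it is usually more convenient to call blastall by its absolute path and write the report to a file in your own /work directory rather than the install tree. The sketch below wraps the quick example in a hypothetical shell function, run_fpga_blastn. The options -o (output file) and -e (expectation value, default 10) are standard NCBI blastall options, while -fpga is the Mitrion-specific option used in the quick example. The BLASTALL override variable is an assumption added so the command can be replaced, for example by echo to print it instead of running it:

```bash
# Install location from the quick example above
BLAST_HOME=/opt/sharcnet/local/mitrion/blast/1.1/mitc-blast-1.1

# Hypothetical wrapper: run_fpga_blastn QUERY DATABASE OUTFILE [EVALUE]
run_fpga_blastn() {
    local query=$1 db=$2 out=$3 evalue=${4:-10}
    "${BLASTALL:-$BLAST_HOME/ncbi/bin/blastall}" -fpga -p blastn \
        -i "$query" -d "$db" -e "$evalue" -o "$out"
}
```

On school, the ecoli example then becomes run_fpga_blastn $BLAST_HOME/test/sample-query $BLAST_HOME/test/ecoli.nt /work/$USER/ecoli_blastn.txt.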


Examples of Successful Ports to Mitrion C / SGI RASC

Mitrion-C Open Bio Project

  • project to accelerate key bioinformatics programs on FPGAs
  • has implemented BLASTN (nucleotide search) so far and is working on BLASTP (protein search)
  • open source; runs on our software / hardware
  • the first version of Mitrion-Accelerated BLAST runs BLASTN searches 20x faster in total (wall-clock) run time, per chip, than a contemporary Itanium 2 CPU
    • the accelerated parts of the search, running on the FPGA, are 35x faster
  • it is not clear whether this beats running mpiBLAST on a cluster
    • mpiBLAST is fully featured (it already does BLASTP searches, etc.) and scales well (even super-linearly) to at least 128 processors

Two Point Angular Correlation Function

  • two-point angular correlation function implemented in Mitrion-C on the RC100
  • measured 9.5x speedup over an optimized C implementation on the host CPU (Itanium 2), with potential for up to 20x
