- 1 Overview
- 2 FPGA Background
- 3 The SHARCNET FPGA System
- 4 FPGA Programming Overview
- 5 Mitrion C
- 5.1 Current Status
- 5.2 Usage at SHARCNET
- 5.3 Features and aspects of the language
- 5.4 Development
- 6 Example Walkthrough
- 7 Documentation
- 8 BLASTN Mitrion Virtual Processor
- 9 Examples of Successful Ports to Mitrion C / SGI RASC
Note: the system that provided this functionality (School) has since been decommissioned. We've left this information here for posterity, but at present there are no systems at SHARCNET supporting this model of FPGA computing.
This tutorial provides an introduction to the use of FPGA hardware as a co-processor accelerator using SHARCNET resources. It covers FPGA background, programming FPGAs with the Mitrionics programming system, and using the BLASTN Mitrion virtual processor.
This tutorial is a work in progress. If you have suggestions for information to add, or find anything incorrect or broken, please submit a ticket to the problem tracking system.
FPGA is an acronym for field programmable gate array. It is basically a programmable hardware device, which can be reconfigured to run different algorithms in hardware, rather than processing a stream of instructions as is done in a typical microprocessor. While FPGAs see heavy use in areas like electrical engineering for integrated circuit prototyping, they have yet to make substantial inroads in the HPC market since they have traditionally been very expensive and notoriously difficult to program.
In recent years developments in high-level software tools that target FPGAs have made them much easier to use as HPC accelerators. They are now utilized as integral components in HPC systems, having been incorporated in SGI Altix and Cray X-series supercomputers. Recent developments include FPGAs provided as PCIe cards (in the same fashion as GPUs) or even as CPU-socket add-ons.
The use of FPGAs in HPC is targeted at data-intensive applications that spend nearly all of their time in a particular mathematical kernel that exhibits fine-grained parallelism, both in being able to provide many independent streams of data and in pipelining operations on each stream. FPGAs will not help complex programs whose speedup is constrained by Amdahl's Law. They are best at tasks that use short word length integer or fixed point data and exhibit a high degree of parallelism.
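To make the remark about short word length fixed point data concrete, the following Python sketch emulates fixed point arithmetic using nothing but integers. It is only an illustration: the helper names and the choice of 8 fractional bits are assumptions made for this example and are not taken from Mitrion-C or any FPGA toolchain.

```python
# Illustrative sketch only: fixed point values stored as plain integers.
# FRAC_BITS and all function names are assumptions made for this example.

FRAC_BITS = 8  # number of bits to the right of the binary point


def to_fixed(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Scale a real number by 2**frac_bits and round to the nearest integer."""
    return round(x * (1 << frac_bits))


def from_fixed(q: int, frac_bits: int = FRAC_BITS) -> float:
    """Convert a fixed point integer back to a real number."""
    return q / (1 << frac_bits)


def fixed_mul(a: int, b: int, frac_bits: int = FRAC_BITS) -> int:
    """Multiply two fixed point values.

    The raw product carries 2*frac_bits fractional bits, so it is shifted
    right by frac_bits to return to the original scale.
    """
    return (a * b) >> frac_bits


if __name__ == "__main__":
    a = to_fixed(1.5)
    b = to_fixed(2.0)
    print(from_fixed(fixed_mul(a, b)))
```

Only an integer multiply and a shift are needed per product, which is the kind of narrow integer operation the paragraph above describes as a good fit for an FPGA.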
The SHARCNET FPGA System
SHARCNET has deployed an SGI Altix RASC (reconfigurable application specific computing) system at Wilfrid Laurier University, which is named school.sharcnet.ca. For further information on this system, please see School_(FPGA_system).
FPGA Programming Overview
In the past users had to program an FPGA with a hardware description language like VHDL/Verilog, a process that required knowledge of circuit timing considerations and other complex hardware issues.
New high-level languages have been developed that abstract the FPGA system to the level where a user can program in typical C/Fortran fashion and have that code automatically translated to a description language. The implementation we've gone with for school is Mitrion-C.
Once the hardware description for the FPGA has been created, one must then compile a device specific bitstream, which is what is actually loaded onto the FPGA (much like firmware for a device in a PC). The process of compiling the bitstream is commonly referred to as "place and route", or "synthesis". The synthesis software is usually provided by the hardware vendor who manufactures the FPGA; in the case of the RC100 FPGA blade this is Xilinx. Their tool, the Xilinx ISE, is integrated into the Mitrion-C SDK, which greatly simplifies this stage of the development process. Unfortunately, the Xilinx software does not run on Itanium, hence the need for a second x86 based system (tope) to compile the bitstream.
Once the bitstream has been compiled (it is called a "virtual processor" in Mitrion parlance) it can be loaded onto the FPGA using the devmgr RASC utility, which manages bitstreams in a database and loads them when a host program specifies that they should be used.
In addition to the Mitrion virtual processor, one must also instrument the host system program to interface with the FPGA. There are a number of APIs for doing so. Mithal, the Mitrion abstraction layer, is a C and Fortran interface to the virtual processor on the FPGA provided by Mitrion. An alternative approach is to use RASCAL (the RASC abstraction layer), which is provided by SGI.
General FPGA Programming Issues
- FPGA programming is highly non-portable.
- Cannot expect code written for a particular FPGA or using a particular SDK to work elsewhere.
- Place and route times are substantial - may require over 1 day of computing time just to build the FPGA design.
- There are only limited resources on the FPGA.
- Large kernels, especially ones using double precision, can quickly exhaust available resources.
Important Update (2010 06)!: Mitrionics is closing down operations. As such, SHARCNET has no support for the Mitrion platform moving forward. This tutorial may still be of interest, and users are certainly able to continue using the software at SHARCNET; however, those who wish to move forward with FPGA programming beyond the VHDL/Verilog level are advised to learn a different platform. The Mitrionics IP is being sold (there was some potential for it to be open-sourced, but the shareholders decided against it).
Usage at SHARCNET
NOTE: The Mitrion-C SDK (and in particular, the Xilinx ISE) is not supported on the Itanium architecture, so users should not use the SDK on school. This includes the resource intensive and time-consuming synthesis stage, which must be done on a separate machine. Code development and simulation can be done on most platforms. To compile a bitstream one has to use tope, which has the supported version of Mitrion-C installed.
Features and aspects of the language
- At face value similar to many high level languages like C or Fortran
- implicitly parallel language (user doesn't control threads or actively code parallelism)
- centers around intrinsic parallelism and data-dependencies while traditional languages are sequential and focused on order-of-execution.
- Single-Assignment Language (operations can and do occur out of order)
- Can use C pre-processor
- No global variables
- scalar types include int, uint, bool, float, bits with same meaning as in C
- ability to use any type of precision; e.g. uint:12 == 12 bit unsigned integer, float:10.9 == float with 10 bit mantissa and 9 bit exponent
- collections of scalars include:
- lists: an ordered stream of data with a specific length, no indexing
- vector: same as list, but can be accessed in any order (it's indexed) and uses much more buffer space on die
- stream: dynamic length, may contain vectors, scalars, tuples or other streams; can't contain a list
- all collections can be multidimensional
- tuples: create items for other collections with mixed types
- similar to C / Fortran
- all variables have to be passed in / out of functions
- don't have to be fully typed (can be polymorphic)
- can be nested
- ability to call external VHDL IP blocks
- intrinsics exist for some functions, as well as typical operators for different scalar types
- parallel operation on a collective
- sequential (iterative) operation over a collective
- requires at least one variable which has a loop iteration dependency
- sequential operation over a collective of indeterminate length (specified by data itself)
- when program starts it expects all input to be in external ram banks, when it ends results are stored there
- tokens to reference these are passed to main()
- can use internal ram banks for intermediate data, created with _memcreate
- internal ram banks are accessed via instance tokens, these order the I/O
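The three kinds of operation over a collection listed above (parallel, sequential with a loop iteration dependency, and sequential over data of indeterminate length) can be contrasted with a short sketch. This is plain Python rather than Mitrion-C, and all names and the sentinel convention are assumptions made for illustration; it only shows where the data dependencies lie, not how the hardware would be generated.

```python
# Illustrative Python, not Mitrion-C. All names here are hypothetical.

samples = [3, 1, 4, 1, 5]

# 1. Parallel operation: each output depends only on its own input element,
#    so all elements could in principle be processed at the same time.
squared = [x * x for x in samples]


# 2. Sequential operation: `total` is carried from one iteration to the next
#    (a loop iteration dependency), so the iterations must run in order.
def running_sum(values):
    total = 0
    out = []
    for v in values:
        total = total + v
        out.append(total)
    return out


# 3. Indeterminate length: the data itself decides when to stop.
#    Here a sentinel value marks the end of the stream (an assumed convention).
def take_until_sentinel(stream, sentinel=0):
    out = []
    it = iter(stream)
    value = next(it, sentinel)
    while value != sentinel:
        out.append(value)
        value = next(it, sentinel)
    return out


if __name__ == "__main__":
    print(squared)
    print(running_sum(samples))
    print(take_until_sentinel([7, 2, 0, 9]))
```

Only the first form exposes independent work for every element; in the other two the dependency between iterations limits how much can happen at once.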
One typically starts with a high-level program in C/C++/Fortran/etc that shows potential for being accelerated. It should be amenable to fine-grained parallelism, as the performance gains in FPGAs are obtained by designing the algorithm such that many independent operations can occur simultaneously. The clock frequency of an FPGA is typically slower than on a conventional CPU, but one can do much more in one clock on the FPGA than on the CPU.
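The trade-off between clock frequency and work done per clock can be put into numbers with a back-of-the-envelope sketch. The figures used in the example call are hypothetical and are not measurements of school, the RC100, or any particular CPU; the function names are likewise made up for this illustration.

```python
# Rough throughput model: operations per second = clock rate * operations per clock.
# All numbers in the example below are hypothetical, chosen only for illustration.


def ops_per_second(clock_hz: float, ops_per_clock: float) -> float:
    """Sustained operations per second under this simple model."""
    return clock_hz * ops_per_clock


def break_even_ops_per_clock(cpu_clock_hz: float,
                             cpu_ops_per_clock: float,
                             fpga_clock_hz: float) -> float:
    """Operations per clock the FPGA design must sustain to match the CPU."""
    return ops_per_second(cpu_clock_hz, cpu_ops_per_clock) / fpga_clock_hz


if __name__ == "__main__":
    # Hypothetical: a 2 GHz CPU sustaining 4 operations per clock
    # versus an FPGA design clocked at 100 MHz.
    needed = break_even_ops_per_clock(2.0e9, 4, 100e6)
    print(f"FPGA must sustain {needed:.0f} operations per clock to break even")
```

Under these assumed numbers the FPGA design only pays off once it keeps many independent operations in flight on every clock, which is why fine-grained parallelism matters so much.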
One should have a very good idea of how long various parts of their program take to run; in other words, the program should be profiled. It is important to determine that nearly all of the execution time occurs in one or more subroutines or functions, all of which must be parallelizable. The degree to which a program can be accelerated is bounded by the fraction of run time spent in the parallelizable part; this limit is commonly known as Amdahl's Law, and it applies to any parallel system.
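As a minimal sketch of why the profile matters, Amdahl's Law can be written as S = 1 / ((1 - p) + p / s), where p is the fraction of run time spent in the accelerated kernel and s is the speedup of that kernel alone. The function name and the example fractions below are assumptions chosen for illustration.

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when a fraction of the run time is sped up by a given factor.

    accelerated_fraction: share of the original run time spent in the kernel (0..1)
    kernel_speedup: how many times faster the kernel alone becomes (> 0)
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical profiles: the same 10x kernel speedup gives very different
    # overall results depending on how much time the kernel accounts for.
    for fraction in (0.50, 0.90, 0.99):
        print(fraction, round(amdahl_speedup(fraction, 10.0), 2))
```

Even an infinitely fast kernel cannot push the overall speedup past 1 / (1 - p), so a program that spends only 90% of its time in the kernel can never run more than 10x faster.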
If the code depends on a language feature that does not exist in Mitrion-C, one can attempt to program the missing functionality. It may be that the particular algorithm is poorly suited for the FPGA and / or Mitrion.
Interacting with the FPGA
Transfer of data and activation of the accelerated portion of the code on the FPGA is accomplished by using either the SGI RASCAL API or the Mitrion Mithal API. This does not require significant modification to the host CPU code, but it is something that one has to take into account when porting their program.
Mitrion SDK PE
The Mitrion SDK PE will allow developers to write and simulate programs for the FPGA. This development platform can be obtained for free after registering at this registration page. It is available for Windows, Linux and Mac OS X. It is not necessary for users to download and install this software, as the official SDK version is accessible on tope, but it may provide for a better development experience.
With respect to Linux, a number of packages are available for standard distributions. The only requirement for the software is the Sun Java JRE, version 1.4 or later. I've encountered severe memory leaks trying to use the IDE with version 1.6, but 1.5 seems to work fine. After installing, users may want to add the shared library path:
Keep in mind that the first time you run the mitrion server, you should do so by hand and enter the validation code that you were emailed after registering.
For a quick example that shows how to compile code and run the simulator, see the example in:
One can also use the mitrion-ide GUI. A quick way to test out examples is by following the procedure to load all of the program demos into the IDE, which is outlined here:
A good place to send questions or inquire about Mitrion C is the My Mitrion web forum. One can register for a free account.
Creating Bitstreams (synthesis)
Once a user is confident of their code, they will then have to compile the FPGA program (synthesis/place and route) on the x86_64 staging server containing the full Mitrion SDK product. This machine is tope. This will produce a binary file that can then be loaded on school.
This command sequence illustrates the process of compiling an example Mitrion program (Swap) on SHARCNET using tope and school:
ssh tope.sharcnet.ca
cd /work/$USER
cp -r /opt/sharcnet/mitrion/current/sdk-xl/mitrion/doc/examples/RC100 ./mitrion_examples
cd mitrion_examples/Swap
make bitstream
ssh school
cd /work/$USER/mitrion_examples/Swap
Now we need to modify the Makefile in the current directory to point to the Mithal installation on school. Note the block of code at the top of the Makefile that sets the variable MITHAL_ROOT. Change the path to correspond to the SDK install on school:
ifeq ($(shell uname -s), Darwin)
MITHAL_ROOT=/usr/share/mitrion/mithal
else
# MITHAL_ROOT=/opt/mitrion/mithal
MITHAL_ROOT=/opt/sharcnet/local/mitrion/sdk/1.4.1/share/mitrion/mithal
endif
Now compile the host program on the Itanium, load the bitstream into the FPGA registry, and run the program:
make fpgahost
devmgr -a -n swap.$USER.151 -b /work/$USER/mitrion_examples/Swap/mitrion_xst.bin -c /work/$USER/mitrion_examples/Swap/bitstream.cfg -s /work/$USER/mitrion_examples/Swap/core_services.cfg
./swap
- Mitrion C Language: The Mitrion C programming language guide.
- RASC User Guide: This is the online SGI RASC user guide, including information on how to use RASCAL (RASC abstraction layer).
- Mithal: The Mitrion host abstraction layer API guide. This is a C and Fortran API for interfacing with the Mitrion virtual processor from the host program.
- Mitrion on RASC: Details concerning using Mitrion on SGI RASC systems.
- Mitrion on the RC100 Compute Blade: Details concerning using Mitrion on the SGI RC100 Compute Blade.
- Mitrion SDK: Installation and use of the full Mitrion SDK product (not the same as the PE SDK).
- Quick Reference Page: A Mitrion quick reference card.
BLASTN Mitrion Virtual Processor
NOTE: In order to use this software, users must be members of the fpga group. To join this group, please submit a request to the problem ticket system.
The official Mitrion BLASTN Virtual Processor can be found on school in:
This contains the Mitrion-C code for BLAST as well as a modified version of NCBI BLAST 2.2.13 for replacing the calculation core with calls to the FPGA.
The following is for version 1.1; the documentation for the most recent version can be found in the docs directory of the BLAST installation.
The following example shows how to run the FPGA accelerated blastn program on school using the example ecoli data:
cd /opt/sharcnet/local/mitrion/blast/1.1/mitc-blast-1.1/ncbi/bin/
blastall -fpga -p blastn -i ../../test/sample-query -d ../../test/ecoli.nt
One can run blastall without arguments to see a list of options; these are explained further in the Users Guide.
Examples of Successful Ports to Mitrion C / SGI RASC
- project to accelerate key bioinformatics programs on FPGAs
- currently have implemented BLASTN (nucleotide search) and are working on BLASTP (protein search)
- open source, runs on our software / hardware
- The first version of Mitrion-Accelerated BLAST runs BLASTN searches with a 20x faster total (wall clock) run time per chip compared to a contemporary Itanium2 CPU
- The performance of the accelerated parts of the search, operating on the FPGA, is 35x faster
- not clear if this is a winner versus running mpiBLAST on a cluster
- it's fully featured (already does P searches, etc) and scales well (even super-linear) to at least 128 procs
- two point angular correlation function using Mitrion-C on the RC-100
- measured 9.5x speedup over an optimized C implementation on the host CPU (Itanium2), with a potential of up to 20x
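The two BLASTN figures quoted above (20x total, 35x for the accelerated parts) can be related to each other with a small worked calculation. The sketch below assumes the simple Amdahl model S = 1 / ((1 - p) + p / s) and solves it for p, the share of the original run time that was moved to the FPGA; this interpretation is an assumption made here and is not a figure reported by the BLAST project. The function name is hypothetical.

```python
def implied_accelerated_fraction(total_speedup: float, kernel_speedup: float) -> float:
    """Solve S = 1 / ((1 - p) + p / s) for p.

    Rearranging gives p = (1 - 1/S) / (1 - 1/s).
    total_speedup (S): overall wall clock speedup
    kernel_speedup (s): speedup of the accelerated part alone (must exceed 1)
    """
    if kernel_speedup <= 1.0:
        raise ValueError("kernel_speedup must be greater than 1")
    if not 1.0 <= total_speedup <= kernel_speedup:
        raise ValueError("total_speedup must lie between 1 and kernel_speedup")
    return (1.0 - 1.0 / total_speedup) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    p = implied_accelerated_fraction(20.0, 35.0)
    print(f"implied accelerated fraction: {p:.3f}")
```

Under this model, the quoted numbers would correspond to roughly 98% of the original run time being spent in the accelerated parts, with the remaining host-side work accounting for the gap between 35x and 20x.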