HSA - Chair of Computer Science 3 (Computer Architecture)

HSA on FPGA

Emerging applications such as deep learning, ultra-high-definition image and video processing, and virtual reality demand unprecedented amounts of processing power in ever smaller devices. Both industry and academia regard heterogeneous architectures as a solution to this problem because they deliver significantly better performance per watt. In addition to the more common GPUs and DSPs, Field Programmable Gate Arrays (FPGAs) broaden the selection of hardware available to system designers. Internally, FPGAs consist mostly of simple lookup tables and flip-flops. These components are configured, and can be reconfigured, to implement complex algorithms directly in hardware. This structure offers practically unlimited reconfigurability, low latency and, most importantly, very high energy efficiency thanks to the dedicated circuitry.

These unique features make FPGAs an indispensable part of the solution to these problems. However, their inherent complexity makes them significantly less accessible than other kinds of accelerators. New ways of interacting with FPGAs in heterogeneous systems are therefore needed to exploit their full potential, especially when they are used in conjunction with other accelerator types.

To this end, programming an FPGA must become considerably simpler for developers. Today, heterogeneous systems typically use Khronos’ open standard “OpenCL” to interact with their devices. It includes the language OpenCL C for writing “kernels” that can be offloaded to an accelerator device, as well as a C/C++ API for describing these kernel dispatches from the host software.
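The split between kernel code and host-side dispatch can be illustrated with a small conceptual model. The sketch below is plain Python, not the OpenCL API: the names `vector_add_kernel` and `dispatch` are hypothetical, and the sequential loop merely stands in for the parallel execution an accelerator would provide.

```python
from typing import Callable, List


def vector_add_kernel(global_id: int, a: List[float], b: List[float],
                      out: List[float]) -> None:
    """Hypothetical kernel: one invocation processes exactly one work-item."""
    out[global_id] = a[global_id] + b[global_id]


def dispatch(kernel: Callable[..., None], global_size: int, *args) -> None:
    """Stand-in for a host-side kernel dispatch.

    A real runtime would run the work-items in parallel on the device;
    this model simply calls the kernel once per work-item index.
    """
    for global_id in range(global_size):
        kernel(global_id, *args)


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [10.0, 20.0, 30.0, 40.0]
    out = [0.0] * len(a)
    dispatch(vector_add_kernel, len(a), a, b, out)
    print(out)  # [11.0, 22.0, 33.0, 44.0]
```

The kernel only describes the work of a single element, while the host decides how many work-items to launch. This separation is what allows the same kernel to be offloaded to different kinds of devices.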

In the past, this approach has also been applied successfully to FPGAs. A method called high-level synthesis (HLS) was created to make this form of automated hardware design from a higher-level language possible. Traditionally, accelerators are designed by hand, which is an expensive and time-consuming process but typically leads to the best results.

This manual process is not always feasible, for example when short and strict deadlines need to be met. With HLS, however, the application behavior is described in a less complex source language and then compiled to synthesizable HDL (hardware description language) code. This reduces the need to write HDL by hand, which is time-consuming, and considerably improves the accessibility of FPGAs.

However, relying on the OpenCL standard alone is not sufficient for today’s diverse landscape of programming languages. Every developer should be able to use the language best suited to the task at hand without compromising on the use of accelerators. What is really needed is not only a hardware-independent API, but also a language-agnostic specification at a sufficiently low level. With this, every kind of accelerator can be targeted from any programming language. This issue is especially urgent for FPGAs, as they are inherently harder to use.

The HSA Approach

The not-for-profit Heterogeneous System Architecture (HSA) Foundation targets these management problems of heterogeneous hardware. It has published a set of royalty-free specifications to provide a uniform solution for different architectures. Using this reference protocol, compilers can map constructs that describe parallelism from any language to the vendor-neutral API.

Furthermore, the HSA specification defines a virtual ISA, called HSAIL, which provides an abstraction for the kernel code itself. A “finalizer” is then used to generate the architecture-specific code just in time. With an HSA-enabled compiler front end and a suitable finalizer, a language- and accelerator-independent workflow can therefore be realized. However, HSAIL is modeled more closely on regular GPU assembler code than on arbitrary logic. For this reason, the mapping to FPGAs is significantly more complex than the mapping to instruction-based accelerators.
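The role of a finalizer can be sketched as a lookup from one shared virtual program to several target-specific back ends. The following is a toy model under stated assumptions: the tuple-based instruction format is invented for illustration and is not HSAIL syntax, and the names `finalize`, `finalize_for_instruction_target` and `finalize_for_fpga` are hypothetical. A real FPGA finalizer would involve high-level synthesis steps such as scheduling and resource binding, which this sketch leaves out entirely.

```python
from typing import Callable, Dict, List, Tuple

# Toy virtual instruction: (opcode, destination, source1, source2).
# This format is an assumption for illustration, not HSAIL.
Instr = Tuple[str, str, str, str]


def finalize_for_instruction_target(program: List[Instr]) -> List[str]:
    """Instruction-based target: each virtual instruction maps to one
    native-style instruction line."""
    return [f"{op.upper()} {dst}, {s1}, {s2}" for op, dst, s1, s2 in program]


def finalize_for_fpga(program: List[Instr]) -> List[str]:
    """FPGA-style target: each virtual instruction becomes a dedicated
    operator whose inputs are wired to earlier results."""
    symbols = {"add": "+", "mul": "*"}
    return [f"wire {dst} = {s1} {symbols[op]} {s2};"
            for op, dst, s1, s2 in program]


# Registry of hypothetical finalizers, keyed by target name.
FINALIZERS: Dict[str, Callable[[List[Instr]], List[str]]] = {
    "gpu": finalize_for_instruction_target,
    "fpga": finalize_for_fpga,
}


def finalize(program: List[Instr], target: str) -> List[str]:
    """Translate the shared virtual program for the chosen target."""
    return FINALIZERS[target](program)


if __name__ == "__main__":
    program = [("mul", "t0", "a", "b"), ("add", "y", "t0", "c")]
    print(finalize(program, "gpu"))   # ['MUL t0, a, b', 'ADD y, t0, c']
    print(finalize(program, "fpga"))  # ['wire t0 = a * b;', 'wire y = t0 + c;']
```

The point of the sketch is only that the front end produces the virtual program once, and the choice of target is deferred to the finalizer.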

A Shared HSA Workflow for FPGAs

Nevertheless, an analysis showed that it is possible to build an HLS-based HSA workflow that seamlessly extends the existing GPU and DSP solutions. A proof-of-concept implementation realized this by automatically combining generated circuits with traditional hardware components such as SIMT (single instruction, multiple thread) schedulers and caches.
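How a SIMT scheduler might feed work-items to a generated circuit can be pictured with a generic model. The design of the proof-of-concept scheduler is not described here, so the sketch below is only an assumption-laden illustration: `schedule_wavefronts` and `run_on_datapath` are hypothetical names, the datapath is an ordinary Python function standing in for a generated circuit, and the lanes of a wavefront are processed sequentially rather than in lock-step.

```python
from typing import Callable, List, Sequence


def schedule_wavefronts(total_items: int, width: int) -> List[List[int]]:
    """Group work-item indices into wavefronts of at most `width` lanes.

    The last wavefront may be only partially filled.
    """
    return [list(range(start, min(start + width, total_items)))
            for start in range(0, total_items, width)]


def run_on_datapath(datapath: Callable[[int], int],
                    inputs: Sequence[int], width: int) -> List[int]:
    """Issue every wavefront to the same datapath and collect the results."""
    results = [0] * len(inputs)
    for wavefront in schedule_wavefronts(len(inputs), width):
        # In hardware all lanes of a wavefront would execute together;
        # this model iterates over them one by one.
        for item in wavefront:
            results[item] = datapath(inputs[item])
    return results


if __name__ == "__main__":
    print(schedule_wavefronts(10, 4))
    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
    print(run_on_datapath(lambda x: x * x, [1, 2, 3, 4, 5], width=2))
    # [1, 4, 9, 16, 25]
```

In this picture, the scheduler is a reusable component that is independent of the kernel, while the datapath is the part that would be generated per kernel.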

With this concept, FPGAs can utilize the single-source programming paradigm and all available source languages of the HSA Foundation. Moreover, the language front ends as well as the host and HSAIL compiler back ends are completely independent of the accelerator. As a consequence, they can be shared between CPU, GPU, DSP and FPGA. With this hybrid approach, FPGAs can directly leverage newly supported languages from any vendor. A greater selection of languages and accelerators is therefore available, which significantly improves developer productivity. For companies, this also reduces the total time needed to develop a toolchain for their heterogeneous system.