A Multithreaded VLIW Soft Processor Family
Format: Conference Proceeding
Language: English
Summary: Summary form only given. There is growing commercial interest in using FPGAs for compute acceleration. To ease the programming task for non-hardware-expert programmers, systems are emerging that can map high-level languages such as C and OpenCL to FPGAs, targeting compiler-generated circuits and soft processing engines. Soft processing engines such as CPUs are familiar to programmers, can be reprogrammed quickly without rebuilding the FPGA image, and by their general nature can support multiple software functions in a smaller area than the alternative of multiple per-function synthesized circuits. Finally, compelling processing engines can be incorporated into the output of high-level synthesis systems. For FPGA-based soft compute engines to be compelling they must be computationally dense: they must achieve high throughput per area. For simple CPUs with simple functional units (FUs) it is relatively straightforward to achieve good utilization, and it is not overly detrimental if a small, single-pipeline-stage FU such as an integer adder is under-utilized. In contrast, larger, more deeply pipelined, more numerous, and more varied FUs can be quite challenging to keep busy, even for an engine capable of extracting instruction-level parallelism (ILP) from an application. Hence a key challenge for FPGA-based compute engines is how to maximize compute density (throughput per area) by achieving high utilization of a datapath composed of multiple varying FUs of significant and varying pipeline depth. In this work, we propose a highly parameterizable template architecture for a multithreaded FPGA-based compute engine designed to keep varied, deeply pipelined FUs highly utilized. Our approach to achieving high utilization is to leverage (i) support for multiple thread contexts, (ii) thread-level and instruction-level parallelism, and (iii) static compiler analysis and scheduling.
We focus on deeply pipelined, IEEE-754 floating-point FUs of widely varying latency, executing both Hodgkin-Huxley neuron simulation and Black-Scholes options pricing models as example applications, compiled with our LLVM-based scheduler. Targeting a Stratix IV FPGA, we explore architectural tradeoffs by measuring area and throughput for designs with varying numbers of FUs, thread contexts (T), memory banks (B), and bank multi-porting. To determine the most efficient designs that would be suitable for replicating, we measure compute density (application throughput per unit of FPGA area), and repo…
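The abstract's central argument — that multiple thread contexts can keep a deeply pipelined FU busy where a single thread cannot — can be illustrated with a small model. The sketch below is not the paper's implementation; it is a hypothetical round-robin issue model with made-up parameters (`num_threads`, `pipeline_depth`, `ops_per_thread`), showing how issue-slot utilization rises with the number of thread contexts when each thread executes a dependent chain of operations.

```python
# Hypothetical model (not the paper's architecture): one pipelined FU,
# round-robin issue among T thread contexts. Each thread's operations
# form a dependency chain, so a thread may not issue its next op until
# its previous result emerges from the pipeline.

def utilization(num_threads, pipeline_depth, ops_per_thread=100):
    """Return the fraction of cycles the FU's issue slot is occupied."""
    ready_at = [0] * num_threads              # cycle when each thread may issue next
    remaining = [ops_per_thread] * num_threads
    cycle = 0
    issued = 0
    while any(r > 0 for r in remaining):
        # Round-robin: issue from the first ready thread with work left.
        for t in range(num_threads):
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                issued += 1
                ready_at[t] = cycle + pipeline_depth  # result available later
                break
        cycle += 1
    return issued / cycle

# A single thread stalls for the full pipeline latency between dependent
# ops; enough interleaved threads fill every issue slot.
single = utilization(1, 10)
many = utilization(10, 10)
```

With a 10-stage FU, a single dependent thread achieves roughly 10% utilization, while 10 interleaved contexts keep the issue slot busy every cycle — mirroring the abstract's rationale for combining thread-level parallelism with static scheduling rather than relying on ILP alone.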
DOI: 10.1109/FCCM.2013.36