Enabling an OpenCL Compiler for Embedded Multicore DSP Systems

Bibliographic Details
Main Authors: Jia-Jhe Li, Chi-Bang Kuan, Tung-Yu Wu, Jenq Kuen Lee
Format: Conference Proceeding
Language: English
Description
Summary: OpenCL is an industry effort to unify heterogeneous multicore programming. Its programming model defines SPMD kernels, vector types, and address space qualifiers, which let programmers exploit data parallelism through multicore processors and SIMD instructions, and data locality through the memory hierarchy. OpenCL has recently been adopted on many architectures, including multicore CPUs, GPUs, vector processors, embedded systems with application-specific processors, and even FPGAs. However, OpenCL support for embedded multicore DSP systems remains unaddressed. In this paper, we present our OpenCL support for such systems. Our target platform consists of one MPU and a DSP subsystem with multiple DSPs. The DSPs we address are VLIW processors with clustered functional units and distributed register files. To generate efficient code for such DSPs, a compiler must account for irregular register file access in many optimization phases. To utilize the distributed register files, we propose a cluster-aware work-item dispatching scheme that vectorizes OpenCL kernels and assigns independent workloads to the clusters of a DSP. We also incorporate several optimizations to enable efficient DSP code generation. In our experiments, we use a set of OpenCL benchmark programs to evaluate the effectiveness of our OpenCL support. The experiments are conducted on a cycle-accurate DSP simulator and a multicore evaluation board. We report an average 29% performance improvement with our vectorization scheme and a nearly 2-fold speedup with two DSPs compared with a single-MPU setup.
ISSN: 0190-3918, 2332-5690
DOI: 10.1109/ICPPW.2012.74
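
Illustrative note: the summary describes assigning independent work-items of an SPMD kernel to the clusters of a DSP. The plain C sketch below shows only the general idea of partitioning a one-dimensional index space into one contiguous range per cluster. It is not the authors' dispatching scheme. The names `NUM_CLUSTERS`, `item_range`, `cluster_range`, `vec_add_item`, and `dispatch_vec_add` are hypothetical, the two-cluster count is an assumption, and the sequential outer loop merely stands in for clusters that would run in parallel on real hardware.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CLUSTERS 2 /* assumption: two clusters per DSP */

/* Body of one work-item of a vector-add kernel: c[gid] = a[gid] + b[gid]. */
static void vec_add_item(const float *a, const float *b, float *c, size_t gid)
{
    c[gid] = a[gid] + b[gid];
}

/* Half-open range [begin, end) of work-item ids owned by one cluster. */
typedef struct {
    size_t begin;
    size_t end;
} item_range;

/* Split n_items work-items into NUM_CLUSTERS contiguous chunks.
 * The chunk size is rounded up, so the last cluster may get fewer items. */
item_range cluster_range(size_t n_items, size_t cluster)
{
    assert(cluster < NUM_CLUSTERS);
    size_t chunk = (n_items + NUM_CLUSTERS - 1) / NUM_CLUSTERS;
    size_t begin = cluster * chunk;
    size_t end = begin + chunk;
    if (begin > n_items) begin = n_items;
    if (end > n_items) end = n_items;
    return (item_range){ begin, end };
}

/* Run every work-item once. The ranges do not overlap, so the
 * per-cluster loops are independent of each other. */
void dispatch_vec_add(const float *a, const float *b, float *c, size_t n_items)
{
    for (size_t cl = 0; cl < NUM_CLUSTERS; ++cl) {
        item_range r = cluster_range(n_items, cl);
        for (size_t gid = r.begin; gid < r.end; ++gid)
            vec_add_item(a, b, c, gid);
    }
}
```

Calling `dispatch_vec_add(a, b, c, n)` on three arrays of length `n` fills `c` with the element-wise sums. Because each work-item writes only its own output element, the order in which the clusters run does not affect the result.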