Accelerating Real-Valued FFT on CPU-FPGA Platforms
Published in: | IEEE transactions on computer-aided design of integrated circuits and systems 2024-08, Vol.43 (8), p.2532-2536 |
---|---|
Main Authors: | , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | The real-valued fast Fourier transform (RFFT) is an ideal candidate for implementing a high-speed and low-power FFT processor because it requires only about half as many arithmetic operations as the traditional complex-valued FFT (CFFT). Although RFFT can be calculated using CFFT hardware, a dedicated RFFT implementation can reduce hardware complexity and power consumption while increasing throughput. However, unlike CFFT, RFFT has an irregular signal flow graph, which hinders the design of efficient pipelined architectures. In this article, utilizing open computing language (OpenCL), we propose a high-level programming method for implementing pipelined RFFT architectures on FPGAs. By identifying the regular computational pattern in the flow graph of RFFT, the proposed method essentially uses a for loop to implement the RFFT algorithm; high-level synthesis tools then fully unroll the loop to build pipelined architectures automatically. Experiments show that for a 4096-point RFFT, the proposed method achieves a 2.49× speedup and 3.09× better energy efficiency over cuFFT on GPU, and a 21.12× speedup and 16.09× better energy efficiency over FFTW on CPU. Compared to Intel's CFFT design on the same FPGA, the proposed design uses 12% fewer logic resources and 16% fewer DSP blocks while achieving a 1.48× speedup. |
---|---|
ISSN: | 0278-0070 1937-4151 |
DOI: | 10.1109/TCAD.2024.3377160 |
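
Illustrative note: the summary states that an RFFT needs roughly half the arithmetic of a CFFT and that it can be computed on CFFT hardware. A minimal sketch of that relationship follows, using the classic packing trick in which an N-point real transform is obtained from one N/2-point complex FFT. This is not the article's OpenCL pipelined architecture; the function names `cfft` and `rfft_via_half_cfft` are hypothetical and chosen only for this example, and the input length is assumed to be a power of two of at least 2.

```python
import cmath


def cfft(x):
    """Recursive radix-2 complex FFT (illustrative; len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = cfft(x[0::2])
    odd = cfft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size transforms with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def rfft_via_half_cfft(x):
    """N-point FFT of real input using one N/2-point complex FFT.

    Returns bins 0..N/2 only; the remaining bins follow from the
    conjugate symmetry X[N-k] = conj(X[k]) of a real-input transform.
    """
    n = len(x)
    h = n // 2
    # Pack even-indexed samples into the real part, odd-indexed into the imaginary part.
    z = [complex(x[2 * m], x[2 * m + 1]) for m in range(h)]
    zf = cfft(z)
    out = []
    for k in range(h + 1):
        zk = zf[k % h]
        zc = zf[(h - k) % h].conjugate()
        e = (zk + zc) / 2    # transform of the even-indexed samples
        o = (zk - zc) / 2j   # transform of the odd-indexed samples
        out.append(e + cmath.exp(-2j * cmath.pi * k / n) * o)
    return out


if __name__ == "__main__":
    samples = [0.5, -1.0, 2.0, 3.5, 0.0, 1.25, -2.0, 4.0]
    for k, value in enumerate(rfft_via_half_cfft(samples)):
        print(k, value)
```

The half-size complex transform does most of the work, which is one way to see where the roughly twofold saving comes from; a dedicated RFFT flow graph, as discussed in the summary, removes the redundancy inside the butterflies themselves rather than by packing.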