High-Level Design of Precision-Scalable DNN Accelerators Based on Sum-Together Multipliers

Bibliographic Details
Published in: IEEE Access, 2024, Vol. 12, p. 44163-44189
Main Authors: Urbinati, Luca; Casu, Mario R.
Format: Article
Language: English
Description
Summary: Precision-scalable (PS) multipliers are gaining traction in Deep Neural Network accelerators, particularly for enabling mixed-precision (MP) quantization in Deep Learning at the edge. This paper focuses on the Sum-Together (ST) class of PS multipliers, which are subword-parallel multipliers that can execute a standard multiplication at full precision or a dot-product with parallel low-precision operands. Our contributions in this area encompass multiple aspects: we enrich our previous comparison of state-of-the-art (SoA) ST multipliers by including our recent radix-4 Booth ST multiplier and two novel designs; we extend the explanation of the architecture and the design flow of our previously proposed ST-based PS hardware accelerators for 2D-Convolution, Depth-wise Convolution, and Fully-Connected layers, which we developed using High-Level Synthesis (HLS); we implement the uniform integer quantization equations in hardware; we conduct a broad HLS-driven design space exploration of our ST-based accelerators, varying numerous hardware parameters; finally, we showcase the advantages of ST-based accelerators when integrated into Systems-on-Chip (SoCs) in three different scenarios (low-area, low-power, and low-latency), running inference on MP-quantized MLPerf Tiny models as a case study. Across the three scenarios, the results show an average latency speedup of 1.46x, 1.33x, and 1.29x, a reduced energy consumption in most cases, and a marginal area overhead of 0.9%, 2.5%, and 8.0%, respectively, compared to SoCs with accelerators based on fixed-precision 16-bit multipliers. To sum up, our work provides a comprehensive understanding of ST-based accelerators' performance in an SoC context, paving the way for future enhancements and the solution of identified inefficiencies.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3380472
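
Illustrative note: the summary describes a Sum-Together multiplier as a unit that performs either one full-precision multiplication or a dot-product over parallel low-precision operands. The Python sketch below is a purely functional model of that behavior, not the authors' hardware design. The 16-bit container width, the 8-bit and 4-bit subword splits, the signed two's-complement encoding, and the names `pack_subwords`, `split_subwords`, and `st_multiply` are assumptions made for this example.

```python
def pack_subwords(values, sub_bits):
    """Pack signed integers into one unsigned word, least-significant subword first."""
    mask = (1 << sub_bits) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & mask) << (i * sub_bits)
    return word


def split_subwords(word, total_bits, sub_bits):
    """Split an unsigned word into signed two's-complement subwords."""
    if total_bits % sub_bits != 0:
        raise ValueError("sub_bits must divide total_bits")
    mask = (1 << sub_bits) - 1
    sign_bit = 1 << (sub_bits - 1)
    subwords = []
    for i in range(total_bits // sub_bits):
        field = (word >> (i * sub_bits)) & mask
        # Reinterpret the raw field as a signed value.
        subwords.append(field - (1 << sub_bits) if field & sign_bit else field)
    return subwords


def st_multiply(a, b, total_bits=16, sub_bits=16):
    """Functional model of a Sum-Together multiply (assumed widths).

    With sub_bits == total_bits this is one ordinary signed multiplication;
    with smaller sub_bits the subword products are summed into a dot-product.
    """
    xs = split_subwords(a, total_bits, sub_bits)
    ys = split_subwords(b, total_bits, sub_bits)
    return sum(x * y for x, y in zip(xs, ys))


# Full precision: one 16-bit x 16-bit product.
full = st_multiply(pack_subwords([-3], 16), pack_subwords([7], 16), sub_bits=16)

# Two 8-bit lanes: 2*5 + (-3)*4.
dot8 = st_multiply(pack_subwords([2, -3], 8), pack_subwords([5, 4], 8), sub_bits=8)

# Four 4-bit lanes: 1*5 + (-2)*6 + 3*7 + (-4)*(-8).
dot4 = st_multiply(pack_subwords([1, -2, 3, -4], 4),
                   pack_subwords([5, 6, 7, -8], 4), sub_bits=4)
```

In this model the same pair of container words yields one product or a sum of lane products depending only on `sub_bits`, which is the mode switch the summary attributes to ST multipliers.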