Loading…

HTQ: Exploring the High-Dimensional Trade-Off of mixed-precision quantization

Mixed-precision quantization, where more sensitive layers are kept at higher precision, can achieve the trade-off between accuracy and complexity of neural networks. However, the search space for mixed-precision grows exponentially with the number of layers, making the brute force approach infeasibl...

Full description

Saved in:
Bibliographic Details
Published in:Pattern recognition 2024-12, Vol.156, p.110788, Article 110788
Main Authors: Li, Zhikai, Long, Xianlei, Xiao, Junrui, Gu, Qingyi
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Mixed-precision quantization, where more sensitive layers are kept at higher precision, can achieve the trade-off between accuracy and complexity of neural networks. However, the search space for mixed-precision grows exponentially with the number of layers, making the brute force approach infeasible on deep networks. To reduce this exponential search space, recent efforts use Pareto frontier or integer linear programming to select the bit-precision of each layer. Unfortunately, we find that these prior works rely on a single constraint. In practice, model complexity includes space complexity and time complexity, and the two are weakly correlated, thus using simply one as a constraint leads to sub-optimal results. Besides this, they require manually set constraints, making them only pseudo-automatic. To address the above issues, we propose High-dimensional Trade-off Quantization (HTQ), which automatically determines the bit-precision in the high-dimensional space of model accuracy, space complexity, and time complexity without any manual intervention. Specifically, we use the saliency criterion based on connection sensitivity to indicate the accuracy perturbation after quantization, which performs similarly to Hessian information but can be calculated quickly (more than 1000× speedup). The bit-precision is then automatically selected according to the three-dimensional (3D) Pareto frontier of the total perturbation, model size, and bit operations (BOPs) without manual constraints. Moreover, HTQ allows for the joint optimization of weights and activations, and thus the bit-precisions of both can be computed concurrently. Compared to state-of-the-art methods, HTQ achieves higher accuracy and lower space/time complexity on various model architectures for image classification and object detection tasks. Code is available at: https://github.com/zkkli/HTQ. •A saliency criterion based on connection sensitivity is proposed, enabling over 1000× computational speedup.•3D Pareto frontier is developed to automatically select the bit-precision in the high-dimensional space.•Weights and activations are jointly optimized, allowing the bit-precisions of both to be computed concurrently.•HTQ achieves a better trade-off between model accuracy and complexity compared to the previous methods.
ISSN:0031-3203
DOI:10.1016/j.patcog.2024.110788