Run-Time Efficient RNN Compression for Inference on Edge Devices

Bibliographic Details
Published in: arXiv.org 2020-08
Main Authors: Thakker, Urmish, Beu, Jesse, Gope, Dibakar, Dasika, Ganesh, Mattina, Matthew
Format: Article
Language: English
Description
Summary: Recurrent neural networks can be large and compute-intensive, yet many applications that benefit from RNNs run on small devices with very limited compute and storage capabilities while still having run-time constraints. As a result, there is a need for compression techniques that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper explores a new compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) that achieves this dual objective. This scheme divides the weight matrix into two parts: an unconstrained upper half and a lower half composed of rank-1 blocks. This results in output features where the upper sub-vector has "richer" features while the lower sub-vector has "constrained" features. HMD can compress RNNs by a factor of 2-4x while having a faster run-time than pruning (Zhu & Gupta, 2017) and retaining more model accuracy than matrix factorization (Grachev et al., 2017). We evaluate this technique on 5 benchmarks spanning 3 different applications, illustrating its generality in the domain of edge computing.
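To make the weight-matrix split concrete, the following minimal NumPy sketch performs a matrix-vector product with an HMD-style weight matrix: the upper half is kept as an ordinary dense block, while the lower half is stored as rank-1 blocks, each represented by a pair of factor vectors. The function name `hmd_matvec`, the block shapes, and the exact partitioning are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def hmd_matvec(W_upper, u_blocks, v_blocks, x):
    """Matrix-vector product with a hypothetical HMD-style weight matrix.

    The upper half of the matrix is dense (unconstrained); the lower half
    is a stack of rank-1 blocks, each stored as factor vectors u_i, v_i
    so that the block equals the outer product u_i v_i^T.
    """
    # Upper sub-vector: ordinary dense projection ("richer" features).
    y_upper = W_upper @ x

    # Lower sub-vector: each rank-1 block contributes (v_i . x) * u_i
    # ("constrained" features), so only the factor vectors are needed.
    y_lower = np.concatenate([u * (v @ x) for u, v in zip(u_blocks, v_blocks)])

    return np.concatenate([y_upper, y_lower])


# Toy usage (shapes are assumptions): an 8x8 weight matrix replaced by a
# dense 4x8 upper half plus two rank-1 blocks of size 2x8 in the lower half.
rng = np.random.default_rng(0)
W_upper = rng.standard_normal((4, 8))
u_blocks = [rng.standard_normal(2) for _ in range(2)]
v_blocks = [rng.standard_normal(8) for _ in range(2)]
x = rng.standard_normal(8)
y = hmd_matvec(W_upper, u_blocks, v_blocks, x)
print(y.shape)  # (8,)
```

Storing only the factor vectors is where the compression and the run-time benefit come from: a b-by-n rank-1 block needs b + n values and a correspondingly cheaper dense multiply, instead of b * n values for an unconstrained block.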
ISSN: 2331-8422
DOI: 10.48550/arxiv.1906.04886