Run-Time Efficient RNN Compression for Inference on Edge Devices

Bibliographic Details
Published in: arXiv.org 2020-08
Main Authors: Thakker, Urmish, Beu, Jesse, Gope, Dibakar, Dasika, Ganesh, Mattina, Matthew
Format: Article
Language: English
Description
Summary: Recurrent neural networks can be large and compute-intensive, yet many applications that benefit from RNNs run on small devices with very limited compute and storage capabilities while still having run-time constraints. As a result, there is a need for compression techniques that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper explores a new compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) that achieves this dual objective. This scheme divides the weight matrix into two parts: an unconstrained upper half and a lower half composed of rank-1 blocks. This results in output features where the upper sub-vector has "richer" features while the lower sub-vector has "constrained" features. HMD can compress RNNs by a factor of 2-4x while having a faster run-time than pruning (Zhu & Gupta, 2017) and retaining more model accuracy than matrix factorization (Grachev et al., 2017). We evaluate this technique on 5 benchmarks spanning 3 different applications, illustrating its generality in the domain of edge computing.
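To make the weight-matrix split concrete, the following minimal NumPy sketch performs a matrix-vector product with an HMD-style weight matrix: the upper half is kept as an ordinary dense block, while the lower half is stored as rank-1 blocks, each represented by a pair of factor vectors. The function name `hmd_matvec`, the block shapes, and the exact partitioning are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def hmd_matvec(W_upper, u_blocks, v_blocks, x):
    """Matrix-vector product with a hypothetical HMD-style weight matrix.

    The upper half of the matrix is dense (unconstrained); the lower half
    is a stack of rank-1 blocks, each stored as factor vectors u_i, v_i
    so that the block equals the outer product u_i v_i^T.
    """
    # Upper sub-vector: ordinary dense projection ("richer" features).
    y_upper = W_upper @ x

    # Lower sub-vector: each rank-1 block contributes (v_i . x) * u_i
    # ("constrained" features), so only the factor vectors are needed.
    y_lower = np.concatenate([u * (v @ x) for u, v in zip(u_blocks, v_blocks)])

    return np.concatenate([y_upper, y_lower])


# Toy usage (shapes are assumptions): an 8x8 weight matrix replaced by a
# dense 4x8 upper half plus two rank-1 blocks of size 2x8 in the lower half.
rng = np.random.default_rng(0)
W_upper = rng.standard_normal((4, 8))
u_blocks = [rng.standard_normal(2) for _ in range(2)]
v_blocks = [rng.standard_normal(8) for _ in range(2)]
x = rng.standard_normal(8)
y = hmd_matvec(W_upper, u_blocks, v_blocks, x)
print(y.shape)  # (8,)
```

Storing only the factor vectors is where the compression and the run-time benefit come from: a b-by-n rank-1 block needs b + n values and a correspondingly cheaper dense multiply, instead of b * n values for an unconstrained block.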
ISSN: 2331-8422
DOI: 10.48550/arxiv.1906.04886