
Evaluating FFT-based algorithms for strided convolutions on ARMv8 architectures

Bibliographic Details
Published in: Performance Evaluation, 2021-12, Vol. 152, p. 102248, Article 102248
Main Authors: Huang, Xiandong, Wang, Qinglin, Lu, Shuyu, Hao, Ruochen, Mei, Songzhu, Liu, Jie
Format: Article
Language: English
Description
Summary: Convolutional Neural Networks (CNNs) have been widely adopted in many kinds of artificial intelligence applications. Most of the computational overhead of CNNs is spent on convolutions. An effective approach to reducing this overhead is transforming convolutions in the time domain into multiplications in the frequency domain by means of Fast Fourier Transform (FFT) algorithms, known as FFT-based fast algorithms for convolutions. However, current FFT-based fast implementations only work for unit-strided convolutions (stride of 1) and cannot be directly applied to strided convolutions with stride greater than 1, which are commonly used as the first layer of CNNs and as an effective alternative to pooling layers for downsampling. In this paper, we first introduce rearrangement- and sampling-based methods for applying FFT-based fast algorithms to strided convolutions, and compare in detail the arithmetic complexities of these two methods and the direct method. Then, we present highly optimized parallel implementations of the two methods on an ARMv8-based many-core CPU. Lastly, we benchmark these implementations against two GEMM-based implementations on the same ARMv8 CPU. Our experimental results with convolutions of different kernels, feature maps, and batch sizes show that the rearrangement-based method generally outperforms the sampling-based one under the same optimizations, and that both methods achieve much better performance than the GEMM-based ones when the kernels, feature maps, and batch sizes are large. Experimental results on the convolutional layers of popular CNNs further support these conclusions.
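The two approaches named in the abstract can be illustrated in one dimension. This is a minimal NumPy sketch under my own assumptions (function names and the 1-D setting are mine; the paper targets 2-D convolutional layers on ARMv8): the sampling-based method computes the full unit-stride result with the FFT algorithm and then keeps every stride-th output, while the rearrangement-based method splits the input and kernel into stride polyphase components so the strided convolution becomes a sum of smaller unit-stride convolutions.

```python
import numpy as np

def fft_conv1d(x, k):
    """Unit-stride linear convolution via FFT: pointwise product
    in the frequency domain (the classic FFT-based fast algorithm)."""
    n = len(x) + len(k) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

def strided_conv_direct(x, k, stride):
    """Reference: direct strided convolution (kernel flipped,
    window advanced by `stride` between outputs)."""
    kr = k[::-1]
    out_len = (len(x) - len(k)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(k)], kr)
                     for i in range(out_len)])

def strided_conv_sampling(x, k, stride):
    """Sampling-based method: run the unit-stride FFT algorithm on
    the whole input, then discard all but every stride-th output."""
    full = fft_conv1d(x, k)
    valid = full[len(k) - 1 : len(x)]   # 'valid' part of the full conv
    return valid[::stride]

def strided_conv_rearrange(x, k, stride):
    """Rearrangement-based method: decompose input and kernel into
    `stride` polyphase components; each pair needs only a unit-stride
    convolution, and the partial results are summed."""
    kr = k[::-1]
    out_len = (len(x) - len(k)) // stride + 1
    y = np.zeros(out_len)
    for r in range(stride):
        xr = x[r::stride]       # r-th polyphase component of the input
        kr_r = kr[r::stride]    # matching component of the kernel
        if len(kr_r) == 0:
            continue
        # valid correlation of xr with kr_r, done via the FFT algorithm
        full = fft_conv1d(xr, kr_r[::-1])
        y += full[len(kr_r) - 1 : len(xr)][:out_len]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
k = rng.standard_normal(5)
assert np.allclose(strided_conv_sampling(x, k, 2), strided_conv_direct(x, k, 2))
assert np.allclose(strided_conv_rearrange(x, k, 2), strided_conv_direct(x, k, 2))
```

The sketch also hints at the complexity trade-off the paper analyzes: sampling computes (and then throws away) the full unit-stride output, whereas rearrangement only ever transforms the shorter polyphase sequences.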
ISSN: 0166-5316; 1872-745X
DOI: 10.1016/j.peva.2021.102248