Loading…

PrSpMV: An Efficient Predictable Kernel for SpMV

Sparse Matrix-Vector Multiplication (SpMV) has been widely applied in scientific computation, industry simulation, and intelligent computation domains, which is the critical algorithm in all these applications. Due to the poor data locality, low cache usage, and extremely irregular branch patterns c...

Full description

Saved in:

Bibliographic Details
Main Authors:	Fu, Gelin, Xia, Tian, Qu, Shaoru, Luo, Zhongpei, Li, Shuyu, Cheng, Pengyu, Guo, Runfan, Ding, Yitong, Ren, Pengju
Format:	Conference Proceeding
Language:	English
Subjects:	Branch Prediction Data Prefetching Sparse Matrix Computation
Online Access:	Request full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	Sparse Matrix-Vector Multiplication (SpMV) has been widely applied in scientific computation, industry simulation, and intelligent computation domains, which is the critical algorithm in all these applications. Due to the poor data locality, low cache usage, and extremely irregular branch patterns caused by the highly sparse and random distributions, SpMV optimization has become one of the most challenging problems for modern high-performance processors. In this paper, we study the bottlenecks of SpMV on current out-of-order CPUs and propose a novel SpMV kernel named PrSpMV to improve its performance by pursuing high predictability. Specifically, we improve the memory access regularity and locality by creating serialized access patterns so that the data prefetching efficiency and cache usage are optimized. We also improve pipeline efficiency by creating regular branch patterns to make branch prediction more accurate. Experiment results show that using the above optimization approaches, PrSpMV can eliminate nearly all branch mispredictions. Moreover, it can also significantly reduce the average L2 cache miss rate from 57% to 20% via efficiently leveraging hardware prefetchers. By using PrSpMV, stride prefetcher can be boosted with 1.31× speedup and dedicated irregular prefetcher can be improved with 1.40× speedup. Meanwhile, on commercial high-end Intel processors, it achieves 1.32× speedup against some state-of-the-art SpMV kernels.
ISSN:	2576-6996
DOI:	10.1109/ICCD58817.2023.00075