Loading…

GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication

Summary Many high performance computing applications require computing both sparse matrix‐vector product (SMVP) and sparse matrix‐transpose vector product (SMTVP) for better overall performance. Under such a circumstance, it is critical to maintain a similarly high throughput for these two computing...

Full description

Saved in:
Bibliographic Details
Published in:Concurrency and computation 2015-09, Vol.27 (14), p.3771-3789
Main Authors: Tao, Yuan, Deng, Yangdong, Mu, Shuai, Zhang, Zhenzhong, Zhu, Mingfa, Xiao, Limin, Ruan, Li
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Summary Many high performance computing applications require computing both sparse matrix‐vector product (SMVP) and sparse matrix‐transpose vector product (SMTVP) for better overall performance. Under such a circumstance, it is critical to maintain a similarly high throughput for these two computing patterns with the underlying sparse matrix encoded in a single storage format. The compressed sparse block (CSB) format proposed by Buluç et al. allows computing both problems on multi‐core CPUs with nearly identical throughputs. On the other hand, a direct porting of CSB to graphics processing units (GPUs), which have been recently recognized as a powerful general purpose computing platform, turns out to be inefficient. In this work, we propose a new data structure, designated as expanded CSB (eCSB), to minimize the throughput gap between SMVP and SMTVP computations on GPUs, while at the same time enable a high computing throughput. We also use a hybrid storage format to store elements in each block, which can be selected dynamically at runtime. Experimental results show that the proposed techniques implemented on a Kepler GPU delivers similar throughput on both SMVP and SMTVP and the throughput is up to 13 times faster than that of the CPU‐based CSB implementation. In addition, our eCSB procedure outperforms the previous GPU results by up to 188% and 914% in computing SMVP and SMTVP, and we validate the effectiveness of eCSB by means of wall‐clock time of bi‐conjugate gradient algorithm; our eCSB is 25% faster than Compressed Sparse Rows (CSR) and 6% faster than HYB, respectively. Copyright © 2014 John Wiley & Sons, Ltd.
ISSN:1532-0626
1532-0634
DOI:10.1002/cpe.3415