A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility

Bibliographic Details
Main Authors:	Li, Jialin; Ye, Huang; Tian, Shaobo; Li, Xinyuan; Zhang, Jian
Format:	Conference Proceeding
Language:	English
Subjects:	AMD GCN Architecture; DGEMM; Graphics processing units; High performance computing; Libraries; Mathematical models; Parallel processing; Performance gain; Prefetching; Register; TLP; Workgroup Parallelism

Description
Summary:	General Matrix Multiplication (GEMM) is one of the fundamental kernels for scientific and high-performance computing. When optimizing the performance of GEMM on GPU, the matrix is usually partitioned into a hierarchy of tiles to fit the thread hierarchy. In practice, thread-level parallelism is affected not only by the tiling scheme but also by the resources that each tile consumes, such as registers and local data share memory. This paper presents a fine-grained prefetching scheme that improves thread-level parallelism by balancing the usage of these resources. The gains and losses in instruction-level and thread-level parallelism are analyzed, and a mathematical model is developed to estimate the overall performance gain. Moreover, the proposed scheme is integrated into the open-source tool Tensile to automatically generate assembly and tune a collection of kernels to maximize DGEMM performance for a family of problem sizes. Experiments show a speedup of about 1.10x over a wide range of matrix sizes for both single and batched matrix-matrix multiplication.
ISSN:	1530-2075
DOI:	10.1109/IPDPS53621.2022.00089