A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility

Bibliographic Details
Main Authors: Li, Jialin, Ye, Huang, Tian, Shaobo, Li, Xinyuan, Zhang, Jian
Format: Conference Proceeding
Language: English
Subjects:
Online Access: Request full text
container_end_page 874
container_start_page 863
creator Li, Jialin
Ye, Huang
Tian, Shaobo
Li, Xinyuan
Zhang, Jian
description General Matrix Multiplication (GEMM) is one of the fundamental kernels for scientific and high-performance computing. When optimizing the performance of GEMM on GPU, the matrix is usually partitioned into a hierarchy of tiles to fit the thread hierarchy. In practice, thread-level parallelism is affected not only by the tiling scheme but also by the resources that each tile consumes, such as registers and local data share (LDS) memory. This paper presents a fine-grained prefetching scheme that improves thread-level parallelism by balancing the usage of such resources. The gains and losses in instruction- and thread-level parallelism are analyzed, and a mathematical model is developed to estimate the overall performance gain. Moreover, the proposed scheme is integrated into the open-source tool Tensile to automatically generate assembly and tune a collection of kernels to maximize the performance of DGEMM for a family of problem sizes. Experiments show about 1.10X performance speedup on a wide range of matrix sizes for both single and batched matrix-matrix multiplication.
doi_str_mv 10.1109/IPDPS53621.2022.00089
format conference_proceeding
publisher IEEE
eisbn 9781665481069, 1665481064
coden IEEPAD
fulltext fulltext_linktorsrc
identifier EISSN: 1530-2075
ispartof 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2022, p.863-874
issn 1530-2075
language eng
recordid cdi_ieee_primary_9820693
source IEEE Xplore All Conference Series
subjects AMD GCN Architecture
DGEMM
Graphics processing units
High performance computing
Libraries
Mathematical models
Parallel processing
Performance gain
Prefetching
Register
TLP
Workgroup Parallelism
title A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility
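The tiling-plus-prefetching idea summarized in the description can be sketched in plain Python: while one pair of tiles is being consumed by the math, the next pair along the k dimension is staged into a spare buffer (double buffering), mimicking how the paper's scheme overlaps memory loads with computation. This is only an illustrative model; the actual scheme operates on registers and LDS in hand-tuned GPU assembly, and all names here (`tiled_matmul_prefetch`, `load_tile`, the tile size `T`) are hypothetical.

```python
def tiled_matmul_prefetch(A, B, T):
    """Tiled C = A @ B with double-buffered tile loads along k.

    Illustrative sketch of software prefetching: the tiles for
    iteration k+T are staged into the spare buffer before the current
    tiles are consumed, modeling the overlap of global-memory loads
    with math that the paper's fine-grained scheme exploits on GPU.
    """
    n = len(A)
    assert n % T == 0, "matrix size must be a multiple of the tile size"
    C = [[0.0] * n for _ in range(n)]

    def load_tile(M, r, c):
        # Copy a T x T tile of M starting at (r, c) into a local buffer,
        # standing in for a global-memory -> LDS/register transfer.
        return [row[c:c + T] for row in M[r:r + T]]

    for i in range(0, n, T):
        for j in range(0, n, T):
            # Prefetch the first pair of tiles for this output block.
            buf = [(load_tile(A, i, 0), load_tile(B, 0, j)), None]
            cur = 0
            for k in range(0, n, T):
                if k + T < n:
                    # Stage the next tiles before consuming the current ones.
                    buf[1 - cur] = (load_tile(A, i, k + T),
                                    load_tile(B, k + T, j))
                a, b = buf[cur]
                for ii in range(T):
                    for jj in range(T):
                        C[i + ii][j + jj] += sum(a[ii][kk] * b[kk][jj]
                                                 for kk in range(T))
                cur = 1 - cur  # swap buffers
    return C
```

On a real GPU, the spare buffer costs extra registers and LDS, which is exactly the resource-vs-parallelism trade-off the paper's model balances.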