A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility
Main Authors: Li, Jialin; Ye, Huang; Tian, Shaobo; Li, Xinyuan; Zhang, Jian
Format: Conference Proceeding
Language: English
container_start_page | 863 |
container_end_page | 874 |
creator | Li, Jialin; Ye, Huang; Tian, Shaobo; Li, Xinyuan; Zhang, Jian |
description | General Matrix Multiplication (GEMM) is one of the fundamental kernels for scientific and high-performance computing. When optimizing the performance of GEMM on GPU, the matrix is usually partitioned into a hierarchy of tiles to fit the thread hierarchy. In practice, the thread-level parallelism is affected not only by the tiling scheme but also by the resources that each tile consumes, such as registers and local data share (LDS) memory. This paper presents a fine-grained prefetching scheme that improves the thread-level parallelism by balancing the usage of such resources. The gains and losses in instruction- and thread-level parallelism are analyzed, and a mathematical model is developed to estimate the overall performance gain. Moreover, the proposed scheme is integrated into the open-source tool Tensile to automatically generate assembly and tune a collection of kernels to maximize the performance of DGEMM for a family of problem sizes. Experiments show about a 1.10X performance speedup on a wide range of matrix sizes for both single and batched matrix-matrix multiplication. |
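The abstract describes staging the next tile of data while the current tile is being consumed, so that memory traffic overlaps with computation. The paper's scheme operates at the assembly level on AMD GCN hardware; the following is only a minimal NumPy sketch of the underlying double-buffered tiling idea (function and parameter names are illustrative, and the k dimension is assumed divisible by the tile width):

```python
import numpy as np

def dgemm_prefetch(A, B, tile=4):
    """Double-buffered tiled matmul sketch: while the current k-tile is
    consumed, the next k-tile is already staged into the spare buffer.
    Assumes square matrices with the k dimension divisible by `tile`."""
    n = A.shape[0]
    C = np.zeros((n, n))
    # Stage the first k-tile of A and B (the "prefetch" buffer).
    cur = (A[:, :tile].copy(), B[:tile, :].copy())
    for k0 in range(0, n, tile):
        nxt = None
        if k0 + tile < n:
            # Prefetch the next k-tile before computing on the current one;
            # on a GPU this staging would overlap with the math below.
            nxt = (A[:, k0 + tile:k0 + 2 * tile].copy(),
                   B[k0 + tile:k0 + 2 * tile, :].copy())
        # Rank-`tile` update using the already-staged current buffers.
        C += cur[0] @ cur[1]
        cur = nxt
    return C
```

The trade-off the paper analyzes lives in the buffer sizes: a larger staging buffer hides more memory latency but consumes more registers/LDS per tile, which in turn limits how many workgroups can run concurrently.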
doi_str_mv | 10.1109/IPDPS53621.2022.00089 |
format | conference_proceeding |
identifier | EISSN: 1530-2075 |
ispartof | 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2022, p.863-874 |
issn | 1530-2075 |
language | eng |
recordid | cdi_ieee_primary_9820693 |
source | IEEE Xplore All Conference Series |
subjects | AMD GCN Architecture; DGEMM; Graphics processing units; High performance computing; Libraries; Mathematical models; Parallel processing; Performance gain; Prefetching; Register; TLP; Workgroup Parallelism |
title | A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility |