A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility
Main Authors: Li, Jialin; Ye, Huang; Tian, Shaobo; Li, Xinyuan; Zhang, Jian
Format: Conference Proceeding
Language: English
container_start_page | 863 |
container_end_page | 874 |
creator | Li, Jialin; Ye, Huang; Tian, Shaobo; Li, Xinyuan; Zhang, Jian |
description | General Matrix Multiplication (GEMM) is one of the fundamental kernels for scientific and high-performance computing. When optimizing the performance of GEMM on GPU, the matrix is usually partitioned into a hierarchy of tiles to fit the thread hierarchy. In practice, the thread-level parallelism is affected not only by the tiling scheme but also by the resources that each tile consumes, such as registers and local data share (LDS) memory. This paper presents a fine-grained prefetching scheme that improves the thread-level parallelism by balancing the usage of such resources. The gains and losses in instruction- and thread-level parallelism are analyzed, and a mathematical model is developed to estimate the overall performance gain. Moreover, the proposed scheme is integrated into the open-source tool Tensile to automatically generate assembly and tune a collection of kernels to maximize the performance of DGEMM for a family of problem sizes. Experiments show about a 1.10X performance speedup on a wide range of matrix sizes for both single and batched matrix-matrix multiplication. |
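The abstract describes staging the next tile of data while the current tile is being consumed, so that memory traffic overlaps with computation. The paper's scheme operates at the assembly level on AMD GCN hardware; the following is only a minimal NumPy sketch of the underlying double-buffered tiling idea (function and parameter names are illustrative, and the k dimension is assumed divisible by the tile width):

```python
import numpy as np

def dgemm_prefetch(A, B, tile=4):
    """Double-buffered tiled matmul sketch: while the current k-tile is
    consumed, the next k-tile is already staged into the spare buffer.
    Assumes square matrices with the k dimension divisible by `tile`."""
    n = A.shape[0]
    C = np.zeros((n, n))
    # Stage the first k-tile of A and B (the "prefetch" buffer).
    cur = (A[:, :tile].copy(), B[:tile, :].copy())
    for k0 in range(0, n, tile):
        nxt = None
        if k0 + tile < n:
            # Prefetch the next k-tile before computing on the current one;
            # on a GPU this staging would overlap with the math below.
            nxt = (A[:, k0 + tile:k0 + 2 * tile].copy(),
                   B[k0 + tile:k0 + 2 * tile, :].copy())
        # Rank-`tile` update using the already-staged current buffers.
        C += cur[0] @ cur[1]
        cur = nxt
    return C
```

The trade-off the paper analyzes lives in the buffer sizes: a larger staging buffer hides more memory latency but consumes more registers/LDS per tile, which in turn limits how many workgroups can run concurrently.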
doi_str_mv | 10.1109/IPDPS53621.2022.00089 |
format | conference_proceeding |
identifier | EISSN: 1530-2075 |
ispartof | 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2022, p.863-874 |
issn | 1530-2075 |
language | eng |
recordid | cdi_ieee_primary_9820693 |
source | IEEE Xplore All Conference Series |
subjects | AMD GCN Architecture; DGEMM; Graphics processing units; High performance computing; Libraries; Mathematical models; Parallel processing; Performance gain; Prefetching; Register; TLP; Workgroup Parallelism |
title | A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility |