BB-ML: Basic Block Performance Prediction using Machine Learning Techniques
Main Authors: | |
---|---|
Format: | Conference Proceeding |
Language: | English |
Summary: | Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the Basic Block (BB) level; basic blocks are single-entry, single-exit code blocks that compilers use to break a large program into manageable pieces for analysis. Utilizing ML and BB analysis together can enable scalable hardware-software co-design beyond the current state of the art. In this work, we extrapolate the basic block execution counts of GPU applications and use them to predict performance for large input sizes from the counts observed at smaller input sizes. We trained a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieved an accuracy of 93.5% for extrapolating the basic block counts for large input sets when the model is trained using smaller input sets. Additionally, the model shows an accuracy of 97.7% for predicting basic block counts on random instances. In a significant case study, we applied the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications, spanning linear algebra to machine learning benchmarks. We employed a diverse set of metrics for evaluation, including global memory requests, tensor cores' active cycles, and the active cycles of ALU and FMA units. The results from the case study demonstrate that the model is capable of predicting the performance of large datasets with high accuracy. For example, the average error rates for global and shared memory requests are 0.85% and 0.17%, respectively.
Furthermore, to address the utilization of the main functional units in Ampere-architecture GPUs, we calculated the active cycles for units such as the tensor cores, ALU, FMA, and FP64 units. Our predictions for the active cycles show an average error of 2.3% for the ALU and 10.66% for the FMA units, while the maximum observed error across all tested applications and units reaches 18.5%. |
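The abstract does not describe the PNN's architecture or training procedure, so the following is only a minimal, hypothetical sketch of the underlying idea: model a basic block's execution count as a Poisson random variable whose rate is a learned function of the input size, fit it on small inputs, and extrapolate to a large input. Here a simple log-linear Poisson regression (fit by standard IRLS) stands in for the paper's neural network; the polynomial count-growth assumption and all parameter names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data at SMALL input sizes n: a basic block whose
# execution count grows polynomially, count ~ Poisson(exp(a*log n + b)).
# a_true, b_true are made up for this illustration.
a_true, b_true = 2.0, 0.5
n_train = rng.integers(4, 64, size=200)          # small input sizes
x = np.log(n_train.astype(float))
y = rng.poisson(np.exp(a_true * x + b_true))     # observed block counts

# Fit the log-linear Poisson model with iteratively reweighted least
# squares (the standard GLM fit), starting from a least-squares guess.
X = np.column_stack([x, np.ones_like(x)])
beta, *_ = np.linalg.lstsq(X, np.log(y + 1.0), rcond=None)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)                             # current rate estimate
    z = eta + (y - mu) / mu                      # working response
    XtWX = X.T @ (X * mu[:, None])
    XtWz = X.T @ (mu * z)
    beta = np.linalg.solve(XtWX, XtWz)

# Extrapolate the block's execution count to a much larger input size.
n_big = 4096
pred = np.exp(beta[0] * np.log(n_big) + beta[1])
print(beta, pred)
```

Because the counts at even modest input sizes are large, the Poisson noise is small relative to the mean, which is why extrapolation from small inputs can be accurate; the paper's PNN plays the same role for the multi-dimensional input spaces of real GPU benchmarks.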
---|---|
ISSN: | 2690-5965 |
DOI: | 10.1109/ICPADS60453.2023.00270 |