BB-ML: Basic Block Performance Prediction using Machine Learning Techniques
Main Authors: | |
---|---|
Format: | Conference Proceeding |
Language: | English |
Summary: | Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the Basic Block (BB) level; basic blocks are single-entry, single-exit code blocks that compilers use to break a large program into manageable pieces for analysis. Utilizing ML and BB analysis together can enable scalable hardware-software co-design beyond the current state of the art. In this work, we extrapolate the basic block execution counts of GPU applications and use them to predict performance for large input sizes from the counts observed at smaller input sizes. We trained a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieved an accuracy of 93.5% for extrapolating the basic block counts for large input sets when the model is trained using smaller input sets. Additionally, the model shows an accuracy of 97.7% for predicting basic block counts on random instances. In a significant case study, we applied the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications, spanning linear algebra to machine learning benchmarks. We employed a diverse set of metrics for evaluation, including global memory requests, tensor cores' active cycles, and the active cycles of ALU and FMA units. The results from the case study demonstrate that the model is capable of predicting the performance of large datasets with high accuracy. For example, the average error rates for global and shared memory requests are 0.85% and 0.17%, respectively.
Furthermore, to address the utilization of the main functional units in Ampere-architecture GPUs, we calculated the active cycles for units such as the tensor cores, ALU, FMA, and FP64 units. Our predictions for the active cycles show an average error of 2.3% for the ALU and 10.66% for the FMA units, while the maximum observed error across all tested applications and units reaches 18.5%. |
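The abstract does not describe the PNN's architecture or training procedure, so the following is only a minimal, hypothetical sketch of the underlying idea: model a basic block's execution count as a Poisson random variable whose rate is a learned function of the input size, fit it on small inputs, and extrapolate to a large input. Here a simple log-linear Poisson regression (fit by standard IRLS) stands in for the paper's neural network; the polynomial count-growth assumption and all parameter names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data at SMALL input sizes n: a basic block whose
# execution count grows polynomially, count ~ Poisson(exp(a*log n + b)).
# a_true, b_true are made up for this illustration.
a_true, b_true = 2.0, 0.5
n_train = rng.integers(4, 64, size=200)          # small input sizes
x = np.log(n_train.astype(float))
y = rng.poisson(np.exp(a_true * x + b_true))     # observed block counts

# Fit the log-linear Poisson model with iteratively reweighted least
# squares (the standard GLM fit), starting from a least-squares guess.
X = np.column_stack([x, np.ones_like(x)])
beta, *_ = np.linalg.lstsq(X, np.log(y + 1.0), rcond=None)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)                             # current rate estimate
    z = eta + (y - mu) / mu                      # working response
    XtWX = X.T @ (X * mu[:, None])
    XtWz = X.T @ (mu * z)
    beta = np.linalg.solve(XtWX, XtWz)

# Extrapolate the block's execution count to a much larger input size.
n_big = 4096
pred = np.exp(beta[0] * np.log(n_big) + beta[1])
print(beta, pred)
```

Because the counts at even modest input sizes are large, the Poisson noise is small relative to the mean, which is why extrapolation from small inputs can be accurate; the paper's PNN plays the same role for the multi-dimensional input spaces of real GPU benchmarks.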
---|---|
ISSN: | 2690-5965 |
DOI: | 10.1109/ICPADS60453.2023.00270 |