Loading…

Detecting refactoring type of software commit messages based on ensemble machine learning algorithms

Refactoring is a well-established topic in contemporary software engineering, focusing on enhancing software's structural design without altering its external behavior. Commit messages play a vital role in tracking changes to the codebase. However, determining the exact refactoring required in...

Full description

Saved in:
Bibliographic Details
Published in:Scientific reports 2024-09, Vol.14 (1), p.21367-20, Article 21367
Main Authors: Al-Fraihat, Dimah, Sharrab, Yousef, Al-Ghuwairi, Abdel-Rahman, Sbaih, Nour, Qahmash, Ayman
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Refactoring is a well-established topic in contemporary software engineering, focusing on enhancing software's structural design without altering its external behavior. Commit messages play a vital role in tracking changes to the codebase. However, determining the exact refactoring required in the code can be challenging due to various refactoring types. Prior studies have attempted to classify refactoring documentation by type, achieving acceptable results in accuracy, precision, recall, F1-Score, and other performance metrics. Nevertheless, there is room for improvement. To address this, we propose a novel approach using four ensemble Machine Learning algorithms to detect refactoring types. Our experimentation utilized a dataset containing 573 commits, with text cleaning and preprocessing applied to address data imbalances. Various techniques, including hyperparameter optimization, feature engineering with TF-IDF and bag-of-words, and binary transformation using one-vs-one and one-vs-rest classifiers, were employed to enhance accuracy. Results indicate that the experiment involving feature engineering using the TF-IDF technique outperformed other methods. Notably, the XGBoost algorithm with the same technique achieved superior performance across all metrics, attaining 100% accuracy. Moreover, our results surpass the current state-of-the-art performance using the same dataset. Our proposed approach bears significant implications for software engineering, particularly in enhancing the internal quality of software.
ISSN:2045-2322
2045-2322
DOI:10.1038/s41598-024-72307-0