
Efficient straggler task management in cloud environment using stochastic gradient descent with momentum learning-driven neural networks

Bibliographic Details
Published in: Cluster Computing, 2024-07, Vol. 27 (4), p. 4673-4685
Main Authors: Swain, Smruti Rekha; Parashar, Anshu; Singh, Ashutosh Kumar; Lee, Chung Nan
Format: Article
Language: English
Description
Summary: Large-scale computing systems split jobs into smaller tasks that execute in parallel, which accelerates job completion and reduces energy usage. However, cloud computing systems face a significant challenge, the Long Tail problem: a small subset of slow-performing tasks (stragglers) impedes the overall progress of parallel job execution, resulting in longer service response times and decreased system efficiency. To reduce task execution time and energy consumption, we propose an efficient straggler task management framework for cloud data centers. First, a neural network-based resource predictor is developed and tuned with Stochastic Gradient Descent with Momentum to analyze heterogeneous tasks and classify them as stragglers or non-stragglers. The identified stragglers are then further classified, based on their specific resource requirements, into two categories: Resource Hunters and Long-Tail stragglers. A task management policy that takes the task category into account then schedules and allocates resources among user job requests, with the aim of achieving parallelism and enhancing sustainability in the cloud infrastructure. The proposed work is evaluated through extensive simulations on the Google Cluster Dataset (GCD), and the results are compared with state-of-the-art techniques. The experiments show reductions of up to 55.16% in power consumption, up to 35% in the number of active servers, and up to 67.74% in execution time.
ISSN: 1386-7857, 1573-7543
DOI: 10.1007/s10586-023-04191-8
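
The summary mentions a neural network-based predictor tuned with Stochastic Gradient Descent with Momentum that separates stragglers from non-stragglers. The record does not give the network architecture, the input features, or the hyperparameters, so the following is only a minimal sketch of the general technique, not the authors' implementation. It assumes a single sigmoid neuron, two hypothetical normalized features (CPU and memory demand), a made-up labeling rule for the toy data, and illustrative values for the learning rate and momentum coefficient.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd_momentum(samples, labels, lr=0.05, beta=0.9, epochs=50, seed=0):
    """Fit a single-neuron classifier with SGD with momentum.

    Classical momentum update, applied per training sample:
        v <- beta * v - lr * grad
        w <- w + v
    """
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    v_w = [0.0] * n_features  # velocity for each weight
    v_b = 0.0                 # velocity for the bias
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the cross-entropy loss w.r.t. the pre-activation
            for j in range(n_features):
                v_w[j] = beta * v_w[j] - lr * err * x[j]
                weights[j] += v_w[j]
            v_b = beta * v_b - lr * err
            bias += v_b
    return weights, bias


def predict_straggler(weights, bias, x):
    # True means the task is predicted to be a straggler.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


def make_toy_tasks(n=200, seed=1):
    # Hypothetical features: (normalized CPU demand, normalized memory demand).
    # Assumed toy rule: a task is labeled a straggler when its combined demand exceeds 1.0.
    rng = random.Random(seed)
    samples = [(rng.random(), rng.random()) for _ in range(n)]
    labels = [1 if cpu + mem > 1.0 else 0 for cpu, mem in samples]
    return samples, labels


samples, labels = make_toy_tasks()
weights, bias = train_sgd_momentum(samples, labels)
accuracy = sum(
    predict_straggler(weights, bias, x) == bool(y) for x, y in zip(samples, labels)
) / len(samples)
```

The velocity terms accumulate a decaying sum of past gradients, which is what distinguishes this from plain stochastic gradient descent. The second classification stage (Resource Hunters versus Long-Tail stragglers) is not sketched here because the record does not state the criteria used.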