
Adaptive load balancing in cluster computing environment

Bibliographic Details
Published in: The Journal of Supercomputing, 2023-11, Vol. 79 (17), p. 20179-20207
Main Authors: Singh, Tinku; Gupta, Shivam; Satakshi; Kumar, Manish
Format: Article
Language: English
Description
Summary: Despite the availability of high-performance computing servers and clusters to process data, factors such as data skewness, class imbalance, and scalability in big data cause slow processing performance. This study proposes a framework for load balancing in an Apache Spark cluster that makes efficient use of cluster resources and improves overall processing performance. The proposed method first configures the Apache Spark cluster to fix the optimal number of CPU cores and the amount of memory for each executor. The scheme explores the trade-off between workload balance and communication efficiency, and uses coarse-grained and fine-grained data placement strategies for dynamic task allocation. The coarse-grained strategy handles datasets whose partitions outnumber the comparatively few executors by transforming the resilient distributed datasets (RDDs) into smaller datasets. The fine-grained strategy handles datasets with a large number of executors relative to partitions: each executor is treated as a knapsack, and the resulting multidimensional knapsack problem is solved with particle swarm optimization to join the data partitions, which are then transformed into as many RDDs as there are executors. All experiments were carried out on a large-scale dataset of Amazon product reviews in JSON format. The fine-grained and coarse-grained data placement strategies were found to be 26.12% and 42% faster, respectively, in execution time than the default data placement approach.
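
The following is a minimal PySpark sketch of the coarse-grained idea described in the summary: fixing the number of CPU cores and the memory per executor up front, then reducing an RDD whose partition count far exceeds the number of executors. The configuration values, the input path, and the use of coalesce() are illustrative assumptions, not the settings or exact mechanism reported in the paper.

    # Sketch only: executor settings and paths are placeholders.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .setAppName("coarse-grained-placement")
            .set("spark.executor.cores", "4")       # cores per executor (assumed)
            .set("spark.executor.memory", "8g")     # memory per executor (assumed)
            .set("spark.executor.instances", "6"))  # number of executors (assumed)

    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # Amazon-review-style JSON input; the path is a placeholder.
    reviews = spark.read.json("hdfs:///data/amazon_reviews/*.json")
    rdd = reviews.rdd

    num_executors = int(conf.get("spark.executor.instances"))
    if rdd.getNumPartitions() > num_executors:
        # Coarse-grained: merge many small partitions down toward the
        # executor count to cut scheduling and data-movement overhead.
        rdd = rdd.coalesce(num_executors)

Coalescing toward the executor count limits per-task scheduling overhead when many small partitions would otherwise queue on a small pool of executors, which is the trade-off between workload balance and communication efficiency the summary mentions.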
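For the fine-grained case, the summary treats each executor as a knapsack and solves the resulting multidimensional knapsack problem with particle swarm optimization. The sketch below is a greatly simplified, discrete PSO-style load balancer over hypothetical partition sizes; the function name, parameters, and single load-imbalance objective are assumptions and do not reproduce the paper's multidimensional formulation.

    import random

    def balance_partitions(sizes, n_executors, n_particles=20, iters=100,
                           w=0.1, c1=0.4, c2=0.4, seed=0):
        """PSO-style search for a partition-to-executor assignment."""
        rng = random.Random(seed)
        n = len(sizes)

        def load_imbalance(assign):
            loads = [0.0] * n_executors
            for p, e in enumerate(assign):
                loads[e] += sizes[p]
            return max(loads) - min(loads)  # smaller means better balance

        # Each particle is a full partition-to-executor assignment.
        particles = [[rng.randrange(n_executors) for _ in range(n)]
                     for _ in range(n_particles)]
        pbest = [list(p) for p in particles]
        pbest_fit = [load_imbalance(p) for p in particles]
        g = min(range(n_particles), key=lambda i: pbest_fit[i])
        gbest, gbest_fit = list(pbest[g]), pbest_fit[g]

        for _ in range(iters):
            for i, part in enumerate(particles):
                for p in range(n):
                    r = rng.random()
                    if r < w:                          # random exploration
                        part[p] = rng.randrange(n_executors)
                    elif r < w + c1:                   # move toward personal best
                        part[p] = pbest[i][p]
                    elif r < w + c1 + c2:              # move toward global best
                        part[p] = gbest[p]
                    # otherwise keep the current assignment
                fit = load_imbalance(part)
                if fit < pbest_fit[i]:
                    pbest[i], pbest_fit[i] = list(part), fit
                    if fit < gbest_fit:
                        gbest, gbest_fit = list(part), fit
        return gbest, gbest_fit

    # Hypothetical example: 12 skewed partition sizes (MB) over 4 executors.
    sizes = [512, 480, 300, 260, 220, 180, 150, 120, 90, 60, 40, 20]
    assignment, imbalance = balance_partitions(sizes, n_executors=4)
    print(assignment, imbalance)

The returned assignment lists the executor index chosen for each partition; partitions mapped to the same executor would then be combined into one RDD per executor, as described in the summary.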
ISSN: 0920-8542, 1573-0484
DOI: 10.1007/s11227-023-05434-6