Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under Apache Spark environment

Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithms (EAs) based wrappers under Apache Spark. We pr...

Bibliographic Details
Published in: Cluster computing 2023-06, Vol.26 (3), p.1949-1983
Main Authors: Vivek, Yelleti; Ravi, Vadlamani; Krishna, P. Radha
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c474t-a254556051449e1a726d08fda7a386b5f687c277e871991f8ea745f53d28290a3
cites cdi_FETCH-LOGICAL-c474t-a254556051449e1a726d08fda7a386b5f687c277e871991f8ea745f53d28290a3
container_end_page 1983
container_issue 3
container_start_page 1949
container_title Cluster computing
container_volume 26
creator Vivek, Yelleti
Ravi, Vadlamani
Krishna, P. Radha
description Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithm (EA) based wrappers under Apache Spark. We propose two hybrid EAs based on Binary Differential Evolution (BDE) and Binary Threshold Accepting (BTA), namely, (i) Parallel Binary Differential Evolution and Threshold Accepting (PB-DETA), where BDE and BTA work in tandem in every iteration, and (ii) its ablation variant, Parallel Binary Threshold Accepting and Differential Evolution (PB-TADE). Here, BTA is invoked to enhance the search capability and avoid premature convergence of BDE. For comparison purposes, we also parallelized two state-of-the-art algorithms, adaptive DE (ADE) and permutation-based DE (DE-FS PM), and named them PB-ADE and P-DE-FS PM, respectively. Throughout, logistic regression (LR) is employed to compute the fitness function, namely, the area under the receiver operating characteristic curve (AUC). The effectiveness of the proposed algorithms is tested on five big datasets of varying dimensions. Notably, PB-TADE turned out to be statistically significantly better than the rest. All the algorithms exhibited repeatability. The proposed parallel model attained a speedup of 2.2–2.9. We also report the feature subsets with high AUC and least cardinality.
doi_str_mv 10.1007/s10586-022-03725-w
format article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_9463682</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2714657251</sourcerecordid><originalsourceid>FETCH-LOGICAL-c474t-a254556051449e1a726d08fda7a386b5f687c277e871991f8ea745f53d28290a3</originalsourceid><addsrcrecordid>eNp9kctu1TAQhiMEoqXwAiyQJTZsAr472SChqlykSiyAtTVJJjkujh3s5Bz1DXhsXE4plwUbj6X55p_59VfVU0ZfMkrNq8yoanRNOa-pMFzVh3vVKVNG1EZJcb_8RWmbRpmT6lHOV5TS1vD2YXUidBnVsj2tvn_qwUPnkYwI65aQ5K3LuJKMHvvVxUDGmEjnJjLACmTLLkxkgQTeoye76y65geA--u0GhnRNwE8xuXU3kw4yDuSQYFkwkS0M5YUF-l3ZUiS-Egx7l2KYMayPqwcj-IxPbutZ9eXtxefz9_Xlx3cfzt9c1r00cq2BK6mUpopJ2SIDw_VAm3EAA8Vsp0bdmJ4bg41hbcvGBsFINSox8Ia3FMRZ9fqou2zdjENfVhcvdkluLsfbCM7-3QluZ6e4t63UQje8CLy4FUjx24Z5tbPLPXoPAeOWLTdMalXiYAV9_g96FbcUij3LW9ZwrRXTheJHqk8x54Tj3TGM2pug7TFoW4K2P4O2hzL07E8bdyO_ki2AOAK5tMKE6ffu_8j-AMCOt3c</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2918266516</pqid></control><display><type>article</type><title>Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment</title><source>Springer Nature</source><creator>Vivek, Yelleti ; Ravi, Vadlamani ; Krishna, P. Radha</creator><creatorcontrib>Vivek, Yelleti ; Ravi, Vadlamani ; Krishna, P. Radha</creatorcontrib><description>Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithms (EAs) based wrappers under Apache Spark. 
We propose two hybrid EAs based on the Binary Differential Evolution (BDE), and Binary Threshold Accepting (BTA), namely, (i) Parallel Binary Differential Evolution and Threshold Accepting (PB-DETA), where BDE and BTA work in tandem in every iteration, and (ii) its ablation variant, Parallel Binary Threshold Accepting and Differential Evolution (PB-TADE). Here, BTA is invoked to enhance the search capability and avoid premature convergence of BDE. For comparison purposes, we also parallelized two state-of-the-art algorithms: adaptive DE (ADE) and permutation based DE (DE-FS PM ), and named them PB-ADE and P-DE-FS PM respectively. Throughout, logistic regression (LR) is employed to compute the fitness function, namely, area under the receiver operator characteristic curve (AUC). The effectiveness of the proposed algorithms is tested over the five big datasets of varying dimensions. It is noteworthy that the PB-TADE turned out to be statistically significant than the rest. All the algorithms have shown the repeatability property. The proposed parallel model attained a speedup of 2.2–2.9. We also reported feature subset with high AUC and least cardinality.</description><identifier>ISSN: 1386-7857</identifier><identifier>EISSN: 1573-7543</identifier><identifier>DOI: 10.1007/s10586-022-03725-w</identifier><identifier>PMID: 36105649</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Ablation ; Adaptive algorithms ; Big Data ; Computer Communication Networks ; Computer Science ; Data mining ; Datasets ; Design ; Evolutionary algorithms ; Evolutionary computation ; Genetic algorithms ; Iterative methods ; Operating Systems ; Operators (mathematics) ; Optimization ; Parallel processing ; Permutations ; Processor Architectures</subject><ispartof>Cluster computing, 2023-06, Vol.26 (3), p.1949-1983</ispartof><rights>The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022. 
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.</rights><rights>The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022, Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c474t-a254556051449e1a726d08fda7a386b5f687c277e871991f8ea745f53d28290a3</citedby><cites>FETCH-LOGICAL-c474t-a254556051449e1a726d08fda7a386b5f687c277e871991f8ea745f53d28290a3</cites><orcidid>0000-0003-0082-6227</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>230,314,780,784,885,27924,27925</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/36105649$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Vivek, Yelleti</creatorcontrib><creatorcontrib>Ravi, Vadlamani</creatorcontrib><creatorcontrib>Krishna, P. 
Radha</creatorcontrib><title>Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment</title><title>Cluster computing</title><addtitle>Cluster Comput</addtitle><addtitle>Cluster Comput</addtitle><description>Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithms (EAs) based wrappers under Apache Spark. We propose two hybrid EAs based on the Binary Differential Evolution (BDE), and Binary Threshold Accepting (BTA), namely, (i) Parallel Binary Differential Evolution and Threshold Accepting (PB-DETA), where BDE and BTA work in tandem in every iteration, and (ii) its ablation variant, Parallel Binary Threshold Accepting and Differential Evolution (PB-TADE). Here, BTA is invoked to enhance the search capability and avoid premature convergence of BDE. For comparison purposes, we also parallelized two state-of-the-art algorithms: adaptive DE (ADE) and permutation based DE (DE-FS PM ), and named them PB-ADE and P-DE-FS PM respectively. Throughout, logistic regression (LR) is employed to compute the fitness function, namely, area under the receiver operator characteristic curve (AUC). The effectiveness of the proposed algorithms is tested over the five big datasets of varying dimensions. It is noteworthy that the PB-TADE turned out to be statistically significant than the rest. All the algorithms have shown the repeatability property. The proposed parallel model attained a speedup of 2.2–2.9. 
We also reported feature subset with high AUC and least cardinality.</description><subject>Ablation</subject><subject>Adaptive algorithms</subject><subject>Big Data</subject><subject>Computer Communication Networks</subject><subject>Computer Science</subject><subject>Data mining</subject><subject>Datasets</subject><subject>Design</subject><subject>Evolutionary algorithms</subject><subject>Evolutionary computation</subject><subject>Genetic algorithms</subject><subject>Iterative methods</subject><subject>Operating Systems</subject><subject>Operators (mathematics)</subject><subject>Optimization</subject><subject>Parallel processing</subject><subject>Permutations</subject><subject>Processor Architectures</subject><issn>1386-7857</issn><issn>1573-7543</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><recordid>eNp9kctu1TAQhiMEoqXwAiyQJTZsAr472SChqlykSiyAtTVJJjkujh3s5Bz1DXhsXE4plwUbj6X55p_59VfVU0ZfMkrNq8yoanRNOa-pMFzVh3vVKVNG1EZJcb_8RWmbRpmT6lHOV5TS1vD2YXUidBnVsj2tvn_qwUPnkYwI65aQ5K3LuJKMHvvVxUDGmEjnJjLACmTLLkxkgQTeoye76y65geA--u0GhnRNwE8xuXU3kw4yDuSQYFkwkS0M5YUF-l3ZUiS-Egx7l2KYMayPqwcj-IxPbutZ9eXtxefz9_Xlx3cfzt9c1r00cq2BK6mUpopJ2SIDw_VAm3EAA8Vsp0bdmJ4bg41hbcvGBsFINSox8Ia3FMRZ9fqou2zdjENfVhcvdkluLsfbCM7-3QluZ6e4t63UQje8CLy4FUjx24Z5tbPLPXoPAeOWLTdMalXiYAV9_g96FbcUij3LW9ZwrRXTheJHqk8x54Tj3TGM2pug7TFoW4K2P4O2hzL07E8bdyO_ki2AOAK5tMKE6ffu_8j-AMCOt3c</recordid><startdate>20230601</startdate><enddate>20230601</enddate><creator>Vivek, Yelleti</creator><creator>Ravi, Vadlamani</creator><creator>Krishna, P. 
Radha</creator><general>Springer US</general><general>Springer Nature B.V</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>8FE</scope><scope>8FG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0003-0082-6227</orcidid></search><sort><creationdate>20230601</creationdate><title>Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment</title><author>Vivek, Yelleti ; Ravi, Vadlamani ; Krishna, P. Radha</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c474t-a254556051449e1a726d08fda7a386b5f687c277e871991f8ea745f53d28290a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Ablation</topic><topic>Adaptive algorithms</topic><topic>Big Data</topic><topic>Computer Communication Networks</topic><topic>Computer Science</topic><topic>Data mining</topic><topic>Datasets</topic><topic>Design</topic><topic>Evolutionary algorithms</topic><topic>Evolutionary computation</topic><topic>Genetic algorithms</topic><topic>Iterative methods</topic><topic>Operating Systems</topic><topic>Operators (mathematics)</topic><topic>Optimization</topic><topic>Parallel processing</topic><topic>Permutations</topic><topic>Processor Architectures</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Vivek, Yelleti</creatorcontrib><creatorcontrib>Ravi, Vadlamani</creatorcontrib><creatorcontrib>Krishna, P. 
Radha</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>AUTh Library subscriptions: ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection (Proquest) (PQ_SDU_P3)</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Cluster computing</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Vivek, Yelleti</au><au>Ravi, Vadlamani</au><au>Krishna, P. 
Radha</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment</atitle><jtitle>Cluster computing</jtitle><stitle>Cluster Comput</stitle><addtitle>Cluster Comput</addtitle><date>2023-06-01</date><risdate>2023</risdate><volume>26</volume><issue>3</issue><spage>1949</spage><epage>1983</epage><pages>1949-1983</pages><issn>1386-7857</issn><eissn>1573-7543</eissn><abstract>Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithms (EAs) based wrappers under Apache Spark. We propose two hybrid EAs based on the Binary Differential Evolution (BDE), and Binary Threshold Accepting (BTA), namely, (i) Parallel Binary Differential Evolution and Threshold Accepting (PB-DETA), where BDE and BTA work in tandem in every iteration, and (ii) its ablation variant, Parallel Binary Threshold Accepting and Differential Evolution (PB-TADE). Here, BTA is invoked to enhance the search capability and avoid premature convergence of BDE. For comparison purposes, we also parallelized two state-of-the-art algorithms: adaptive DE (ADE) and permutation based DE (DE-FS PM ), and named them PB-ADE and P-DE-FS PM respectively. Throughout, logistic regression (LR) is employed to compute the fitness function, namely, area under the receiver operator characteristic curve (AUC). The effectiveness of the proposed algorithms is tested over the five big datasets of varying dimensions. It is noteworthy that the PB-TADE turned out to be statistically significant than the rest. All the algorithms have shown the repeatability property. The proposed parallel model attained a speedup of 2.2–2.9. 
We also reported feature subset with high AUC and least cardinality.</abstract><cop>New York</cop><pub>Springer US</pub><pmid>36105649</pmid><doi>10.1007/s10586-022-03725-w</doi><tpages>35</tpages><orcidid>https://orcid.org/0000-0003-0082-6227</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1386-7857
ispartof Cluster computing, 2023-06, Vol.26 (3), p.1949-1983
issn 1386-7857
1573-7543
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_9463682
source Springer Nature
subjects Ablation
Adaptive algorithms
Big Data
Computer Communication Networks
Computer Science
Data mining
Datasets
Design
Evolutionary algorithms
Evolutionary computation
Genetic algorithms
Iterative methods
Operating Systems
Operators (mathematics)
Optimization
Parallel processing
Permutations
Processor Architectures
title Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under Apache Spark environment
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T04%3A37%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Scalable%20feature%20subset%20selection%20for%20big%20data%20using%20parallel%20hybrid%20evolutionary%20algorithm%20based%20wrapper%20under%20apache%20spark%20environment&rft.jtitle=Cluster%20computing&rft.au=Vivek,%20Yelleti&rft.date=2023-06-01&rft.volume=26&rft.issue=3&rft.spage=1949&rft.epage=1983&rft.pages=1949-1983&rft.issn=1386-7857&rft.eissn=1573-7543&rft_id=info:doi/10.1007/s10586-022-03725-w&rft_dat=%3Cproquest_pubme%3E2714657251%3C/proquest_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c474t-a254556051449e1a726d08fda7a386b5f687c277e871991f8ea745f53d28290a3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2918266516&rft_id=info:pmid/36105649&rfr_iscdi=true
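
The description field above outlines a wrapper in which a binary search procedure proposes feature masks and a classifier's AUC scores them. The following is a minimal single-machine sketch of that idea, not the authors' implementation: it shows only a threshold-accepting style search over bit masks, without the differential evolution component or any Spark parallelism. The function names, the parameters (`iters`, `threshold`, `decay`), the toy data, and the sum-of-selected-features scorer standing in for logistic regression are all assumptions made for illustration.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_fitness(X, y):
    """Build a fitness function over feature masks.

    Assumption: the score of a row is the sum of its selected features.
    This is a hypothetical stand-in for a trained logistic regression.
    """
    def fitness(mask):
        if not any(mask):
            return 0.0  # an empty feature subset is treated as worthless
        scores = [sum(v for v, m in zip(row, mask) if m) for row in X]
        return auc(scores, y)
    return fitness


def threshold_accepting(fitness, n_features, iters=200,
                        threshold=0.05, decay=0.95, seed=0):
    """Flip one bit per step; accept if the fitness drop is below threshold."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_features)]
    cur_fit = fitness(current)
    best, best_fit = current[:], cur_fit
    for _ in range(iters):
        cand = current[:]
        cand[rng.randrange(n_features)] ^= 1  # flip a single feature bit
        cand_fit = fitness(cand)
        if cur_fit - cand_fit < threshold:  # tolerate small deteriorations
            current, cur_fit = cand, cand_fit
            if cur_fit > best_fit:
                best, best_fit = current[:], cur_fit
        threshold *= decay  # tighten the acceptance threshold over time
    return best, best_fit


# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0]
fitness = make_fitness(X, y)
best_mask, best_auc = threshold_accepting(fitness, n_features=3)
```

In a setting closer to the one described, `fitness` would train and evaluate a logistic regression on the selected columns, and the evaluation of many candidate masks is the part one would expect to distribute.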