DDSC-SMOTE: an imbalanced data oversampling algorithm based on data distribution and spectral clustering

Imbalanced data poses a significant challenge in machine learning, as conventional classification algorithms often prioritize majority class samples, while accurately classifying minority class samples is more crucial. The synthetic minority oversampling technique (SMOTE) represents one of the most...

Bibliographic Details
Published in:	The Journal of supercomputing 2024, Vol.80 (12), p.17760-17789
Main Authors:	Li, Xinqi ; Liu, Qicheng
Format:	Article
Language:	English
Subjects:	Adaptive sampling ; Algorithms ; Classification ; Clustering ; Clusters ; Compilers ; Computer Science ; Data analysis ; Interpreters ; Machine learning ; Oversampling ; Processor Architectures ; Programming Languages

container_end_page	17789
container_issue	12
container_start_page	17760
container_title	The Journal of supercomputing
container_volume	80
creator	Li, Xinqi ; Liu, Qicheng
description	Imbalanced data poses a significant challenge in machine learning, as conventional classification algorithms often prioritize majority class samples, while accurately classifying minority class samples is more crucial. The synthetic minority oversampling technique (SMOTE) represents one of the most renowned methods for handling imbalanced data. However, both SMOTE and its variants have limitations due to their insufficient consideration of data distribution, leading to the generation of incorrect and unnecessary samples. This paper, therefore, introduces a novel oversampling algorithm called data distribution and spectral clustering-based SMOTE (DDSC-SMOTE). This algorithm addresses the shortcomings of SMOTE by introducing three innovative data distribution-based improvement strategies: adaptive allocation of synthetic sample quantities strategy, seed sample adaptive selection strategy, and synthetic sample improvement strategy. First, we use the k-nearest neighbor sample labels and the local outlier factor algorithm to remove noisy and outlier samples. Next, we leverage spectral clustering to identify clusters within the minority class and propose a dual-weight factor that considers inter-cluster and intra-cluster distances to allocate the number of synthetic samples effectively, addressing interclass and intraclass imbalances. Furthermore, we introduce a relative position weight coefficient to determine the probability of selecting seed samples within the subcluster, ensuring that important minority samples have higher chances of being sampled. Finally, we improve the SMOTE sample synthesis formula for safer generation. Extensive comparisons on real datasets from the UCI repository demonstrate that DDSC-SMOTE outperforms seven state-of-the-art oversampling algorithms significantly in terms of G-mean and F1-score, presenting a data distribution-focused solution for addressing imbalanced data challenges.
doi_str_mv	10.1007/s11227-024-06132-7
format	article
identifier	ISSN: 0920-8542
ispartof	The Journal of supercomputing, 2024, Vol.80 (12), p.17760-17789
issn	0920-8542 1573-0484
language	eng
recordid	cdi_proquest_journals_3077092507
source	Springer Nature
subjects	Adaptive sampling ; Algorithms ; Classification ; Clustering ; Clusters ; Compilers ; Computer Science ; Data analysis ; Interpreters ; Machine learning ; Oversampling ; Processor Architectures ; Programming Languages
title	DDSC-SMOTE: an imbalanced data oversampling algorithm based on data distribution and spectral clustering
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T07%3A07%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=DDSC-SMOTE:%20an%20imbalanced%20data%20oversampling%20algorithm%20based%20on%20data%20distribution%20and%20spectral%20clustering&rft.jtitle=The%20Journal%20of%20supercomputing&rft.au=Li,%20Xinqi&rft.date=2024&rft.volume=80&rft.issue=12&rft.spage=17760&rft.epage=17789&rft.pages=17760-17789&rft.issn=0920-8542&rft.eissn=1573-0484&rft_id=info:doi/10.1007/s11227-024-06132-7&rft_dat=%3Cproquest_cross%3E3077092507%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c270t-acba5cacb446767240e9168b1c68dcc9fbf53afa79e6d1d22fb6175d27a743eb3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3077092507&rft_id=info:pmid/&rfr_iscdi=true
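
The description in this record says DDSC-SMOTE improves on the SMOTE sample synthesis formula, but the record does not give the improved formula. As background only, the following is a minimal sketch of the classic SMOTE interpolation step that the paper builds on, not the authors' method. The function name `smote_sketch`, its parameters, and the choice of NumPy are assumptions made for illustration. Each synthetic point is placed at a random position on the line segment between a minority seed sample and one of its k nearest minority neighbors.

```python
import numpy as np


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Classic SMOTE interpolation (illustrative; NOT the DDSC-SMOTE variant).

    minority    : array-like of shape (n_samples, n_features), minority class only
    n_synthetic : number of synthetic samples to generate
    k           : number of nearest minority neighbors considered per seed
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a sample is never its own neighbor.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Uniformly pick a seed sample, then one of its k nearest neighbors.
    seeds = rng.integers(0, n, size=n_synthetic)
    picks = neighbors[seeds, rng.integers(0, k, size=n_synthetic)]

    # x_new = x_seed + lam * (x_neighbor - x_seed), with lam drawn from [0, 1).
    lam = rng.random((n_synthetic, 1))
    return X[seeds] + lam * (X[picks] - X[seeds])
```

In this baseline every seed is chosen uniformly and every minority region receives synthetic samples without regard to the local distribution. According to the description, DDSC-SMOTE replaces those choices with noise filtering, cluster-aware allocation, and weighted seed selection.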