
On the characterization of noise filters for self-training semi-supervised in nearest neighbor classification

Bibliographic Details
Published in: Neurocomputing (Amsterdam), 2014-05, Vol. 132, p. 30-41
Main Authors: Triguero, Isaac; Sáez, José A.; Luengo, Julián; García, Salvador; Herrera, Francisco
Format: Article
Language: English
Description: Semi-supervised classification methods have received much attention as suitable tools to tackle training sets with large amounts of unlabeled data and a small quantity of labeled data. Several semi-supervised learning models have been proposed with different assumptions about the characteristics of the input data. Among them, the self-training process has emerged as a simple and effective technique, which does not require any specific hypotheses about the training data. Despite its effectiveness, the self-training algorithm usually makes erroneous predictions, mainly at the initial stages, if noisy examples are labeled and incorporated into the training set. Noise filters are commonly used to remove corrupted data in standard classification. In 2005, Li and Zhou proposed the addition of a statistical filter to the self-training process. Nevertheless, in this approach, filtering methods have to deal with a reduced number of labeled instances and the erroneous predictions this may induce. In this work, we analyze the integration of a wide variety of noise filters into the self-training process to distinguish the most relevant features of filters. We focus on the nearest neighbor rule as a base classifier and ten different noise filters. We provide an extensive analysis of the performance of these filters considering different ratios of labeled data. The results are contrasted with nonparametric statistical tests that allow us to identify relevant filters, and their main characteristics, in the field of semi-supervised learning.

Highlights:
•The filtering process is more complex in SSL due to the small number of labeled examples.
•Inclusion of erroneous examples in the labeled data can alter inductive capabilities.
•Filtered self-training finds robust learned hypotheses to predict unseen cases.
•Global filters stand out as the best performing family of filters in SSL.
•Local approaches need more labeled data to perform better.
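The scheme studied in the abstract can be sketched in a few lines: a 1-NN base classifier labels a small batch of unlabeled points per iteration, and a noise filter prunes suspect examples before the next round. The sketch below uses Wilson's Edited Nearest Neighbor (ENN) rule as one representative filter; it is an illustrative simplification, not the authors' experimental setup, and all function names and parameters here are our own.

```python
import math

def nn_predict(labeled, point):
    # 1-NN rule: return the label of the closest labeled example
    # (Euclidean distance via math.dist)
    return min(labeled, key=lambda ex: math.dist(ex[0], point))[1]

def enn_filter(labeled, k=3):
    # Edited Nearest Neighbor: discard an example when its label
    # disagrees with the majority of its k nearest neighbors
    kept = []
    for i, (x, y) in enumerate(labeled):
        others = [ex for j, ex in enumerate(labeled) if j != i]
        neighbors = sorted(others, key=lambda ex: math.dist(ex[0], x))[:k]
        votes = [lab for _, lab in neighbors]
        if votes.count(y) * 2 > len(votes):  # strict majority agrees
            kept.append((x, y))
    return kept

def self_training(labeled, unlabeled, rounds=5, per_round=2):
    # Self-training loop: label a few unlabeled points per round with
    # the current model, then filter possible noise before continuing
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        batch, pool = pool[:per_round], pool[per_round:]
        labeled += [(x, nn_predict(labeled, x)) for x in batch]
        labeled = enn_filter(labeled)
    return labeled
```

In the paper's terms, swapping `enn_filter` for a different filtering strategy (local, global, or ensemble-based) is exactly the design axis being compared; the loop itself stays fixed.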
DOI: 10.1016/j.neucom.2013.05.055
Publisher: Amsterdam: Elsevier B.V.
ISSN: 0925-2312
EISSN: 1872-8286
Subjects:
Algorithms
Applied sciences
Artificial intelligence
Classification
Computer science; control theory; systems
Exact sciences and technology
Hypotheses
Learning
Learning and adaptive systems
Mathematical models
Nearest neighbor classification
Noise
Noise filters
Noisy data
Self-training
Semi-supervised learning
Training