On the characterization of noise filters for self-training semi-supervised in nearest neighbor classification
Published in: | Neurocomputing (Amsterdam) 2014-05, Vol.132, p.30-41 |
---|---|
Main Authors: | , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Semi-supervised classification methods have received much attention as suitable tools to tackle training sets with large amounts of unlabeled data and a small quantity of labeled data. Several semi-supervised learning models have been proposed with different assumptions about the characteristics of the input data. Among them, the self-training process has emerged as a simple and effective technique, which does not require any specific hypotheses about the training data. Despite its effectiveness, the self-training algorithm often makes erroneous predictions, mainly at the initial stages, if noisy examples are labeled and incorporated into the training set.
Noise filters are commonly used to remove corrupted data in standard classification. In 2005, Li and Zhou proposed the addition of a statistical filter to the self-training process. Nevertheless, in this approach, filtering methods have to deal with a reduced number of labeled instances and the erroneous predictions it may induce. In this work, we analyze the integration of a wide variety of noise filters into the self-training process to distinguish the most relevant features of filters. We will focus on the nearest neighbor rule as a base classifier and ten different noise filters. We provide an extensive analysis of the performance of these filters considering different ratios of labeled data. The results are contrasted with nonparametric statistical tests that allow us to identify relevant filters, and their main characteristics, in the field of semi-supervised learning.
• The filtering process is more complex in SSL because few labeled examples are available.
• Including erroneous examples in the labeled data can alter inductive capabilities.
• Filtered self-training finds robust learned hypotheses to predict unseen cases.
• Global filters stand out as the best-performing family of filters in SSL.
• Local approaches need more labeled data to perform better. |
---|---|
ISSN: | 0925-2312 1872-8286 |
DOI: | 10.1016/j.neucom.2013.05.055 |
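
The summary describes self-training with a nearest neighbor base classifier and a noise filter applied to newly labeled examples. The following is a minimal sketch of that idea, not the authors' implementation. It rests on several assumptions that do not come from this record: the filter is an edited-nearest-neighbor style rule chosen only for illustration, confidence is approximated by distance to the nearest labeled point, only newly labeled points are filtered, and rejected points are discarded. All function names and parameters (`nearest`, `predict_1nn`, `enn_filter`, `self_train`, `per_iter`, `k_filter`) are hypothetical.

```python
import math
from collections import Counter


def nearest(point, data, k=1, skip=None):
    """Indices of the k rows of `data` closest to `point` (Euclidean distance)."""
    candidates = [i for i in range(len(data)) if i != skip]
    candidates.sort(key=lambda i: math.dist(point, data[i]))
    return candidates[:k]


def predict_1nn(point, X, y):
    """Label of the single nearest labeled example (the 1-NN rule)."""
    return y[nearest(point, X, k=1)[0]]


def enn_filter(X, y, k=3):
    """Illustrative filter (assumption): keep the indices of examples whose
    label matches the majority label of their k nearest neighbors."""
    keep = []
    for i, point in enumerate(X):
        votes = Counter(y[j] for j in nearest(point, X, k=k, skip=i))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return keep


def self_train(X_lab, y_lab, X_unlab, per_iter=2, max_iter=10, k_filter=3):
    """Self-training with 1-NN; each batch of newly labeled points must pass
    the filter before it joins the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        if not pool:
            break
        # Score each unlabeled point by the distance to its nearest labeled
        # point (a confidence proxy assumed for this sketch).
        scored = []
        for u in pool:
            j = nearest(u, X_lab, k=1)[0]
            scored.append((math.dist(u, X_lab[j]), u, y_lab[j]))
        scored.sort(key=lambda t: t[0])
        chosen = scored[:per_iter]

        # Run the filter on the enlarged set, but only let it veto new points.
        cand_X = X_lab + [u for _, u, _ in chosen]
        cand_y = y_lab + [label for _, _, label in chosen]
        kept = set(enn_filter(cand_X, cand_y, k=k_filter))
        n_old = len(X_lab)
        for offset, (_, u, label) in enumerate(chosen):
            pool.remove(u)  # rejected points are dropped, not retried
            if n_old + offset in kept:
                X_lab.append(u)
                y_lab.append(label)
    return X_lab, y_lab


# Toy data: two well-separated clusters with three labeled points each.
X_lab = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_lab = [0, 0, 0, 1, 1, 1]
X_unlab = [(0.5, 0.5), (1, 1), (5.5, 5.5), (6, 6)]
X_new, y_new = self_train(X_lab, y_lab, X_unlab)
```

Protecting the originally labeled examples from removal is a design choice made for this sketch. With very few labeled examples, a neighborhood-based filter has little evidence to work with, which is consistent with the highlight that local approaches need more labeled data.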