Distributed Nonlinear Semiparametric Support Vector Machine for Big Data Applications on Spark Frameworks
Published in: | IEEE transactions on systems, man, and cybernetics. Systems, 2020-11, Vol.50 (11), p.4664-4675 |
---|---|
Main Authors: | , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | In recent years there has been a noticeable increase in the number of available Big Data infrastructures. This has promoted the adaptation of traditional machine learning techniques so that they can address large-scale problems in distributed environments. Kernel methods like support vector machines (SVMs) suffer from scalability problems due to their nonparametric nature and the complexity of their training procedures. In this paper, we propose a new and efficient distributed implementation of a training procedure for nonlinear semiparametric (budgeted) SVMs called distributed iterative reweighted least squares (IRWLS). This algorithm uses k-means to select the centroids of the semiparametric model and a new distributed algorithmic implementation of the IRWLS optimization procedure to find the weights of the model. We have implemented the proposed algorithm in Apache Spark and benchmarked it against other state-of-the-art methods, either full SVM (p-pack SVM) or budgeted (budgeted stochastic gradient descent). Experimental results show that the proposed algorithm achieves higher accuracy while controlling the size of the final model, and also offers high performance in terms of run time and efficiency when processing very large datasets (the computation time grows linearly with the number of training patterns). |
---|---|
ISSN: | 2168-2216 2168-2232 |
DOI: | 10.1109/TSMC.2018.2858778 |
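
The summary names two steps: k-means to pick the centroids of a budgeted model, then IRWLS to fit the weights. The record does not give the update rules, so the following is only a minimal single-machine sketch of that general idea, not the paper's distributed Spark implementation. Everything specific in it is an assumption made for illustration: the Gaussian kernel, the hinge-loss weighting `a = C / u`, the cap `eps`, the small ridge term, and all function names (`rbf`, `kmeans`, `solve`, `train_irwls`, `predict`).

```python
import math
import random


def rbf(x, c, gamma):
    """Gaussian kernel between a sample x and a centroid c (assumed kernel)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, c)))


def kmeans(X, m, iters=20, seed=0):
    """Plain Lloyd iterations; returns m centroids (the model budget)."""
    rng = random.Random(seed)
    centroids = [list(x) for x in rng.sample(X, m)]
    for _ in range(iters):
        groups = [[] for _ in range(m)]
        for x in X:
            dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
            groups[dists.index(min(dists))].append(x)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def train_irwls(X, y, m=4, C=10.0, gamma=0.5, iters=30, eps=1e-3):
    """Fit f(x) = sum_j beta_j k(x, c_j) + b with labels y in {-1, +1}."""
    centroids = kmeans(X, m)
    # One feature per centroid plus a constant 1 for the bias term.
    Phi = [[rbf(x, c, gamma) for c in centroids] + [1.0] for x in X]
    Kc = [[rbf(ci, cj, gamma) for cj in centroids] for ci in centroids]
    d = m + 1
    beta = [0.0] * d
    for _ in range(iters):
        # Regularizer: K_c on the kernel block, tiny ridge on the diagonal.
        A = [[(Kc[i][j] if i < m and j < m else 0.0) + (1e-8 if i == j else 0.0)
              for j in range(d)] for i in range(d)]
        rhs = [0.0] * d
        # Both A and rhs are plain sums over samples, so partial sums per data
        # partition could be added together (an assumption about how a
        # distributed version might be organized).
        for phi, yi in zip(Phi, y):
            u = 1.0 - yi * sum(p * w for p, w in zip(phi, beta))
            if u <= 0.0:
                continue  # sample is outside the margin: zero weight
            a = C / max(u, eps)  # reweighting of the hinge loss (assumed form)
            for i in range(d):
                rhs[i] += a * phi[i] * yi
                for j in range(d):
                    A[i][j] += a * phi[i] * phi[j]
        beta = solve(A, rhs)  # weighted least-squares step
    return centroids, beta


def predict(x, centroids, beta, gamma=0.5):
    """Sign of the decision function; beta[-1] is the bias."""
    f = sum(b * rbf(x, c, gamma) for b, c in zip(beta, centroids)) + beta[-1]
    return 1 if f >= 0 else -1
```

Calling `train_irwls(X, y, m=2)` on a list of feature vectors `X` and labels `y` in {-1, +1} returns the centroids and a weight vector of length `m + 1`. The model size is fixed by `m` regardless of how many training samples there are, which is the sense in which the model is budgeted.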