
Distributed Nonlinear Semiparametric Support Vector Machine for Big Data Applications on Spark Frameworks

Bibliographic Details
Published in: IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020-11, Vol. 50 (11), p. 4664-4675
Main Authors: Diaz-Morales, Roberto; Navia-Vazquez, Angel
Format: Article
Language: English
Description
Summary: In recent years there has been a noticeable increase in the number of available Big Data infrastructures. This fact has promoted the adaptation of traditional machine learning techniques so that they can address large-scale problems in distributed environments. Kernel methods such as support vector machines (SVMs) suffer from scalability problems due to their nonparametric nature and the complexity of their training procedures. In this paper, we propose a new and efficient distributed implementation of a training procedure for nonlinear semiparametric (budgeted) SVMs, called distributed iterative reweighted least squares (IRWLS). This algorithm uses k-means to select the centroids of the semiparametric model and a new distributed algorithmic implementation of the IRWLS optimization procedure to find the weights of the model. We have implemented the proposed algorithm in Apache Spark and benchmarked it against other state-of-the-art methods, either full SVM (p-pack SVM) or budgeted (budgeted stochastic gradient descent). Experimental results show that the proposed algorithm achieves higher accuracy while controlling the size of the final model, and also offers high performance in terms of run time and efficiency when processing very large datasets (the computation time grows linearly with the number of training patterns).
ISSN: 2168-2216
2168-2232
DOI: 10.1109/TSMC.2018.2858778
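
The abstract outlines a two-step recipe: choose the centroids of the budgeted model with k-means, then fit the model weights with an IRWLS loop. A minimal single-machine sketch of that idea is given below, assuming a Gaussian kernel, a simplified hinge-loss reweighting rule (weight C/|e_i| for margin violators, 0 otherwise), and scikit-learn's KMeans; the function and parameter names are illustrative only and do not correspond to the paper's distributed Spark implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def rbf(A, B, gamma=0.5):
        # Gaussian kernel matrix between the rows of A and the rows of B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def semiparametric_svm(X, y, m=50, C=1.0, gamma=0.5, iters=20, reg=1e-6):
        # 1) Select the m centroids of the semiparametric model with k-means.
        km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)
        centroids = km.cluster_centers_
        Knc = rbf(X, centroids, gamma)          # n x m kernel matrix
        Kcc = rbf(centroids, centroids, gamma)  # m x m regularization term
        beta = np.zeros(m)
        for _ in range(iters):
            # 2) IRWLS-style reweighting: margin violators (y_i * f(x_i) < 1)
            #    get weight C / |e_i|, the rest get weight 0.
            e = y - Knc @ beta
            a = np.where(y * e > 0, C / np.maximum(np.abs(e), 1e-12), 0.0)
            # 3) Solve the resulting weighted least-squares problem for the weights.
            A = Knc * a[:, None]                # diag(a) @ Knc
            beta = np.linalg.solve(Knc.T @ A + reg * Kcc, A.T @ y)
        return centroids, beta

    # Usage: labels y in {-1, +1}; predictions for new data Xnew are
    # np.sign(rbf(Xnew, centroids, gamma) @ beta).

The paper's contribution is the distribution of the kernel-matrix and weighted least-squares computations across a Spark cluster so that the cost grows linearly with the number of training patterns; the loop above only illustrates the underlying semiparametric IRWLS idea.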