
Distributed Nonlinear Semiparametric Support Vector Machine for Big Data Applications on Spark Frameworks

Bibliographic Details
Published in: IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020-11, Vol. 50 (11), p. 4664-4675
Main Authors: Diaz-Morales, Roberto; Navia-Vazquez, Angel
Format: Article
Language: English
Description
Summary: In recent years there has been a noticeable increase in the number of available Big Data infrastructures. This fact has promoted the adaptation of traditional machine learning techniques so that they can address large-scale problems in distributed environments. Kernel methods such as support vector machines (SVMs) suffer from scalability problems due to their nonparametric nature and the complexity of their training procedures. In this paper, we propose a new and efficient distributed implementation of a training procedure for nonlinear semiparametric (budgeted) SVMs, called distributed iterative reweighted least squares (IRWLS). This algorithm uses k-means to select the centroids of the semiparametric model and a new distributed algorithmic implementation of the IRWLS optimization procedure to find the weights of the model. We have implemented the proposed algorithm in Apache Spark and benchmarked it against other state-of-the-art methods, either full SVM (p-pack SVM) or budgeted (budgeted stochastic gradient descent). Experimental results show that the proposed algorithm achieves higher accuracy while controlling the size of the final model, and also offers high performance in terms of run time and efficiency when processing very large datasets (the computation time grows linearly with the number of training patterns).
ISSN: 2168-2216
2168-2232
DOI: 10.1109/TSMC.2018.2858778
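
The abstract outlines a two-step recipe: choose the centroids of the budgeted model with k-means, then fit the model weights with an IRWLS loop. A minimal single-machine sketch of that idea is given below, assuming a Gaussian kernel, a simplified hinge-loss reweighting rule (weight C/|e_i| for margin violators, 0 otherwise), and scikit-learn's KMeans; the function and parameter names are illustrative only and do not correspond to the paper's distributed Spark implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def rbf(A, B, gamma=0.5):
        # Gaussian kernel matrix between the rows of A and the rows of B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def semiparametric_svm(X, y, m=50, C=1.0, gamma=0.5, iters=20, reg=1e-6):
        # 1) Select the m centroids of the semiparametric model with k-means.
        km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)
        centroids = km.cluster_centers_
        Knc = rbf(X, centroids, gamma)          # n x m kernel matrix
        Kcc = rbf(centroids, centroids, gamma)  # m x m regularization term
        beta = np.zeros(m)
        for _ in range(iters):
            # 2) IRWLS-style reweighting: margin violators (y_i * f(x_i) < 1)
            #    get weight C / |e_i|, the rest get weight 0.
            e = y - Knc @ beta
            a = np.where(y * e > 0, C / np.maximum(np.abs(e), 1e-12), 0.0)
            # 3) Solve the resulting weighted least-squares problem for the weights.
            A = Knc * a[:, None]                # diag(a) @ Knc
            beta = np.linalg.solve(Knc.T @ A + reg * Kcc, A.T @ y)
        return centroids, beta

    # Usage: labels y in {-1, +1}; predictions for new data Xnew are
    # np.sign(rbf(Xnew, centroids, gamma) @ beta).

The paper's contribution is the distribution of the kernel-matrix and weighted least-squares computations across a Spark cluster so that the cost grows linearly with the number of training patterns; the loop above only illustrates the underlying semiparametric IRWLS idea.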