A Parallel Multilevel Feature Selection algorithm for improved cancer classification

Bibliographic Details
Published in:Journal of parallel and distributed computing 2020-04, Vol.138, p.78-98
Main Authors: Venkataramana, Lokeswari; Jacob, Shomona Gracia; Ramadoss, Rajavel
Format: Article
Language:English
Description
Summary:Biological data tends to grow exponentially, so processing it consumes ever more resources, time and manpower. Parallelizing algorithms can reduce overall execution time, but parallelizing computational methods raises two main challenges: (1) biological data is multi-dimensional in nature, and (2) parallel algorithms reduce execution time at the cost of reduced prediction accuracy. This paper addresses both issues with two approaches: (1) vertical partitioning of data along the feature space and horizontal partitioning along samples, to ease data parallelism; and (2) a parallel Multilevel Feature Selection (M-FS) algorithm that selects optimal and important features for improved classification of cancer sub-types. The selected features are evaluated using parallel Random Forest on Spark, and the results are compared with previously reported results and with sequential execution of the same algorithms. The proposed parallel M-FS algorithm was also compared with existing parallel feature selection algorithms in terms of accuracy and execution time. The results show that parallel M-FS improved cancer classification, giving prediction accuracy ranging from ∼85% to ∼99% with very high speedup and execution times measured in seconds. By contrast, existing sequential algorithms yielded prediction accuracy of ∼65% to ∼99% with execution times of more than 24 hours.
•Biological data keeps growing, and handling this volume of data is challenging.
•Parallel algorithms address this issue with increased speedup, but affect accuracy.
•The Parallel Multilevel Feature Selection method applies vertical and horizontal partitioning.
•The Parallel Multilevel Feature Selection method selects optimal and important features.
•Feature selection followed by classification improves classification accuracy.
•The proposed method improved accuracy at high speedup compared to existing methods.
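The partitioning idea in the summary can be illustrated with a small sketch. This is not the authors' M-FS algorithm: the function names are hypothetical, a per-feature variance score stands in for whatever selection criteria the paper uses, and a local thread pool stands in for Spark. It only shows how splitting the feature space into vertical blocks lets each block be scored independently, and how samples can be split into horizontal chunks.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import pvariance


def vertical_partitions(rows, n_parts):
    """Split feature (column) indices into at most n_parts contiguous blocks."""
    n_features = len(rows[0])
    size = -(-n_features // n_parts)  # ceiling division
    return [list(range(start, min(start + size, n_features)))
            for start in range(0, n_features, size)]


def horizontal_partitions(rows, n_parts):
    """Split samples (rows) into at most n_parts contiguous chunks."""
    size = -(-len(rows) // n_parts)  # ceiling division
    return [rows[start:start + size] for start in range(0, len(rows), size)]


def score_block(rows, feature_ids):
    """Assumed placeholder score: population variance of each feature."""
    return {j: pvariance([row[j] for row in rows]) for j in feature_ids}


def select_features(rows, n_parts, top_k):
    """Score each vertical block in parallel and keep the top_k features."""
    blocks = vertical_partitions(rows, n_parts)
    scores = {}
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        for partial in pool.map(lambda block: score_block(rows, block), blocks):
            scores.update(partial)
    # Rank by descending score; break ties by feature index for determinism.
    ranked = sorted(scores, key=lambda j: (-scores[j], j))
    return sorted(ranked[:top_k])
```

With a toy matrix such as `[[1, 0, 5, 2], [1, 10, 5, 4], [1, 20, 5, 6]]`, the two constant columns score zero and are dropped when `top_k=2`. In a real pipeline the scoring step would be replaced by the chosen feature selection criterion and the thread pool by a distributed engine.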
ISSN:0743-7315, 1096-0848
DOI:10.1016/j.jpdc.2019.12.015