A Parallel Multilevel Feature Selection algorithm for improved cancer classification
Published in: | Journal of parallel and distributed computing 2020-04, Vol.138, p.78-98 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Summary: | Biological data tends to grow exponentially, and processing it consumes ever more resources, time and manpower. Parallelizing algorithms can reduce overall execution time, but parallelizing computational methods poses two main challenges: (1) biological data is multi-dimensional in nature, and (2) parallel algorithms reduce execution time at the cost of reduced prediction accuracy. This paper targets both issues and proposes the following approaches: (1) vertical partitioning of the data along the feature space and horizontal partitioning along the samples, to ease data parallelism; and (2) a parallel Multilevel Feature Selection (M-FS) algorithm that selects optimal and important features for improved classification of cancer sub-types. The selected features are evaluated using a parallel Random Forest on Spark and compared both with previously reported results and with the results of sequential execution of the same algorithms. The proposed parallel M-FS algorithm was also compared with existing parallel feature selection algorithms in terms of accuracy and execution time. The results show that the parallel multilevel feature selection algorithm improved cancer classification, yielding prediction accuracy ranging from ∼85% to ∼99% with very high speedup and execution times measured in seconds. By contrast, existing sequential algorithms yielded prediction accuracy of ∼65% to ∼99% with execution times of more than 24 hours.
• Biological data keeps growing, and dealing with this huge volume of data is a challenging task.
• Parallel algorithms address this issue with increased speedup, but they affect accuracy.
• The Parallel Multilevel Feature Selection method applies vertical and horizontal partitioning.
• The Parallel Multilevel Feature Selection method selects optimal and important features.
• Feature selection followed by classification improves classification accuracy.
• The proposed method improved accuracy at high speedup compared to existing methods. |
---|---|
ISSN: | 0743-7315 1096-0848 |
DOI: | 10.1016/j.jpdc.2019.12.015 |
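
The partitioning idea described in the summary can be illustrated with a minimal sketch. This is not the paper's M-FS algorithm, whose details are not given in this record. It is a generic two-level selection under stated assumptions: features are split into vertical blocks, each block is scored in parallel, and the survivors are re-scored together. The variance-based scorer, the thread pool (standing in for Spark) and all function names (`vertical_partitions`, `horizontal_partitions`, `top_k_by_variance`, `two_level_select`) are hypothetical choices made only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import pvariance


def vertical_partitions(n_features, n_blocks):
    """Split feature indices 0..n_features-1 into contiguous blocks."""
    size = -(-n_features // n_blocks)  # ceiling division
    return [list(range(i, min(i + size, n_features)))
            for i in range(0, n_features, size)]


def horizontal_partitions(rows, n_blocks):
    """Split the sample rows into contiguous blocks."""
    size = -(-len(rows) // n_blocks)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def top_k_by_variance(rows, feature_ids, k):
    """Placeholder scorer (an assumption, not the paper's criterion):
    keep the k features with the highest population variance."""
    scored = [(pvariance([r[j] for r in rows]), j) for j in feature_ids]
    scored.sort(key=lambda t: (-t[0], t[1]))  # high variance first, ties by index
    return [j for _, j in scored[:k]]


def two_level_select(rows, n_blocks, k_local, k_final):
    """Level 1: pick k_local features inside each vertical block, in parallel.
    Level 2: re-rank the surviving features together and keep k_final."""
    blocks = vertical_partitions(len(rows[0]), n_blocks)
    with ThreadPoolExecutor() as pool:
        local = pool.map(lambda b: top_k_by_variance(rows, b, k_local), blocks)
        survivors = [j for picked in local for j in picked]
    return top_k_by_variance(rows, survivors, k_final)


# Toy data: 4 samples x 4 features; features 0 and 2 are constant.
samples = [
    [1, 0, 5, 2],
    [1, 10, 5, 4],
    [1, 20, 5, 6],
    [1, 30, 5, 8],
]
selected = two_level_select(samples, n_blocks=2, k_local=1, k_final=2)
sample_blocks = horizontal_partitions(samples, n_blocks=2)
```

In this sketch the horizontal blocks are only constructed, not used for scoring. In a distributed setting they would be the units of sample data handed to separate workers, while the vertical blocks bound how many features each worker has to score.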