A new data analysis method based on feature linear combination

Bibliographic Details
Published in: Journal of biomedical informatics, 2019-06, Vol. 94, p. 103173-103173, Article 103173
Main Authors: Lin, Xiaohui; Zhang, Yanhui; Li, Chao; Wang, Jue; Luo, Ping; Zhou, Huiwei
Format: Article
Language: English
Description
Summary:
• A method is proposed to define efficient classification rules by pairwise feature evaluation.
• An SVM with a linear kernel is used to find the unique best linear relationship for each feature pair.
• The k > 0 top scoring pairs are selected to build an ensemble classifier.
• Experiments on public datasets and one metabolomics dataset showed the validity of LC-k-TSP.

In biological data, feature relationships are complex and diverse, and they can reflect physiological and pathological changes. Defining simple and efficient classification rules based on feature relationships helps to discriminate between conditions and to study disease mechanisms. The popular data analysis method k top scoring pairs (k-TSP) explores feature relationships by focusing on how the relative level of two features differs between groups, and classifies samples accordingly. To define more efficient classification rules, we propose a new data analysis method based on the linear combination of k > 0 top scoring pairs (LC-k-TSP). LC-k-TSP applies a support vector machine (SVM) to define the best linear relationship of each feature pair, scores feature pairs by the discriminative ability of the corresponding linear combinations, and selects k disjoint top scoring pairs to construct an ensemble classifier. Experiments on twelve public datasets showed the superiority of LC-k-TSP over k-TSP, which evaluates the relationship of every two features in the same way. The experiments also showed that LC-k-TSP achieved accuracy rates similar to those of SVM and random forest (RF). Because LC-k-TSP learns a unique linear combination for each feature pair and defines simple classification rules, its results are easy to interpret in biomedical terms. Finally, we applied LC-k-TSP to hepatocellular carcinoma (HCC) metabolomics data to define simple classification rules for discriminating between liver diseases. It obtained accuracy rates of 89.76% and 89.13% in distinguishing small HCC from hepatic cirrhosis (CIR) and HCC from CIR, respectively, superior to the 87.99% and 80.35% obtained by k-TSP. Hence, defining classification rules based on feature relationships is an effective way to analyze biological data. LC-k-TSP, which evaluates each feature pair by its own best linear relationship, outperforms k-TSP, which evaluates every pair by the same linear relationship. Availability and implementation: http://www.402
ISSN: 1532-0464, 1532-0480
DOI: 10.1016/j.jbi.2019.103173
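
Illustrative sketch (not the authors' implementation): the summary describes LC-k-TSP as fitting a linear SVM to every feature pair, scoring each pair by how well its linear combination separates the classes, and combining k disjoint top scoring pairs into an ensemble. The Python below is one minimal way to read that description. Several details are assumptions the record does not specify: the function names (`fit_linear_svm`, `fit_lc_k_tsp`, `predict`), the plain subgradient-descent solver standing in for a real SVM library, the use of training accuracy as the pair score, majority voting with ties resolved to +1, and the toy data.

```python
from itertools import combinations


def fit_linear_svm(points, labels, epochs=300, lr=0.1, lam=0.01):
    """Fit w1*x1 + w2*x2 + b on 2-D points by batch subgradient descent on the
    L2-regularised hinge loss (assumed stand-in for any linear SVM solver)."""
    w1 = w2 = b = 0.0
    n = len(points)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in zip(points, labels):
            if y * (w1 * x1 + w2 * x2 + b) < 1:  # sample violates the margin
                g1 -= y * x1
                g2 -= y * x2
                gb -= y
        w1 -= lr * (lam * w1 + g1 / n)
        w2 -= lr * (lam * w2 + g2 / n)
        b -= lr * gb / n
    return w1, w2, b


def predict_pair(model, sample):
    """Classify one sample (+1 / -1) with a single pair's linear rule."""
    (i, j), (w1, w2, b) = model
    return 1 if w1 * sample[i] + w2 * sample[j] + b >= 0 else -1


def fit_lc_k_tsp(X, y, k):
    """Score every feature pair, then keep up to k disjoint top scoring pairs."""
    scored = []
    for i, j in combinations(range(len(X[0])), 2):
        weights = fit_linear_svm([(row[i], row[j]) for row in X], y)
        model = ((i, j), weights)
        # Assumed score: training accuracy of the pair's linear combination.
        acc = sum(predict_pair(model, r) == t for r, t in zip(X, y)) / len(y)
        scored.append((acc, model))
    scored.sort(key=lambda item: -item[0])  # stable sort: ties keep index order
    chosen, used = [], set()
    for _, model in scored:
        i, j = model[0]
        if i not in used and j not in used:  # enforce disjoint pairs
            chosen.append(model)
            used.update((i, j))
        if len(chosen) == k:
            break
    return chosen


def predict(ensemble, sample):
    """Majority vote over the selected pairs (assumed rule; ties go to +1)."""
    votes = sum(predict_pair(m, sample) for m in ensemble)
    return 1 if votes >= 0 else -1


# Toy data (made up): features 0 and 1 separate the classes through x0 - x1,
# while features 2 and 3 carry no deliberate class signal.
X_toy = [
    [3, 1, 0.5, 2.0], [5, 3, 1.5, 0.3], [2, 0, 0.2, 1.1], [6, 4, 1.0, 0.7],
    [1, 3, 0.6, 1.9], [3, 5, 1.4, 0.4], [0, 2, 0.3, 1.0], [4, 6, 0.9, 0.8],
]
y_toy = [1, 1, 1, 1, -1, -1, -1, -1]

ensemble = fit_lc_k_tsp(X_toy, y_toy, k=1)
print([pair for pair, _ in ensemble])
```

Each selected pair keeps its own weights `(w1, w2, b)`, which is the contrast the summary draws with k-TSP, where every pair is evaluated in the same way. In practice an established SVM library and cross-validated scoring would likely replace the hand-written solver and the training-accuracy score used here.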