Unsupervised ensemble learning for genome sequencing

Bibliographic Details
Published in: Pattern Recognition, 2022-09, Vol. 129, p. 108721, Article 108721
Main Authors: Pagès-Zamora, Alba; Ochoa, Idoia; Cavero, Gonzalo Ruiz; Villalvilla-Ornat, Pol
Format: Article
Language: English
Description
Summary:
• The variant calling step in next-generation sequencing technologies is formulated as a classification problem.
• An unsupervised ensemble classification method is proposed as a variant caller for DNA sequencing.
• An EM-based variant calling algorithm is presented that decides by estimating the maximum a posteriori class.
• The number of classes to be decided is greater than the number of different labels that are observed.
• Experimental results with real human DNA sequencing data support the approach.

Unsupervised ensemble learning refers to methods devised for a particular task that combine data provided by decision learners, taking into account their reliability, which is usually inferred from the data. Here, the variant calling step of next-generation sequencing technologies is formulated as an unsupervised ensemble classification problem. A variant calling algorithm based on the expectation-maximization (EM) algorithm is further proposed; it estimates the maximum a posteriori decision among a number of classes larger than the number of different labels provided by the learners. Experimental results with real human DNA sequencing data show that the proposed algorithm is competitive with state-of-the-art variant callers such as GATK, HTSLIB, and Platypus.
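To make the idea in the summary concrete, the following is a minimal sketch of a generic EM-based unsupervised ensemble classifier in the Dawid-Skene style. It is not the authors' variant caller, whose details are not given in this record. The function name `em_ensemble_map`, its parameters, and the random initialization used when the number of classes differs from the number of labels are all assumptions made for illustration.

```python
import numpy as np

def em_ensemble_map(labels, n_classes, n_labels, n_iter=50, seed=0, eps=1e-9):
    """Hypothetical sketch: EM over learner confusion matrices, then a MAP decision.

    labels: (n_items, n_learners) integer array with values in [0, n_labels).
    n_classes may exceed n_labels, as in the setting the summary describes.
    """
    labels = np.asarray(labels)
    n_items, n_learners = labels.shape
    onehot = np.eye(n_labels)[labels]  # shape (items, learners, labels)

    if n_classes == n_labels:
        # Start from vote fractions so class k lines up with label k.
        post = onehot.mean(axis=1)
    else:
        # Assumption: random soft assignment when classes and labels differ.
        rng = np.random.default_rng(seed)
        post = rng.dirichlet(np.ones(n_classes), size=n_items)

    for _ in range(n_iter):
        # M-step: class priors and per-learner confusion matrices.
        prior = post.mean(axis=0) + eps
        prior /= prior.sum()
        # conf[m, k, l] estimates P(learner m emits label l | true class k).
        conf = np.einsum("ik,iml->mkl", post, onehot) + eps
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: class posteriors for every item, computed in the log domain.
        log_post = np.log(prior) + np.einsum("iml,mkl->ik", onehot, np.log(conf))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    # MAP decision: the class with the largest posterior for each item.
    return post.argmax(axis=1), post, conf

# Toy data: learners 0 and 1 agree with the truth, learner 2 always disagrees.
truth = np.array([0, 0, 0, 0, 1, 1, 1, 1])
votes = np.column_stack([truth, truth, 1 - truth])
decisions, posteriors, confusions = em_ensemble_map(votes, n_classes=2, n_labels=2)
```

In this sketch the reliability of each learner is inferred from the data through its confusion matrix, so a learner that is consistently wrong still carries information. When `n_classes` differs from `n_labels`, the recovered class indices are identifiable only up to a relabeling unless additional structure is imposed, which this sketch does not attempt.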
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2022.108721