Machine learning analysis of genomic signatures provides evidence of associations between Wuhan 2019-nCoV and bat betacoronaviruses

Bibliographic Details
Published in: bioRxiv 2020-02
Main Authors: Gurjit S. Randhawa, Maximillian P. M. Soltysiak, Hadi El Roz, Camila P. E. de Souza, Kathleen A. Hill, Lila Kari
Format: Article
Language: English
Description
Summary: As of February 8, 2020, the 2019 Novel Coronavirus (2019-nCoV) had spread to 29 countries, with 725 deaths and more than 34,000 confirmed cases. 2019-nCoV is being compared to the infamous SARS coronavirus, which, between November 2002 and July 2003, resulted in 8098 confirmed cases worldwide and 774 deaths, a 9.6% death rate. Though 2019-nCoV has a death rate of 2% as of February 8, the 34,963 confirmed cases in a few weeks (December 8, 2019 to February 8, 2020) are alarming, and cases are likely under-reported given the comparatively longer incubation period. Such outbreaks demand elucidation of the taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper proposes the use of a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole 2019-nCoV genomes. Specifically, we classify 2019-nCoV using MLDSP and MLDSP-GUI, alignment-free methods that use Machine Learning (ML) and Digital Signal Processing (DSP) for genome analyses. These tools are used to analyze a large dataset of unique viral genomic sequences, totalling 61.8 million bp, with a "decision tree" approach for successive refinements of taxonomic classification. Our results support the hypothesis of a bat origin and classify 2019-nCoV as Sarbecovirus, within Betacoronavirus. We use Spearman's rank correlation analysis to confirm the relatedness of the 2019-nCoV sequences to the known genera of the family Coronaviridae and the known sub-genera of the genus Betacoronavirus. Our method achieves high levels of classification accuracy and discovers the most relevant relationships among over 5,000 viral genomes within seconds, ab initio, using raw DNA sequence data alone, without any specialized biological knowledge, training, or gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
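The summary describes an alignment-free comparison that combines digital signal processing with machine learning but does not give the details. The following is a minimal sketch of the general idea only, not the MLDSP implementation: each sequence is encoded as a numeric signal, transformed to a Fourier magnitude spectrum, and compared by a correlation-based distance. The purine/pyrimidine encoding, the zero-padding to a common length, the Pearson-based distance, and all function names are assumptions made for illustration.

```python
import cmath
import math

# Assumed encoding (purines -> -1, pyrimidines -> +1); the record does not
# state which numerical representation the authors use.
PP_MAP = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}


def to_signal(seq, length):
    """Encode a DNA string as numbers, truncated or zero-padded to `length`."""
    values = [PP_MAP.get(base, 0.0) for base in seq.upper()[:length]]
    return values + [0.0] * (length - len(values))


def magnitude_spectrum(signal):
    """Magnitudes of a naive O(n^2) discrete Fourier transform (toy inputs only)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # assumption: treat a flat spectrum as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def signature_distance(seq_a, seq_b):
    """Distance in [0, 1] between two sequences; no alignment is performed."""
    length = max(len(seq_a), len(seq_b))
    spec_a = magnitude_spectrum(to_signal(seq_a, length))
    spec_b = magnitude_spectrum(to_signal(seq_b, length))
    return (1.0 - pearson(spec_a, spec_b)) / 2.0


if __name__ == "__main__":
    # Made-up toy sequences, not real viral genomes.
    print(signature_distance("ATGCGTACGTTAGC", "ATGCGTACGTTAGG"))
    print(signature_distance("ATGCGTACGTTAGC", "TTTTTTTTAAAAAA"))
```

In a pipeline of the kind the summary describes, a matrix of such pairwise distances would presumably serve as the input features for supervised classifiers. Whole genomes would need a fast Fourier transform rather than the quadratic-time loop shown here.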
DOI: 10.1101/2020.02.03.932350