ICD2Vec: Mathematical representation of diseases

Bibliographic Details
Published in: Journal of biomedical informatics, 2023-05, Vol. 141, p. 104361-104361, Article 104361
Main Authors: Lee, Yeong Chan; Jung, Sang-Hyuk; Kumar, Aman; Shim, Injeong; Song, Minku; Kim, Min Seo; Kim, Kyunga; Myung, Woojae; Park, Woong-Yang; Won, Hong-Hee
Format: Article
Language:English
Description
Summary:
• We developed a universal framework to convert ICD codes into mathematical vectors that capture semantic relationships among diseases.
• We evaluated our algorithm by analogical reasoning and by comparing ICD2Vec results with biological relationships.
• We further developed an individual risk score based on ICD2Vec and evaluated the prediction performance of the score using two large datasets.

The International Classification of Diseases (ICD) codes represent the global standard for reporting disease conditions. The current ICD codes connote direct human-defined relationships among diseases in a hierarchical tree structure. Representing the ICD codes as mathematical vectors helps to capture nonlinear relationships in medical ontologies across diseases.

We propose a universally applicable framework called “ICD2Vec” designed to provide mathematical representations of diseases by encoding corresponding information. First, we present the arithmetical and semantic relationships between diseases by mapping composite vectors for symptoms or diseases to the most similar ICD codes. Second, we investigate the validity of ICD2Vec by comparing the biological relationships and cosine similarities among the vectorized ICD codes. Third, we propose a new risk score called IRIS, derived from ICD2Vec, and demonstrate its clinical utility with large cohorts from the UK and South Korea.

Semantic compositionality was qualitatively confirmed between descriptions of symptoms and ICD2Vec. For example, the diseases most similar to COVID-19 were found to be the common cold (ICD-10: J00), unspecified viral hemorrhagic fever (ICD-10: A99), and smallpox (ICD-10: B03). We show significant associations between the cosine similarities derived from ICD2Vec and the biological relationships using disease-to-disease pairs.
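The idea of mapping a composite vector to its most similar ICD codes can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors, the symptom names, and the helper functions `compose` and `most_similar` are all hypothetical and were chosen only to show how a cosine-similarity ranking works. Summing the symptom vectors is likewise an assumption about how a composite vector might be built.

```python
import math

# Toy 3-dimensional embeddings keyed by ICD-10 code. The numbers are invented
# for illustration and are NOT taken from ICD2Vec.
toy_vectors = {
    "J00": [0.9, 0.1, 0.0],  # common cold
    "A99": [0.6, 0.5, 0.1],  # unspecified viral hemorrhagic fever
    "B03": [0.5, 0.6, 0.2],  # smallpox
    "I25": [0.0, 0.1, 0.9],  # chronic ischemic heart disease
}

# Hypothetical symptom vectors living in the same toy space.
symptom_vectors = {
    "fever": [0.7, 0.3, 0.0],
    "cough": [0.8, 0.1, 0.1],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compose(vectors):
    """Element-wise sum, one simple (assumed) way to form a composite vector."""
    return [sum(components) for components in zip(*vectors)]


def most_similar(query, table, top_k=3):
    """Return the top_k (code, similarity) pairs, most similar first."""
    scored = [(code, cosine_similarity(query, vec)) for code, vec in table.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Compose "fever" + "cough" and rank the toy ICD codes against the result.
query = compose([symptom_vectors["fever"], symptom_vectors["cough"]])
ranking = most_similar(query, toy_vectors)
for code, score in ranking:
    print(code, round(score, 3))
```

With these toy values the respiratory-like code J00 ranks first and the cardiac code I25 falls outside the top three. That outcome reflects only how the illustrative vectors were constructed and says nothing about the published results.
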
Furthermore, we observed significant adjusted hazard ratios (HRs) and areas under the receiver operating characteristic curve (AUROCs) for the association between IRIS and the risk of eight diseases. For instance, a higher IRIS for coronary artery disease (CAD) corresponded to a higher probability of incident CAD (HR: 2.15 [95% CI 2.02–2.28]; AUROC: 0.587 [95% CI 0.583–0.591]). We identified individuals at substantially increased risk of CAD using IRIS and 10-year atherosclerotic cardiovascular disease risk (adjusted HR: 4.26 [95% CI 3.59–5.05]). ICD2Vec, a proposed universal framework for converting qualitatively measured ICD codes into quantitative vectors containing semantic relat
ISSN:1532-0464
1532-0480
DOI:10.1016/j.jbi.2023.104361