Identification of clinical factors related to prediction of alcohol use disorder from electronic health records using feature selection methods

Bibliographic Details
Published in: BMC medical informatics and decision making 2022-11, Vol.22 (1), p.304-25, Article 304
Main Authors: Ebrahimi, Ali, Wiil, Uffe Kock, Naemi, Amin, Mansourvar, Marjan, Andersen, Kjeld, Nielsen, Anette Søgaard
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c530t-33530bb0c3fad8627aabebf17d9063939dc68d853384ebf6ee56ef431106e53a3
cites cdi_FETCH-LOGICAL-c530t-33530bb0c3fad8627aabebf17d9063939dc68d853384ebf6ee56ef431106e53a3
container_end_page 25
container_issue 1
container_start_page 304
container_title BMC medical informatics and decision making
container_volume 22
creator Ebrahimi, Ali
Wiil, Uffe Kock
Naemi, Amin
Mansourvar, Marjan
Andersen, Kjeld
Nielsen, Anette Søgaard
description High dimensionality in electronic health records (EHRs) poses a significant computational problem for any systematic search for predictive, diagnostic, or prognostic patterns. Feature selection (FS) methods have been reported to be effective both in reducing the number of features and in identifying risk factors related to the prediction of clinical disorders. This paper examines the prediction of patients with alcohol use disorder (AUD) using machine learning (ML) and attempts to identify risk factors related to the diagnosis of AUD. The FS framework consists of two operational levels: base selectors and ensemble selectors. The first level consists of five FS methods: three filter methods, one wrapper method, and one embedded method. The outputs of the base selectors are aggregated to form four ensemble FS methods. The outputs of the FS methods were then fed into three ML algorithms, support vector machine (SVM), K-nearest neighbor (KNN), and random forest (RF), to compare them and identify the best feature subset for the prediction of AUD from EHRs. In terms of feature reduction, the embedded FS method reduced the number of features from 361 to 131. In terms of classification performance, RF based on the 272 features selected by our proposed ensemble method (Union FS) achieved the highest accuracy in predicting patients with AUD, 96%, and outperformed all other models in terms of AUROC, AUPRC, Precision, Recall, and F1-Score. Considering the limitations of embedded and wrapper methods, the best overall performance was achieved by our proposed Union Filter FS, which reduced the number of features to 223 and improved Precision, Recall, and F1-Score in RF from 0.77, 0.65, and 0.71 to 0.87, 0.81, and 0.84, respectively. Our findings indicate that, besides gender, age, and length of stay at the hospital, diagnoses related to the digestive organs, bones, muscles and connective tissue, and the nervous system are important clinical factors related to the prediction of patients with AUD.
Our proposed FS method significantly improved classification performance. It identified clinical factors related to the prediction of AUD from EHRs, which can help clinical staff identify and treat AUD patients and can improve medical knowledge of the AUD condition. Moreover, the diversity of features among female and male patients, as well as gender disparity, was investigated using FS methods and ML techniques.
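The aggregation step described in the abstract, in which base selector outputs are combined into an ensemble feature subset, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function `union_select`, the selector names, and the feature names are all hypothetical, and the record does not say which filter, wrapper, or embedded methods were used. It only shows how a "Union FS" over all base selectors and a "Union Filter FS" over the filter methods alone could be formed as set unions.

```python
from typing import Dict, List, Set


def union_select(base_outputs: Dict[str, Set[str]], members: List[str]) -> List[str]:
    """Return the sorted union of the features chosen by the named base selectors."""
    selected: Set[str] = set()
    for name in members:
        selected |= base_outputs[name]
    return sorted(selected)


# Hypothetical base-selector outputs. Selector and feature names are
# illustrative assumptions, not values taken from the paper.
base_outputs = {
    "filter_a": {"age", "gender", "length_of_stay"},
    "filter_b": {"age", "digestive_dx"},
    "filter_c": {"gender", "nervous_system_dx"},
    "wrapper": {"age", "musculoskeletal_dx"},
    "embedded": {"length_of_stay", "digestive_dx"},
}

# Union over all five base selectors (analogous to "Union FS").
union_all = union_select(base_outputs, list(base_outputs))

# Union over the three filter methods only (analogous to "Union Filter FS").
union_filter = union_select(base_outputs, ["filter_a", "filter_b", "filter_c"])

print(union_all)
print(union_filter)
```

A union keeps every feature that any member selector considers relevant, so it tends to yield a larger subset than any single selector. That is at least consistent with the reported counts, where the Union FS subset (272 features) is larger than the embedded method's subset (131 features).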
doi_str_mv 10.1186/s12911-022-02051-w
format article
identifier EISSN: 1472-6947
identifier PMID: 36424597
publisher England: BioMed Central Ltd
addtitle BMC Med Inform Decis Mak
date 2022-11-23
rights 2022. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”).
oa free_for_read
linktopdf https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9686074/pdf/
backlink https://www.ncbi.nlm.nih.gov/pubmed/36424597
fulltext fulltext
identifier ISSN: 1472-6947
ispartof BMC medical informatics and decision making, 2022-11, Vol.22 (1), p.304-25, Article 304
issn 1472-6947
1472-6947
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_7c57513cdc254c9a96531e2a3eba456c
source Open Access: PubMed Central; Publicly Available Content (ProQuest); Coronavirus Research Database
subjects Alcohol use
Alcohol use disorder
Alcoholism
Alcoholism - diagnosis
Algorithms
Audits
Bones
Classification
Clinical factor identification
Cluster Analysis
Computer applications
Connective tissues
Datasets
Diagnosis
Electronic Health Records
Electronic medical records
Electronic records
Feature selection
Female
Gender
Gender disparity
Health informatics
Hospitals
Humans
Identification
Machine Learning
Male
Medical informatics
Medical records
Methods
Muscles
Orthopedics
Patients
Performance evaluation
Predictions
Recall
Reduction
Risk analysis
Risk factors
Selectors
Support Vector Machine
Support vector machines
title Identification of clinical factors related to prediction of alcohol use disorder from electronic health records using feature selection methods
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T02%3A15%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Identification%20of%20clinical%20factors%20related%20to%20prediction%20of%20alcohol%20use%20disorder%20from%20electronic%20health%20records%20using%20feature%20selection%20methods&rft.jtitle=BMC%20medical%20informatics%20and%20decision%20making&rft.au=Ebrahimi,%20Ali&rft.date=2022-11-23&rft.volume=22&rft.issue=1&rft.spage=304&rft.epage=25&rft.pages=304-25&rft.artnum=304&rft.issn=1472-6947&rft.eissn=1472-6947&rft_id=info:doi/10.1186/s12911-022-02051-w&rft_dat=%3Cgale_doaj_%3EA727807889%3C/gale_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c530t-33530bb0c3fad8627aabebf17d9063939dc68d853384ebf6ee56ef431106e53a3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2755585778&rft_id=info:pmid/36424597&rft_galeid=A727807889&rfr_iscdi=true