Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures
The complexity and high dimensionality of healthcare data present substantial challenges in building machine learning (ML) models, given the large number of variables such as patient demographics and medical history. Effective feature selection is crucial to address issues such as increased computat...
Published in: | IEEE access 2024-01, Vol.12, p.1-1 |
---|---|
Main Authors: | Mayya, Veena; King, Christian; Vu, Giang T.; Gurupur, Varadraj |
Format: | Article |
Language: | English |
cited_by | |
---|---|
cites | cdi_FETCH-LOGICAL-c289t-40ef36479bdb303fd9023c9a38726e677b0e1ed42f3f191312cbeadbb1de424d3 |
container_end_page | 1 |
container_issue | |
container_start_page | 1 |
container_title | IEEE access |
container_volume | 12 |
creator | Mayya, Veena; King, Christian; Vu, Giang T.; Gurupur, Varadraj |
description | The complexity and high dimensionality of healthcare data, which include a large number of variables such as patient demographics and medical history, present substantial challenges in building machine learning (ML) models. Effective feature selection is crucial to address issues such as increased computational resource requirements, longer training times, overfitting, and reduced model interpretability. This study evaluates a range of feature selection methods to identify the most impactful features for predicting dental expenditures using publicly available Medical Expenditure Panel Survey (MEPS) data. Sixteen ML models are assessed to determine the top-performing model, after which state-of-the-art filter, wrapper, embedded, and hybrid feature selection techniques are applied to optimize the feature set. The highest performance, in terms of the coefficient of determination (R²), is achieved using a hybrid feature selection method that combines the mutual information filter with the embedded features from the CatBoost regressor. The results indicate that the proposed system is suitable for real-time deployment even with reduced features, providing potential benefits such as minimizing the need for irrelevant and difficult-to-obtain features. Moreover, automated feature selection significantly enhances model performance, yielding an R² score of 0.86, compared with 0.72 achieved with carefully selected manual features. Additionally, to enhance the interpretability of the top-performing ML model, explanatory visualizations are employed to examine the influence of key features on predicting dental expenditures. |
doi_str_mv | 10.1109/ACCESS.2024.3482192 |
format | article |
fullrecord | <record><control><sourceid>proquest_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_500680133f6e4736a7c38b33a921d7c0</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10720013</ieee_id><doaj_id>oai_doaj_org_article_500680133f6e4736a7c38b33a921d7c0</doaj_id><sourcerecordid>3120654091</sourcerecordid><originalsourceid>FETCH-LOGICAL-c289t-40ef36479bdb303fd9023c9a38726e677b0e1ed42f3f191312cbeadbb1de424d3</originalsourceid><addsrcrecordid>eNpNkc9q3DAYxEVoIWGTJ2gPgp691b-1rd4Wx2kCWwp1exay9HmjxbG2khaSV-hTV46TEF0kBv1mBgahT5SsKSXy67Zp2q5bM8LEmouaUcnO0AWjpSz4hpcf3r3P0VWMB5JPnaVNdYH-tQ9HF5zRI-7SyT5hP-Ab0OkUAHcwgknOT_gHpHtvI3YT_gX7ADHO6uAD3umwh6LLPOBb0GO6Nzqj1zrpb3iLGx3h1XjCbUzuQSc37fE1TClnto9HmKyb4-Il-jjoMcLVy71Cf27a381tsfv5_a7Z7grDapkKQWDgpahkb3tO-GAlYdxIzeuKlVBWVU-AghVs4AOVlFNmetC276kFwYTlK3S3-FqvD-oYcqXwpLx26lnwYa90SM6MoDaElDWhnA8liIqXujK87jnXklFbGZK9vixex-D_niAmdfCnMOX6KgeTciNIrrBCfPllgo8xwPCWSomaN1TLhmreUL1smKnPC-UA4B1RMTJX-g_YWJfk</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3120654091</pqid></control><display><type>article</type><title>Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures</title><source>IEEE Open Access Journals</source><creator>Mayya, Veena ; King, Christian ; Vu, Giang T. ; Gurupur, Varadraj</creator><creatorcontrib>Mayya, Veena ; King, Christian ; Vu, Giang T. ; Gurupur, Varadraj</creatorcontrib><description>The complexity and high dimensionality of healthcare data present substantial challenges in building machine learning (ML) models, given the large number of variables such as patient demographics and medical history. Effective feature selection is crucial to address issues such as increased computational resource, longer training times, overfitting, and reduced model interpretability, etc. 
This study evaluates a range of feature selection methods to identify the most impactful features for predicting dental expenditures using publicly available Medical Expenditure Panel Survey (MEPS) data. Sixteen ML models are assessed to determine the top performing model, after which state-of-the-art filter, wrapper, embedded, and hybrid feature selection techniques are applied to optimize the feature set. The highest performance, in terms of coefficient of determination (R2), is achieved using a hybrid feature selection method that combines the mutual information filter with the embedded features from the CatBoost regressor. The results indicate that the proposed system is suitable for real-time deployment even with reduced features, providing potential benefits such as minimizing the need for irrelevant and difficult-to-obtain features. Moreover, automated feature selection significantly enhances model performance, yielding a R2 score of 0.86, compared to the score 0.72 achieved with carefully selected manual features. 
Additionally, to enhance the interpretability of the top-performing ML model, explanatory visualizations are employed to examine the influence of key features on predicting dental expenditures.</description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2024.3482192</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Biological system modeling ; Clinical decision support systems ; Clinical diagnosis ; Computational modeling ; Demographic variables ; Dental care ; Dental visits ; Dentistry ; Expenditures ; Feature extraction ; Feature selection ; Health care ; Identification methods ; Insurance ; Machine learning ; Mathematical models ; Medical services ; Performance prediction ; Predictive models ; Real time ; Surveys ; Training</subject><ispartof>IEEE access, 2024-01, Vol.12, p.1-1</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2024</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c289t-40ef36479bdb303fd9023c9a38726e677b0e1ed42f3f191312cbeadbb1de424d3</cites><orcidid>0000-0003-1596-9298 ; 0000-0003-1157-5734 ; 0000-0003-1159-6973 ; 0000-0002-8091-5053</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10720013$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,27633,27924,27925,54933</link.rule.ids></links><search><creatorcontrib>Mayya, Veena</creatorcontrib><creatorcontrib>King, Christian</creatorcontrib><creatorcontrib>Vu, Giang T.</creatorcontrib><creatorcontrib>Gurupur, Varadraj</creatorcontrib><title>Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures</title><title>IEEE access</title><addtitle>Access</addtitle><description>The complexity and high dimensionality of healthcare data present substantial challenges in building machine learning (ML) models, given the large number of variables such as patient demographics and medical history. Effective feature selection is crucial to address issues such as increased computational resource, longer training times, overfitting, and reduced model interpretability, etc. This study evaluates a range of feature selection methods to identify the most impactful features for predicting dental expenditures using publicly available Medical Expenditure Panel Survey (MEPS) data. Sixteen ML models are assessed to determine the top performing model, after which state-of-the-art filter, wrapper, embedded, and hybrid feature selection techniques are applied to optimize the feature set. 
The highest performance, in terms of coefficient of determination (R2), is achieved using a hybrid feature selection method that combines the mutual information filter with the embedded features from the CatBoost regressor. The results indicate that the proposed system is suitable for real-time deployment even with reduced features, providing potential benefits such as minimizing the need for irrelevant and difficult-to-obtain features. Moreover, automated feature selection significantly enhances model performance, yielding a R2 score of 0.86, compared to the score 0.72 achieved with carefully selected manual features. Additionally, to enhance the interpretability of the top-performing ML model, explanatory visualizations are employed to examine the influence of key features on predicting dental expenditures.</description><subject>Biological system modeling</subject><subject>Clinical decision support systems</subject><subject>Clinical diagnosis</subject><subject>Computational modeling</subject><subject>Demographic variables</subject><subject>Dental care</subject><subject>Dental visits</subject><subject>Dentistry</subject><subject>Expenditures</subject><subject>Feature extraction</subject><subject>Feature selection</subject><subject>Health care</subject><subject>Identification methods</subject><subject>Insurance</subject><subject>Machine learning</subject><subject>Mathematical models</subject><subject>Medical services</subject><subject>Performance prediction</subject><subject>Predictive models</subject><subject>Real 
time</subject><subject>Surveys</subject><subject>Training</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>DOA</sourceid><recordid>eNpNkc9q3DAYxEVoIWGTJ2gPgp691b-1rd4Wx2kCWwp1exay9HmjxbG2khaSV-hTV46TEF0kBv1mBgahT5SsKSXy67Zp2q5bM8LEmouaUcnO0AWjpSz4hpcf3r3P0VWMB5JPnaVNdYH-tQ9HF5zRI-7SyT5hP-Ab0OkUAHcwgknOT_gHpHtvI3YT_gX7ADHO6uAD3umwh6LLPOBb0GO6Nzqj1zrpb3iLGx3h1XjCbUzuQSc37fE1TClnto9HmKyb4-Il-jjoMcLVy71Cf27a381tsfv5_a7Z7grDapkKQWDgpahkb3tO-GAlYdxIzeuKlVBWVU-AghVs4AOVlFNmetC276kFwYTlK3S3-FqvD-oYcqXwpLx26lnwYa90SM6MoDaElDWhnA8liIqXujK87jnXklFbGZK9vixex-D_niAmdfCnMOX6KgeTciNIrrBCfPllgo8xwPCWSomaN1TLhmreUL1smKnPC-UA4B1RMTJX-g_YWJfk</recordid><startdate>20240101</startdate><enddate>20240101</enddate><creator>Mayya, Veena</creator><creator>King, Christian</creator><creator>Vu, Giang T.</creator><creator>Gurupur, Varadraj</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0003-1596-9298</orcidid><orcidid>https://orcid.org/0000-0003-1157-5734</orcidid><orcidid>https://orcid.org/0000-0003-1159-6973</orcidid><orcidid>https://orcid.org/0000-0002-8091-5053</orcidid></search><sort><creationdate>20240101</creationdate><title>Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures</title><author>Mayya, Veena ; King, Christian ; Vu, Giang T. 
; Gurupur, Varadraj</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c289t-40ef36479bdb303fd9023c9a38726e677b0e1ed42f3f191312cbeadbb1de424d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Biological system modeling</topic><topic>Clinical decision support systems</topic><topic>Clinical diagnosis</topic><topic>Computational modeling</topic><topic>Demographic variables</topic><topic>Dental care</topic><topic>Dental visits</topic><topic>Dentistry</topic><topic>Expenditures</topic><topic>Feature extraction</topic><topic>Feature selection</topic><topic>Health care</topic><topic>Identification methods</topic><topic>Insurance</topic><topic>Machine learning</topic><topic>Mathematical models</topic><topic>Medical services</topic><topic>Performance prediction</topic><topic>Predictive models</topic><topic>Real time</topic><topic>Surveys</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Mayya, Veena</creatorcontrib><creatorcontrib>King, Christian</creatorcontrib><creatorcontrib>Vu, Giang T.</creatorcontrib><creatorcontrib>Gurupur, Varadraj</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Xplore</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer 
and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Mayya, Veena</au><au>King, Christian</au><au>Vu, Giang T.</au><au>Gurupur, Varadraj</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2024-01-01</date><risdate>2024</risdate><volume>12</volume><spage>1</spage><epage>1</epage><pages>1-1</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract>The complexity and high dimensionality of healthcare data present substantial challenges in building machine learning (ML) models, given the large number of variables such as patient demographics and medical history. Effective feature selection is crucial to address issues such as increased computational resource, longer training times, overfitting, and reduced model interpretability, etc. This study evaluates a range of feature selection methods to identify the most impactful features for predicting dental expenditures using publicly available Medical Expenditure Panel Survey (MEPS) data. Sixteen ML models are assessed to determine the top performing model, after which state-of-the-art filter, wrapper, embedded, and hybrid feature selection techniques are applied to optimize the feature set. The highest performance, in terms of coefficient of determination (R2), is achieved using a hybrid feature selection method that combines the mutual information filter with the embedded features from the CatBoost regressor. 
The results indicate that the proposed system is suitable for real-time deployment even with reduced features, providing potential benefits such as minimizing the need for irrelevant and difficult-to-obtain features. Moreover, automated feature selection significantly enhances model performance, yielding a R2 score of 0.86, compared to the score 0.72 achieved with carefully selected manual features. Additionally, to enhance the interpretability of the top-performing ML model, explanatory visualizations are employed to examine the influence of key features on predicting dental expenditures.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2024.3482192</doi><tpages>1</tpages><orcidid>https://orcid.org/0000-0003-1596-9298</orcidid><orcidid>https://orcid.org/0000-0003-1157-5734</orcidid><orcidid>https://orcid.org/0000-0003-1159-6973</orcidid><orcidid>https://orcid.org/0000-0002-8091-5053</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2169-3536 |
ispartof | IEEE access, 2024-01, Vol.12, p.1-1 |
issn | 2169-3536 2169-3536 |
language | eng |
recordid | cdi_doaj_primary_oai_doaj_org_article_500680133f6e4736a7c38b33a921d7c0 |
source | IEEE Open Access Journals |
subjects | Biological system modeling; Clinical decision support systems; Clinical diagnosis; Computational modeling; Demographic variables; Dental care; Dental visits; Dentistry; Expenditures; Feature extraction; Feature selection; Health care; Identification methods; Insurance; Machine learning; Mathematical models; Medical services; Performance prediction; Predictive models; Real time; Surveys; Training |
title | Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T16%3A37%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Empirical%20Study%20of%20Feature%20Selection%20Methods%20in%20Regression%20for%20Large-Scale%20Healthcare%20Data:%20A%20Case%20Study%20on%20Estimating%20Dental%20Expenditures&rft.jtitle=IEEE%20access&rft.au=Mayya,%20Veena&rft.date=2024-01-01&rft.volume=12&rft.spage=1&rft.epage=1&rft.pages=1-1&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2024.3482192&rft_dat=%3Cproquest_doaj_%3E3120654091%3C/proquest_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c289t-40ef36479bdb303fd9023c9a38726e677b0e1ed42f3f191312cbeadbb1de424d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3120654091&rft_id=info:pmid/&rft_ieee_id=10720013&rfr_iscdi=true |
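
The description above reports model quality as the coefficient of determination (R²), with 0.86 for automated feature selection and 0.72 for manually selected features. As a reminder of what that metric measures, the following is a minimal sketch of the standard definition, R² = 1 - SS_res / SS_tot. The function name `r_squared` and the sample values are illustrative assumptions; they are not taken from the article or from the MEPS data, and this is not the authors' evaluation code.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")

    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean of y_true.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Made-up expenditure-like values, purely to exercise the function.
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
print(round(r_squared(actual, predicted), 4))
```

A perfect model scores 1.0, and a model that always predicts the mean of the observed values scores 0.0, so the reported move from 0.72 to 0.86 means a larger share of the variance in the target is accounted for by the model.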