
Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures

The complexity and high dimensionality of healthcare data present substantial challenges in building machine learning (ML) models, given the large number of variables such as patient demographics and medical history. Effective feature selection is crucial to address issues such as increased computat...

Bibliographic Details
Published in: IEEE Access 2024-01, Vol.12, p.1-1
Main Authors: Mayya, Veena, King, Christian, Vu, Giang T., Gurupur, Varadraj
Format: Article
Language: English
cited_by
cites cdi_FETCH-LOGICAL-c289t-40ef36479bdb303fd9023c9a38726e677b0e1ed42f3f191312cbeadbb1de424d3
container_end_page 1
container_issue
container_start_page 1
container_title IEEE access
container_volume 12
creator Mayya, Veena
King, Christian
Vu, Giang T.
Gurupur, Varadraj
description The complexity and high dimensionality of healthcare data present substantial challenges in building machine learning (ML) models, given the large number of variables such as patient demographics and medical history. Effective feature selection is crucial to address issues such as increased computational resource demands, longer training times, overfitting, and reduced model interpretability. This study evaluates a range of feature selection methods to identify the most impactful features for predicting dental expenditures using publicly available Medical Expenditure Panel Survey (MEPS) data. Sixteen ML models are assessed to determine the top-performing model, after which state-of-the-art filter, wrapper, embedded, and hybrid feature selection techniques are applied to optimize the feature set. The highest performance, in terms of the coefficient of determination (R²), is achieved using a hybrid feature selection method that combines the mutual information filter with the embedded features from the CatBoost regressor. The results indicate that the proposed system is suitable for real-time deployment even with reduced features, providing potential benefits such as minimizing the need for irrelevant and difficult-to-obtain features. Moreover, automated feature selection significantly enhances model performance, yielding an R² score of 0.86, compared with the score of 0.72 achieved with carefully selected manual features. Additionally, to enhance the interpretability of the top-performing ML model, explanatory visualizations are employed to examine the influence of key features on predicting dental expenditures.
doi_str_mv 10.1109/ACCESS.2024.3482192
format article
fullrecord <record><control><sourceid>proquest_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_500680133f6e4736a7c38b33a921d7c0</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10720013</ieee_id><doaj_id>oai_doaj_org_article_500680133f6e4736a7c38b33a921d7c0</doaj_id><sourcerecordid>3120654091</sourcerecordid><originalsourceid>FETCH-LOGICAL-c289t-40ef36479bdb303fd9023c9a38726e677b0e1ed42f3f191312cbeadbb1de424d3</originalsourceid><addsrcrecordid>eNpNkc9q3DAYxEVoIWGTJ2gPgp691b-1rd4Wx2kCWwp1exay9HmjxbG2khaSV-hTV46TEF0kBv1mBgahT5SsKSXy67Zp2q5bM8LEmouaUcnO0AWjpSz4hpcf3r3P0VWMB5JPnaVNdYH-tQ9HF5zRI-7SyT5hP-Ab0OkUAHcwgknOT_gHpHtvI3YT_gX7ADHO6uAD3umwh6LLPOBb0GO6Nzqj1zrpb3iLGx3h1XjCbUzuQSc37fE1TClnto9HmKyb4-Il-jjoMcLVy71Cf27a381tsfv5_a7Z7grDapkKQWDgpahkb3tO-GAlYdxIzeuKlVBWVU-AghVs4AOVlFNmetC276kFwYTlK3S3-FqvD-oYcqXwpLx26lnwYa90SM6MoDaElDWhnA8liIqXujK87jnXklFbGZK9vixex-D_niAmdfCnMOX6KgeTciNIrrBCfPllgo8xwPCWSomaN1TLhmreUL1smKnPC-UA4B1RMTJX-g_YWJfk</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3120654091</pqid></control><display><type>article</type><title>Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures</title><source>IEEE Open Access Journals</source><creator>Mayya, Veena ; King, Christian ; Vu, Giang T. ; Gurupur, Varadraj</creator><creatorcontrib>Mayya, Veena ; King, Christian ; Vu, Giang T. ; Gurupur, Varadraj</creatorcontrib><description>The complexity and high dimensionality of healthcare data present substantial challenges in building machine learning (ML) models, given the large number of variables such as patient demographics and medical history. Effective feature selection is crucial to address issues such as increased computational resource, longer training times, overfitting, and reduced model interpretability, etc. 
This study evaluates a range of feature selection methods to identify the most impactful features for predicting dental expenditures using publicly available Medical Expenditure Panel Survey (MEPS) data. Sixteen ML models are assessed to determine the top performing model, after which state-of-the-art filter, wrapper, embedded, and hybrid feature selection techniques are applied to optimize the feature set. The highest performance, in terms of coefficient of determination (R2), is achieved using a hybrid feature selection method that combines the mutual information filter with the embedded features from the CatBoost regressor. The results indicate that the proposed system is suitable for real-time deployment even with reduced features, providing potential benefits such as minimizing the need for irrelevant and difficult-to-obtain features. Moreover, automated feature selection significantly enhances model performance, yielding a R2 score of 0.86, compared to the score 0.72 achieved with carefully selected manual features. 
Additionally, to enhance the interpretability of the top-performing ML model, explanatory visualizations are employed to examine the influence of key features on predicting dental expenditures.</description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2024.3482192</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Biological system modeling ; Clinical decision support systems ; Clinical diagnosis ; Computational modeling ; Demographic variables ; Dental care ; Dental visits ; Dentistry ; Expenditures ; Feature extraction ; Feature selection ; Health care ; Identification methods ; Insurance ; Machine learning ; Mathematical models ; Medical services ; Performance prediction ; Predictive models ; Real time ; Surveys ; Training</subject><ispartof>IEEE access, 2024-01, Vol.12, p.1-1</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2024</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c289t-40ef36479bdb303fd9023c9a38726e677b0e1ed42f3f191312cbeadbb1de424d3</cites><orcidid>0000-0003-1596-9298 ; 0000-0003-1157-5734 ; 0000-0003-1159-6973 ; 0000-0002-8091-5053</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10720013$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,27633,27924,27925,54933</link.rule.ids></links><search><creatorcontrib>Mayya, Veena</creatorcontrib><creatorcontrib>King, Christian</creatorcontrib><creatorcontrib>Vu, Giang T.</creatorcontrib><creatorcontrib>Gurupur, Varadraj</creatorcontrib><title>Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures</title><title>IEEE access</title><addtitle>Access</addtitle><description>The complexity and high dimensionality of healthcare data present substantial challenges in building machine learning (ML) models, given the large number of variables such as patient demographics and medical history. Effective feature selection is crucial to address issues such as increased computational resource, longer training times, overfitting, and reduced model interpretability, etc. This study evaluates a range of feature selection methods to identify the most impactful features for predicting dental expenditures using publicly available Medical Expenditure Panel Survey (MEPS) data. Sixteen ML models are assessed to determine the top performing model, after which state-of-the-art filter, wrapper, embedded, and hybrid feature selection techniques are applied to optimize the feature set. 
The highest performance, in terms of coefficient of determination (R2), is achieved using a hybrid feature selection method that combines the mutual information filter with the embedded features from the CatBoost regressor. The results indicate that the proposed system is suitable for real-time deployment even with reduced features, providing potential benefits such as minimizing the need for irrelevant and difficult-to-obtain features. Moreover, automated feature selection significantly enhances model performance, yielding a R2 score of 0.86, compared to the score 0.72 achieved with carefully selected manual features. Additionally, to enhance the interpretability of the top-performing ML model, explanatory visualizations are employed to examine the influence of key features on predicting dental expenditures.</description><subject>Biological system modeling</subject><subject>Clinical decision support systems</subject><subject>Clinical diagnosis</subject><subject>Computational modeling</subject><subject>Demographic variables</subject><subject>Dental care</subject><subject>Dental visits</subject><subject>Dentistry</subject><subject>Expenditures</subject><subject>Feature extraction</subject><subject>Feature selection</subject><subject>Health care</subject><subject>Identification methods</subject><subject>Insurance</subject><subject>Machine learning</subject><subject>Mathematical models</subject><subject>Medical services</subject><subject>Performance prediction</subject><subject>Predictive models</subject><subject>Real 
time</subject><subject>Surveys</subject><subject>Training</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>DOA</sourceid><recordid>eNpNkc9q3DAYxEVoIWGTJ2gPgp691b-1rd4Wx2kCWwp1exay9HmjxbG2khaSV-hTV46TEF0kBv1mBgahT5SsKSXy67Zp2q5bM8LEmouaUcnO0AWjpSz4hpcf3r3P0VWMB5JPnaVNdYH-tQ9HF5zRI-7SyT5hP-Ab0OkUAHcwgknOT_gHpHtvI3YT_gX7ADHO6uAD3umwh6LLPOBb0GO6Nzqj1zrpb3iLGx3h1XjCbUzuQSc37fE1TClnto9HmKyb4-Il-jjoMcLVy71Cf27a381tsfv5_a7Z7grDapkKQWDgpahkb3tO-GAlYdxIzeuKlVBWVU-AghVs4AOVlFNmetC276kFwYTlK3S3-FqvD-oYcqXwpLx26lnwYa90SM6MoDaElDWhnA8liIqXujK87jnXklFbGZK9vixex-D_niAmdfCnMOX6KgeTciNIrrBCfPllgo8xwPCWSomaN1TLhmreUL1smKnPC-UA4B1RMTJX-g_YWJfk</recordid><startdate>20240101</startdate><enddate>20240101</enddate><creator>Mayya, Veena</creator><creator>King, Christian</creator><creator>Vu, Giang T.</creator><creator>Gurupur, Varadraj</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0003-1596-9298</orcidid><orcidid>https://orcid.org/0000-0003-1157-5734</orcidid><orcidid>https://orcid.org/0000-0003-1159-6973</orcidid><orcidid>https://orcid.org/0000-0002-8091-5053</orcidid></search><sort><creationdate>20240101</creationdate><title>Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures</title><author>Mayya, Veena ; King, Christian ; Vu, Giang T. 
; Gurupur, Varadraj</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c289t-40ef36479bdb303fd9023c9a38726e677b0e1ed42f3f191312cbeadbb1de424d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Biological system modeling</topic><topic>Clinical decision support systems</topic><topic>Clinical diagnosis</topic><topic>Computational modeling</topic><topic>Demographic variables</topic><topic>Dental care</topic><topic>Dental visits</topic><topic>Dentistry</topic><topic>Expenditures</topic><topic>Feature extraction</topic><topic>Feature selection</topic><topic>Health care</topic><topic>Identification methods</topic><topic>Insurance</topic><topic>Machine learning</topic><topic>Mathematical models</topic><topic>Medical services</topic><topic>Performance prediction</topic><topic>Predictive models</topic><topic>Real time</topic><topic>Surveys</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Mayya, Veena</creatorcontrib><creatorcontrib>King, Christian</creatorcontrib><creatorcontrib>Vu, Giang T.</creatorcontrib><creatorcontrib>Gurupur, Varadraj</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Xplore</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with 
Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Mayya, Veena</au><au>King, Christian</au><au>Vu, Giang T.</au><au>Gurupur, Varadraj</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2024-01-01</date><risdate>2024</risdate><volume>12</volume><spage>1</spage><epage>1</epage><pages>1-1</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract>The complexity and high dimensionality of healthcare data present substantial challenges in building machine learning (ML) models, given the large number of variables such as patient demographics and medical history. Effective feature selection is crucial to address issues such as increased computational resource, longer training times, overfitting, and reduced model interpretability, etc. This study evaluates a range of feature selection methods to identify the most impactful features for predicting dental expenditures using publicly available Medical Expenditure Panel Survey (MEPS) data. Sixteen ML models are assessed to determine the top performing model, after which state-of-the-art filter, wrapper, embedded, and hybrid feature selection techniques are applied to optimize the feature set. The highest performance, in terms of coefficient of determination (R2), is achieved using a hybrid feature selection method that combines the mutual information filter with the embedded features from the CatBoost regressor. 
The results indicate that the proposed system is suitable for real-time deployment even with reduced features, providing potential benefits such as minimizing the need for irrelevant and difficult-to-obtain features. Moreover, automated feature selection significantly enhances model performance, yielding a R2 score of 0.86, compared to the score 0.72 achieved with carefully selected manual features. Additionally, to enhance the interpretability of the top-performing ML model, explanatory visualizations are employed to examine the influence of key features on predicting dental expenditures.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2024.3482192</doi><tpages>1</tpages><orcidid>https://orcid.org/0000-0003-1596-9298</orcidid><orcidid>https://orcid.org/0000-0003-1157-5734</orcidid><orcidid>https://orcid.org/0000-0003-1159-6973</orcidid><orcidid>https://orcid.org/0000-0002-8091-5053</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2024-01, Vol.12, p.1-1
issn 2169-3536
2169-3536
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_500680133f6e4736a7c38b33a921d7c0
source IEEE Open Access Journals
subjects Biological system modeling
Clinical decision support systems
Clinical diagnosis
Computational modeling
Demographic variables
Dental care
Dental visits
Dentistry
Expenditures
Feature extraction
Feature selection
Health care
Identification methods
Insurance
Machine learning
Mathematical models
Medical services
Performance prediction
Predictive models
Real time
Surveys
Training
title Empirical Study of Feature Selection Methods in Regression for Large-Scale Healthcare Data: A Case Study on Estimating Dental Expenditures
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T16%3A37%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Empirical%20Study%20of%20Feature%20Selection%20Methods%20in%20Regression%20for%20Large-Scale%20Healthcare%20Data:%20A%20Case%20Study%20on%20Estimating%20Dental%20Expenditures&rft.jtitle=IEEE%20access&rft.au=Mayya,%20Veena&rft.date=2024-01-01&rft.volume=12&rft.spage=1&rft.epage=1&rft.pages=1-1&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2024.3482192&rft_dat=%3Cproquest_doaj_%3E3120654091%3C/proquest_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c289t-40ef36479bdb303fd9023c9a38726e677b0e1ed42f3f191312cbeadbb1de424d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3120654091&rft_id=info:pmid/&rft_ieee_id=10720013&rfr_iscdi=true