
Auto-FP: An Experimental Study of Automated Feature Preprocessing for Tabular Data

Classical machine learning models, such as linear models and tree-based models, are widely used in industry. These models are sensitive to data distribution, so feature preprocessing, which transforms features from one distribution to another, is a crucial step to ensure good model quality. Manually constructing a feature preprocessing pipeline is challenging because data scientists need to make difficult decisions about which preprocessors to select and in which order to compose them. In this paper, we study how to automate feature preprocessing (Auto-FP) for tabular data. Due to the large search space, a brute-force solution is prohibitively expensive. To address this challenge, we observe that Auto-FP can be modelled as either a hyperparameter optimization (HPO) or a neural architecture search (NAS) problem. This observation enables us to extend a variety of HPO and NAS algorithms to solve the Auto-FP problem. We conduct a comprehensive evaluation and analysis of 15 algorithms on 45 public ML datasets. Overall, evolution-based algorithms show the leading average ranking. Surprisingly, random search turns out to be a strong baseline. Many surrogate-model-based and bandit-based search algorithms, which achieve good performance for HPO and NAS, do not outperform random search for Auto-FP. We analyze the reasons for our findings and conduct a bottleneck analysis to identify opportunities to improve these algorithms. Furthermore, we explore how to extend Auto-FP to support parameter search and compare two ways to achieve this goal. Finally, we evaluate Auto-FP in an AutoML context and discuss the limitations of popular AutoML tools. To the best of our knowledge, this is the first study on automated feature preprocessing. We hope our work can inspire researchers to develop new algorithms tailored for Auto-FP.
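The abstract describes searching over which preprocessors to apply and in what order, and reports that plain random search is a strong baseline for this problem. The sketch below illustrates that kind of random pipeline search with scikit-learn; it is not the authors' implementation, and the preprocessor pool, maximum pipeline length, dataset, search budget, and downstream model are illustrative assumptions only.

```python
# A minimal sketch (not the paper's code) of random search over feature
# preprocessing pipelines: sample an ordered sequence of preprocessors,
# prepend it to a linear model, and keep the best-scoring pipeline.
import random

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    MaxAbsScaler, MinMaxScaler, Normalizer, PowerTransformer,
    QuantileTransformer, RobustScaler, StandardScaler,
)

# Candidate preprocessors and maximum pipeline length are assumptions,
# not the search space used in the paper.
PREPROCESSORS = [
    StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler,
    Normalizer, PowerTransformer, QuantileTransformer,
]
MAX_LEN = 3

def sample_pipeline(rng: random.Random) -> Pipeline:
    """Draw a random ordered sequence of preprocessors ending in a classifier."""
    length = rng.randint(0, MAX_LEN)
    steps = [(f"step{i}", rng.choice(PREPROCESSORS)()) for i in range(length)]
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rng = random.Random(0)
best_score, best_pipe = -1.0, None
for _ in range(20):  # search budget: 20 randomly sampled pipelines
    pipe = sample_pipeline(rng)
    score = pipe.fit(X_tr, y_tr).score(X_val, y_val)
    if score > best_score:
        best_score, best_pipe = score, pipe

print(f"best validation accuracy: {best_score:.3f}")
print(best_pipe)
```

The evolution-based, surrogate-model-based, and bandit-based searchers compared in the paper would replace the independent sampling loop above with, respectively, mutation and crossover over pipelines, a learned model proposing the next pipeline, or early termination of unpromising candidates.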

Bibliographic Details
Published in: arXiv.org, 2023-10
Main Authors: Qi, Danrui; Peng, Jinglin; He, Yongjun; Wang, Jiannan
Format: Article
Language: English
Subjects: Algorithms; Automation; Evolutionary algorithms; Machine learning; Optimization; Preprocessing; Search algorithms; Tables (data)
EISSN: 2331-8422
Publisher: Cornell University Library, Ithaca (arXiv.org)