Optimal subset selection for causal inference using machine learning ensembles and particle swarm optimization
We suggest and evaluate a method for optimal construction of synthetic treatment and control samples for the purpose of drawing causal inference. The balance optimization subset selection problem, which formulates minimization of aggregate imbalance in covariate distributions to reduce bias in data,...
Published in: | Complex & intelligent systems, 2021-02, Vol.7 (1), p.41-59 |
---|---|
Main Authors: | Sharma, Dhruv; Willy, Christopher; Bischoff, John |
Format: | Article |
Language: | English |
Subjects: | Automatic control; Complexity; Computational Intelligence; Cost function; Data Structures and Information Theory; Datasets; Engineering; Inference; Machine learning; Operations research; Optimization; Particle swarm optimization; Tempering |
container_end_page | 59 |
container_issue | 1 |
container_start_page | 41 |
container_title | Complex & intelligent systems |
container_volume | 7 |
creator | Sharma, Dhruv; Willy, Christopher; Bischoff, John
description | We suggest and evaluate a method for optimal construction of synthetic treatment and control samples for the purpose of drawing causal inference. The balance optimization subset selection problem, which formulates minimization of aggregate imbalance in covariate distributions to reduce bias in data, is a new area of study in operations research. We investigate a novel metric, cross-validated area under the receiver operating characteristic curve (AUC) as a measure of balance between treatment and control groups. The proposed approach provides direct and automatic balancing of covariate distributions. In addition, the AUC-based approach is able to detect subtler distributional differences than existing measures, such as simple empirical mean/variance and count-based metrics. Thus, optimizing AUCs achieves a greater balance than the existing methods. Using 5 widely used real data sets and 7 synthetic data sets, we show that optimization of samples using existing methods (Chi-square, mean variance differences, Kolmogorov–Smirnov, and Mahalanobis) results in samples containing imbalance that is detectable using machine learning ensembles. We minimize covariate imbalance by minimizing the absolute value of the distance of the maximum cross-validated AUC on M folds from 0.50, using evolutionary optimization. We demonstrate that particle swarm optimization (PSO) outperforms modified cuckoo swarm (MCS) for a gradient-free, non-linear noisy cost function. To compute AUCs, we use supervised binary classification approaches from the machine learning and credit scoring literature. Using superscore ensembles adds to the classifier-based two-sample testing literature. If the mean cross-validated AUC based on machine learning is 0.50, the two groups are indistinguishable and suitable for causal inference.
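
The description above carries the substance of the record; as a reading aid, the sketch below illustrates the balance objective and subset search it describes. This is not the authors' implementation: scikit-learn's gradient-boosting classifier stands in for the superscore ensemble, the names `cv_auc_balance`, `balance_cost`, and `pso_select` are illustrative, and encoding a candidate control subset as the top-k entries of a continuous particle is an assumption made for brevity, not necessarily the paper's exact scheme.

```python
# Illustrative sketch only (not the authors' code): the AUC-based balance
# objective from the abstract, plus a toy PSO that searches a control pool
# for a subset minimizing it.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def cv_auc_balance(X_treat, X_ctrl, n_folds=5, seed=0):
    """Cross-validated AUCs of a classifier trying to separate treatment rows
    from control rows using covariates alone; AUCs near 0.50 mean the groups
    are indistinguishable, i.e. well balanced."""
    X = np.vstack([X_treat, X_ctrl])
    y = np.concatenate([np.ones(len(X_treat)), np.zeros(len(X_ctrl))])
    clf = GradientBoostingClassifier(random_state=seed)
    return cross_val_score(clf, X, y, cv=n_folds, scoring="roc_auc")


def balance_cost(X_treat, X_ctrl_subset):
    """Cost minimized in the paper: |max fold AUC - 0.50|. Gradient-free and
    noisy, which motivates a swarm optimizer."""
    return abs(cv_auc_balance(X_treat, X_ctrl_subset).max() - 0.50)


def pso_select(X_treat, X_pool, k, n_particles=10, n_iter=10, seed=0):
    """Toy PSO: each particle holds a continuous score per pool row; its top-k
    scores define a candidate control subset, evaluated by balance_cost."""
    rng = np.random.default_rng(seed)
    d = len(X_pool)
    cost = lambda p: balance_cost(X_treat, X_pool[np.argsort(-p)[:k]])
    pos = rng.uniform(0.0, 1.0, (n_particles, d))
    vel = np.zeros((n_particles, d))
    pbest, pbest_cost = pos.copy(), np.array([cost(p) for p in pos])
    gbest = pbest[pbest_cost.argmin()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.uniform(size=(2, n_particles, d))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        c = np.array([cost(p) for p in pos])
        better = c < pbest_cost
        pbest[better], pbest_cost[better] = pos[better], c[better]
        gbest = pbest[pbest_cost.argmin()].copy()
    return np.argsort(-gbest)[:k], pbest_cost.min()


# Hypothetical data: a shifted treatment group and a larger control pool.
rng = np.random.default_rng(0)
X_treat = rng.normal(0.3, 1.0, size=(100, 5))
X_pool = rng.normal(0.0, 1.0, size=(500, 5))
subset, best_cost = pso_select(X_treat, X_pool, k=100)
print(f"best |max CV AUC - 0.5| = {best_cost:.3f}")
```

A cost near zero means the maximum cross-validated AUC sits at 0.50, i.e. the selected control subset cannot be told apart from the treatment group on the covariates, which is the paper's criterion for a sample suitable for causal inference.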
doi_str_mv | 10.1007/s40747-020-00169-w |
format | article |
identifier | ISSN: 2199-4536 |
ispartof | Complex & intelligent systems, 2021-02, Vol.7 (1), p.41-59 |
issn | 2199-4536; 2198-6053
language | eng |
source | Publicly Available Content Database; Springer Nature - SpringerLink Journals - Fully Open Access |
subjects | Automatic control; Complexity; Computational Intelligence; Cost function; Data Structures and Information Theory; Datasets; Engineering; Inference; Machine learning; Operations research; Optimization; Original Article; Particle swarm optimization; Tempering
title | Optimal subset selection for causal inference using machine learning ensembles and particle swarm optimization |