A comparative evaluation of aggregation methods for machine learning over vertically partitioned data

•We compare aggregation methods for vertically partitioned data in several scenarios.
•The impact of dataset characteristics on aggregators’ performance is investigated.
•The silhouette and imbalance coefficients are the most influential characteristics.
•The impact of these characteristics varies according to the specif...

Bibliographic Details
Published in:	Expert systems with applications 2020-08, Vol.152, p.113406, Article 113406
Main Authors:	Trevizan, Bernardo; Chamby-Diaz, Jorge; Bazzan, Ana L.C.; Recamonde-Mendoza, Mariana
Format:	Article
Language:	English
Subjects:	Attribute-partitioned data; Classification; Distributed machine learning; Machine learning; Predictions aggregation; Vertical data partitioning

container_end_page
container_issue
container_start_page	113406
container_title	Expert systems with applications
container_volume	152
creator	Trevizan, Bernardo; Chamby-Diaz, Jorge; Bazzan, Ana L.C.; Recamonde-Mendoza, Mariana
description	•We compare aggregation methods for vertically partitioned data in several scenarios. •The impact of dataset characteristics on aggregators’ performance is investigated. •The silhouette and imbalance coefficients are the most influential characteristics. •The impact of these characteristics varies according to the specific scenario. •Decision and regression trees are trained to guide the aggregator choice. Applications in which data are naturally generated in a distributed fashion are increasingly common, especially after the emergence of technologies like the Internet of Things (IoT). In sensor networks, collaborative health or genomic projects, and credit risk analysis, among other domains, distinct features are collected from multiple sources, including social media and mobile applications, and, due to privacy concerns or communication costs, may not be shared among sites. This scenario of vertical data partitioning poses challenges to traditional machine learning (ML) approaches, as classical algorithms are designed to learn from the complete set of features. A common strategy is to combine predictions from local models trained at each site into a global model, and several aggregation methods have been proposed for this purpose. In this work, we address a gap in the related literature by performing a comparative evaluation of elementary and meta-learning-based aggregation methods to reveal their strengths and weaknesses on 46 datasets with varied characteristics. We show that no method outperforms its counterparts in all domains, emphasizing the need for experimental comparison to ensure a good choice in the domain of interest. Moreover, our experiments provide the first insights into the relations between datasets’ properties and aggregators’ performance. We show that for low class imbalance and a good instance-to-feature ratio, almost all aggregation methods tend to perform well. The silhouette coefficient (reflecting class separability) and the class imbalance coefficient are the properties with the greatest influence on aggregators’ performance; we therefore recommend analyzing them in the first step of the methodological design. We found that arithmetic-based methods are not suitable for datasets with poor class separability and a large number of classes, whereas meta-learning approaches are less sensitive to datasets with a silhouette coefficient close to 0. Our analyses were summarized as classification and regression trees, which have the potential to serve as practical tools for future research. Taken together, our findings give rise to interesting applications in the domain of intelligent systems, especially regarding their potential to reduce the burden of vast experimental comparisons when training ML models with feature-partitioned data.
doi_str_mv	10.1016/j.eswa.2020.113406
format	article
publisher	New York: Elsevier Ltd
rights	2020 Elsevier Ltd
orcid	https://orcid.org/0000-0001-6765-2650 https://orcid.org/0000-0003-2800-1032
fulltext	fulltext
identifier	ISSN: 0957-4174
ispartof	Expert systems with applications, 2020-08, Vol.152, p.113406, Article 113406
issn	0957-4174 1873-6793
language	eng
recordid	cdi_proquest_journals_2437431906
source	ScienceDirect Freedom Collection 2022-2024
subjects	Attribute-partitioned data; Classification; Distributed machine learning; Machine learning; Predictions aggregation; Vertical data partitioning
title	A comparative evaluation of aggregation methods for machine learning over vertically partitioned data
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T05%3A28%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20comparative%20evaluation%20of%20aggregation%20methods%20for%20machine%20learning%20over%20vertically%20partitioned%20data&rft.jtitle=Expert%20systems%20with%20applications&rft.au=Trevizan,%20Bernardo&rft.date=2020-08-15&rft.volume=152&rft.spage=113406&rft.pages=113406-&rft.artnum=113406&rft.issn=0957-4174&rft.eissn=1873-6793&rft_id=info:doi/10.1016/j.eswa.2020.113406&rft_dat=%3Cproquest_cross%3E2437431906%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c328t-107027c01d39959c8a05bba19bcab74c100086ce26a207bf5d22c7ecceb518a63%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2437431906&rft_id=info:pmid/&rfr_iscdi=true