A benchmarking study of classification techniques for behavioral data
Published in: | International journal of data science and analytics 2020-03, Vol.9 (2), p.131-173 |
---|---|
Main Authors: | De Cnudde, Sofie ; Martens, David ; Evgeniou, Theodoros ; Provost, Foster |
Format: | Article |
Language: | English |
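Subjects: | Artificial Intelligence ; Business Information Systems ; Computational Biology/Bioinformatics ; Computer Science ; Data Mining and Knowledge Discovery ; Database Management ; Regular Paper |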
cited_by | cdi_FETCH-LOGICAL-c383t-8d83a3213c5a4bdd4818e2655544fb005ec12ebd775e3af7a76d04ccd4216b383 |
---|---|
cites | cdi_FETCH-LOGICAL-c383t-8d83a3213c5a4bdd4818e2655544fb005ec12ebd775e3af7a76d04ccd4216b383 |
container_end_page | 173 |
container_issue | 2 |
container_start_page | 131 |
container_title | International journal of data science and analytics |
container_volume | 9 |
creator | De Cnudde, Sofie ; Martens, David ; Evgeniou, Theodoros ; Provost, Foster |
description | The predictive power of increasingly common large-scale, behavioral data has been demonstrated by previous research. Such data capture human behavior through the actions and/or interactions of people. Their sparsity and ultra-high dimensionality pose significant challenges to state-of-the-art classification techniques. Moreover, no prior work has systematically explored the choice of methods with respect to the trade-off between classification performance and computational expense. This paper provides a contribution in this direction through a benchmarking study. Eleven classification models are compared on forty-one fine-grained behavioral data sets. Statistical performance comparisons enriched with learning curve analyses demonstrate two important findings. First, there is an inherent generalization performance versus time trade-off, rendering the choice of an appropriate classifier dependent on computation constraints and data set characteristics. Well-regularized logistic regression achieves the best AUC; however, it takes the longest time to train. L2 regularization performs better than sparse L1 regularization. An attractive generalization/time trade-off is achieved by a similarity-based technique. Second, although the data sets used are large, the learning curve results illustrate that as a direct consequence of their high dimensionality and sparseness, significant value lies in collecting and analyzing even more data. This finding is observed both in the instance and in the feature dimensions, contrasting with learning curve studies on traditional data. The results of this study provide guidance for researchers and practitioners for the selection of appropriate classification techniques, sample sizes and data features, while also providing focus in scalable algorithm design in the face of large, behavioral data. |
doi_str_mv | 10.1007/s41060-019-00185-1 |
format | article |
fullrecord | <record><control><sourceid>crossref_sprin</sourceid><recordid>TN_cdi_crossref_primary_10_1007_s41060_019_00185_1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_1007_s41060_019_00185_1</sourcerecordid><originalsourceid>FETCH-LOGICAL-c383t-8d83a3213c5a4bdd4818e2655544fb005ec12ebd775e3af7a76d04ccd4216b383</originalsourceid><addsrcrecordid>eNp9kMtKAzEUhoMoWGpfwFVeIHrO5DJxWUq9QMGNgruQyaVNrTOaTIW-vdGKS1fnLP7v5-cj5BLhCgHa6yIQFDDAGwaAWjI8IZOGK8EEKn3698uXczIrZQs11SoulZ6Q5Zx2oXebN5tfU7-mZdz7Ax0idTtbSorJ2TENPR2D2_TpYx8KjUOuzMZ-piHbHfV2tBfkLNpdCbPfOyXPt8unxT1bPd49LOYr5rjmI9Nec8sb5E5a0XkvNOrQKCmlELEDkMFhEzrftjJwG1vbKg_COS8aVF2tmJLm2OvyUEoO0bznVKcfDIL5dmGOLkx1YX5cGKwQP0Klhvt1yGY77HNfd_5HfQH2hmIO</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>A benchmarking study of classification techniques for behavioral data</title><source>Springer Link</source><creator>De Cnudde, Sofie ; Martens, David ; Evgeniou, Theodoros ; Provost, Foster</creator><creatorcontrib>De Cnudde, Sofie ; Martens, David ; Evgeniou, Theodoros ; Provost, Foster</creatorcontrib><description>The predictive power of increasingly common large-scale, behavioral data has been demonstrated by previous research. Such data capture human behavior through the actions and/or interactions of people. Their sparsity and ultra-high dimensionality pose significant challenges to state-of-the-art classification techniques. Moreover, no prior work has systematically explored the choice of methods with respect to the trade-off between classification performance and computational expense. This paper provides a contribution in this direction through a benchmarking study. Eleven classification models are compared on forty-one fine-grained behavioral data sets. Statistical performance comparisons enriched with learning curve analyses demonstrate two important findings. 
First, there is an inherent generalization performance versus time trade-off, rendering the choice of an appropriate classifier dependent on computation constraints and data set characteristics. Well-regularized logistic regression achieves the best AUC; however, it takes the longest time to train. L2 regularization performs better than sparse L1 regularization. An attractive generalization/time trade-off is achieved by a similarity-based technique. Second, although the data sets used are large, the learning curve results illustrate that as a direct consequence of their high dimensionality and sparseness, significant value lies in collecting and analyzing even more data. This finding is observed both in the instance and in the feature dimensions, contrasting with learning curve studies on traditional data. The results of this study provide guidance for researchers and practitioners for the selection of appropriate classification techniques, sample sizes and data features, while also providing focus in scalable algorithm design in the face of large, behavioral data.</description><identifier>ISSN: 2364-415X</identifier><identifier>EISSN: 2364-4168</identifier><identifier>DOI: 10.1007/s41060-019-00185-1</identifier><language>eng</language><publisher>Cham: Springer International Publishing</publisher><subject>Artificial Intelligence ; Business Information Systems ; Computational Biology/Bioinformatics ; Computer Science ; Data Mining and Knowledge Discovery ; Database Management ; Regular Paper</subject><ispartof>International journal of data science and analytics, 2020-03, Vol.9 (2), p.131-173</ispartof><rights>Springer Nature Switzerland AG 
2019</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c383t-8d83a3213c5a4bdd4818e2655544fb005ec12ebd775e3af7a76d04ccd4216b383</citedby><cites>FETCH-LOGICAL-c383t-8d83a3213c5a4bdd4818e2655544fb005ec12ebd775e3af7a76d04ccd4216b383</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>De Cnudde, Sofie</creatorcontrib><creatorcontrib>Martens, David</creatorcontrib><creatorcontrib>Evgeniou, Theodoros</creatorcontrib><creatorcontrib>Provost, Foster</creatorcontrib><title>A benchmarking study of classification techniques for behavioral data</title><title>International journal of data science and analytics</title><addtitle>Int J Data Sci Anal</addtitle><description>The predictive power of increasingly common large-scale, behavioral data has been demonstrated by previous research. Such data capture human behavior through the actions and/or interactions of people. Their sparsity and ultra-high dimensionality pose significant challenges to state-of-the-art classification techniques. Moreover, no prior work has systematically explored the choice of methods with respect to the trade-off between classification performance and computational expense. This paper provides a contribution in this direction through a benchmarking study. Eleven classification models are compared on forty-one fine-grained behavioral data sets. Statistical performance comparisons enriched with learning curve analyses demonstrate two important findings. First, there is an inherent generalization performance versus time trade-off, rendering the choice of an appropriate classifier dependent on computation constraints and data set characteristics. 
Well-regularized logistic regression achieves the best AUC; however, it takes the longest time to train. L2 regularization performs better than sparse L1 regularization. An attractive generalization/time trade-off is achieved by a similarity-based technique. Second, although the data sets used are large, the learning curve results illustrate that as a direct consequence of their high dimensionality and sparseness, significant value lies in collecting and analyzing even more data. This finding is observed both in the instance and in the feature dimensions, contrasting with learning curve studies on traditional data. The results of this study provide guidance for researchers and practitioners for the selection of appropriate classification techniques, sample sizes and data features, while also providing focus in scalable algorithm design in the face of large, behavioral data.</description><subject>Artificial Intelligence</subject><subject>Business Information Systems</subject><subject>Computational Biology/Bioinformatics</subject><subject>Computer Science</subject><subject>Data Mining and Knowledge Discovery</subject><subject>Database Management</subject><subject>Regular Paper</subject><issn>2364-415X</issn><issn>2364-4168</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><recordid>eNp9kMtKAzEUhoMoWGpfwFVeIHrO5DJxWUq9QMGNgruQyaVNrTOaTIW-vdGKS1fnLP7v5-cj5BLhCgHa6yIQFDDAGwaAWjI8IZOGK8EEKn3698uXczIrZQs11SoulZ6Q5Zx2oXebN5tfU7-mZdz7Ax0idTtbSorJ2TENPR2D2_TpYx8KjUOuzMZ-piHbHfV2tBfkLNpdCbPfOyXPt8unxT1bPd49LOYr5rjmI9Nec8sb5E5a0XkvNOrQKCmlELEDkMFhEzrftjJwG1vbKg_COS8aVF2tmJLm2OvyUEoO0bznVKcfDIL5dmGOLkx1YX5cGKwQP0Klhvt1yGY77HNfd_5HfQH2hmIO</recordid><startdate>20200301</startdate><enddate>20200301</enddate><creator>De Cnudde, Sofie</creator><creator>Martens, David</creator><creator>Evgeniou, Theodoros</creator><creator>Provost, Foster</creator><general>Springer International 
Publishing</general><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20200301</creationdate><title>A benchmarking study of classification techniques for behavioral data</title><author>De Cnudde, Sofie ; Martens, David ; Evgeniou, Theodoros ; Provost, Foster</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c383t-8d83a3213c5a4bdd4818e2655544fb005ec12ebd775e3af7a76d04ccd4216b383</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Artificial Intelligence</topic><topic>Business Information Systems</topic><topic>Computational Biology/Bioinformatics</topic><topic>Computer Science</topic><topic>Data Mining and Knowledge Discovery</topic><topic>Database Management</topic><topic>Regular Paper</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>De Cnudde, Sofie</creatorcontrib><creatorcontrib>Martens, David</creatorcontrib><creatorcontrib>Evgeniou, Theodoros</creatorcontrib><creatorcontrib>Provost, Foster</creatorcontrib><collection>CrossRef</collection><jtitle>International journal of data science and analytics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>De Cnudde, Sofie</au><au>Martens, David</au><au>Evgeniou, Theodoros</au><au>Provost, Foster</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A benchmarking study of classification techniques for behavioral data</atitle><jtitle>International journal of data science and analytics</jtitle><stitle>Int J Data Sci Anal</stitle><date>2020-03-01</date><risdate>2020</risdate><volume>9</volume><issue>2</issue><spage>131</spage><epage>173</epage><pages>131-173</pages><issn>2364-415X</issn><eissn>2364-4168</eissn><abstract>The predictive power of increasingly common large-scale, behavioral data has been demonstrated by previous research. 
Such data capture human behavior through the actions and/or interactions of people. Their sparsity and ultra-high dimensionality pose significant challenges to state-of-the-art classification techniques. Moreover, no prior work has systematically explored the choice of methods with respect to the trade-off between classification performance and computational expense. This paper provides a contribution in this direction through a benchmarking study. Eleven classification models are compared on forty-one fine-grained behavioral data sets. Statistical performance comparisons enriched with learning curve analyses demonstrate two important findings. First, there is an inherent generalization performance versus time trade-off, rendering the choice of an appropriate classifier dependent on computation constraints and data set characteristics. Well-regularized logistic regression achieves the best AUC; however, it takes the longest time to train. L2 regularization performs better than sparse L1 regularization. An attractive generalization/time trade-off is achieved by a similarity-based technique. Second, although the data sets used are large, the learning curve results illustrate that as a direct consequence of their high dimensionality and sparseness, significant value lies in collecting and analyzing even more data. This finding is observed both in the instance and in the feature dimensions, contrasting with learning curve studies on traditional data. The results of this study provide guidance for researchers and practitioners for the selection of appropriate classification techniques, sample sizes and data features, while also providing focus in scalable algorithm design in the face of large, behavioral data.</abstract><cop>Cham</cop><pub>Springer International Publishing</pub><doi>10.1007/s41060-019-00185-1</doi><tpages>43</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2364-415X |
ispartof | International journal of data science and analytics, 2020-03, Vol.9 (2), p.131-173 |
issn | 2364-415X ; 2364-4168 |
language | eng |
recordid | cdi_crossref_primary_10_1007_s41060_019_00185_1 |
source | Springer Link |
subjects | Artificial Intelligence ; Business Information Systems ; Computational Biology/Bioinformatics ; Computer Science ; Data Mining and Knowledge Discovery ; Database Management ; Regular Paper |
title | A benchmarking study of classification techniques for behavioral data |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T06%3A15%3A37IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_sprin&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20benchmarking%20study%20of%20classification%20techniques%20for%20behavioral%20data&rft.jtitle=International%20journal%20of%20data%20science%20and%20analytics&rft.au=De%20Cnudde,%20Sofie&rft.date=2020-03-01&rft.volume=9&rft.issue=2&rft.spage=131&rft.epage=173&rft.pages=131-173&rft.issn=2364-415X&rft.eissn=2364-4168&rft_id=info:doi/10.1007/s41060-019-00185-1&rft_dat=%3Ccrossref_sprin%3E10_1007_s41060_019_00185_1%3C/crossref_sprin%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c383t-8d83a3213c5a4bdd4818e2655544fb005ec12ebd775e3af7a76d04ccd4216b383%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |