A benchmarking study of classification techniques for behavioral data

Bibliographic Details
Published in: International journal of data science and analytics, 2020-03, Vol.9 (2), p.131-173
Main Authors: De Cnudde, Sofie; Martens, David; Evgeniou, Theodoros; Provost, Foster
Format: Article
Language: English
container_end_page 173
container_issue 2
container_start_page 131
container_title International journal of data science and analytics
container_volume 9
creator De Cnudde, Sofie
Martens, David
Evgeniou, Theodoros
Provost, Foster
description The predictive power of increasingly common large-scale, behavioral data has been demonstrated by previous research. Such data capture human behavior through the actions and/or interactions of people. Their sparsity and ultra-high dimensionality pose significant challenges to state-of-the-art classification techniques. Moreover, no prior work has systematically explored the choice of methods with respect to the trade-off between classification performance and computational expense. This paper provides a contribution in this direction through a benchmarking study. Eleven classification models are compared on forty-one fine-grained behavioral data sets. Statistical performance comparisons enriched with learning curve analyses demonstrate two important findings. First, there is an inherent generalization performance versus time trade-off, rendering the choice of an appropriate classifier dependent on computation constraints and data set characteristics. Well-regularized logistic regression achieves the best AUC; however, it takes the longest time to train. L2 regularization performs better than sparse L1 regularization. An attractive generalization/time trade-off is achieved by a similarity-based technique. Second, although the data sets used are large, the learning curve results illustrate that as a direct consequence of their high dimensionality and sparseness, significant value lies in collecting and analyzing even more data. This finding is observed both in the instance and in the feature dimensions, contrasting with learning curve studies on traditional data. The results of this study provide guidance for researchers and practitioners for the selection of appropriate classification techniques, sample sizes and data features, while also providing focus in scalable algorithm design in the face of large, behavioral data.
doi_str_mv 10.1007/s41060-019-00185-1
format article
publisher Cham: Springer International Publishing
rights Springer Nature Switzerland AG 2019
addtitle Int J Data Sci Anal
date 2020-03-01
tpages 43
lds50 peer_reviewed
oa free_for_read
identifier ISSN: 2364-415X
ispartof International journal of data science and analytics, 2020-03, Vol.9 (2), p.131-173
issn 2364-415X
eissn 2364-4168
language eng
recordid cdi_crossref_primary_10_1007_s41060_019_00185_1
source Springer Link
subjects Artificial Intelligence
Business Information Systems
Computational Biology/Bioinformatics
Computer Science
Data Mining and Knowledge Discovery
Database Management
Regular Paper
title A benchmarking study of classification techniques for behavioral data
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T06%3A15%3A37IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_sprin&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20benchmarking%20study%20of%20classification%20techniques%20for%20behavioral%20data&rft.jtitle=International%20journal%20of%20data%20science%20and%20analytics&rft.au=De%20Cnudde,%20Sofie&rft.date=2020-03-01&rft.volume=9&rft.issue=2&rft.spage=131&rft.epage=173&rft.pages=131-173&rft.issn=2364-415X&rft.eissn=2364-4168&rft_id=info:doi/10.1007/s41060-019-00185-1&rft_dat=%3Ccrossref_sprin%3E10_1007_s41060_019_00185_1%3C/crossref_sprin%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c383t-8d83a3213c5a4bdd4818e2655544fb005ec12ebd775e3af7a76d04ccd4216b383%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true