Loading…

On the efficiency of data collection for multiple Naïve Bayes classifiers

Many classification problems are solved by aggregating the output of a group of distinct predictors. In this respect, a popular choice is to assume independence and employ a Naïve Bayes classifier. When we have not just one but multiple classification problems at the same time, the question of how t...

Full description

Saved in:
Bibliographic Details
Published in:Artificial intelligence 2019-10, Vol.275, p.356-378
Main Authors: Manino, Edoardo, Tran-Thanh, Long, Jennings, Nicholas R.
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
cited_by cdi_FETCH-LOGICAL-c380t-551a0a2a55961a2e86d764237233cb345fdafc39cf54a21edaf014ddbae2ea843
cites cdi_FETCH-LOGICAL-c380t-551a0a2a55961a2e86d764237233cb345fdafc39cf54a21edaf014ddbae2ea843
container_end_page 378
container_issue
container_start_page 356
container_title Artificial intelligence
container_volume 275
creator Manino, Edoardo
Tran-Thanh, Long
Jennings, Nicholas R.
description Many classification problems are solved by aggregating the output of a group of distinct predictors. In this respect, a popular choice is to assume independence and employ a Naïve Bayes classifier. When we have not just one but multiple classification problems at the same time, the question of how to assign the limited pool of available predictors to the individual classification problems arises. Empirical studies show that the policies we use to perform such assignments have a strong impact on the accuracy of the system. However, to date there is little theoretical understanding of this phenomenon. To help rectify this, in this paper we provide the first theoretical explanation of the accuracy gap between the most popular policies: the non-adaptive uniform allocation, and the adaptive allocation schemes based on uncertainty sampling and information gain maximisation. To do so, we propose a novel representation of the data collection process in terms of random walks. Then, we use this tool to derive new lower and upper bounds on the accuracy of the policies. These bounds reveal that the tradeoff between the number of available predictors and the accuracy has a different exponential rate depending on the policy used. By comparing them, we are able to quantify the advantage that the two adaptive policies have over the non-adaptive one for the first time, and prove that the probability of error of the former decays at more than double the exponential rate of the latter. Furthermore, we show in our analysis that this result holds both in the case where we know the accuracy of each individual predictor, and in the case where we only have access to a noisy estimate of it.
doi_str_mv 10.1016/j.artint.2019.06.010
format article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2311933141</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0004370218305551</els_id><sourcerecordid>2311933141</sourcerecordid><originalsourceid>FETCH-LOGICAL-c380t-551a0a2a55961a2e86d764237233cb345fdafc39cf54a21edaf014ddbae2ea843</originalsourceid><addsrcrecordid>eNp9kM1KAzEUhYMoWKtv4CLgesbcZH43ghZ_KXaj65BmbjDDdFKTtNCn8iF8MVPr2tXlwDnncj5CLoHlwKC67nPlox1jzhm0OatyBuyITKCpeVa3HI7JhDFWZKJm_JSchdAnKdoWJuRlMdL4gRSNsdriqHfUGdqpqKh2w4A6WjdS4zxdbYZo1wPSV_X9tUV6p3YYqB5UCNZY9OGcnBg1BLz4u1Py_nD_NnvK5ovH59ntPNOiYTErS1BMcVWWbQWKY1N1dVVwUXMh9FIUpemU0aLVpiwUB0yKQdF1S4UcVVOIKbk69K69-9xgiLJ3Gz-ml5ILgFYIKCC5ioNLexeCRyPX3q6U30lgck9N9vJATe6pSVbJRC3Fbg4xTAu2aZYMv1iwsz7BkJ2z_xf8AG59eGE</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2311933141</pqid></control><display><type>article</type><title>On the efficiency of data collection for multiple Naïve Bayes classifiers</title><source>ScienceDirect Journals</source><creator>Manino, Edoardo ; Tran-Thanh, Long ; Jennings, Nicholas R.</creator><creatorcontrib>Manino, Edoardo ; Tran-Thanh, Long ; Jennings, Nicholas R.</creatorcontrib><description>Many classification problems are solved by aggregating the output of a group of distinct predictors. In this respect, a popular choice is to assume independence and employ a Naïve Bayes classifier. When we have not just one but multiple classification problems at the same time, the question of how to assign the limited pool of available predictors to the individual classification problems arises. Empirical studies show that the policies we use to perform such assignments have a strong impact on the accuracy of the system. However, to date there is little theoretical understanding of this phenomenon. To help rectify this, in this paper we provide the first theoretical explanation of the accuracy gap between the most popular policies: the non-adaptive uniform allocation, and the adaptive allocation schemes based on uncertainty sampling and information gain maximisation. To do so, we propose a novel representation of the data collection process in terms of random walks. Then, we use this tool to derive new lower and upper bounds on the accuracy of the policies. These bounds reveal that the tradeoff between the number of available predictors and the accuracy has a different exponential rate depending on the policy used. By comparing them, we are able to quantify the advantage that the two adaptive policies have over the non-adaptive one for the first time, and prove that the probability of error of the former decays at more than double the exponential rate of the latter. Furthermore, we show in our analysis that this result holds both in the case where we know the accuracy of each individual predictor, and in the case where we only have access to a noisy estimate of it.</description><identifier>ISSN: 0004-3702</identifier><identifier>EISSN: 1872-7921</identifier><identifier>DOI: 10.1016/j.artint.2019.06.010</identifier><language>eng</language><publisher>Amsterdam: Elsevier B.V</publisher><subject>Accuracy ; Active learning ; Adaptive sampling ; Classification ; Classifiers ; Crowdsourcing ; Data collection ; Decay rate ; Empirical analysis ; Naïve Bayes ; Policies ; Random walk ; Upper bounds</subject><ispartof>Artificial intelligence, 2019-10, Vol.275, p.356-378</ispartof><rights>2019 Elsevier B.V.</rights><rights>Copyright Elsevier Science Ltd. Oct 2019</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c380t-551a0a2a55961a2e86d764237233cb345fdafc39cf54a21edaf014ddbae2ea843</citedby><cites>FETCH-LOGICAL-c380t-551a0a2a55961a2e86d764237233cb345fdafc39cf54a21edaf014ddbae2ea843</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Manino, Edoardo</creatorcontrib><creatorcontrib>Tran-Thanh, Long</creatorcontrib><creatorcontrib>Jennings, Nicholas R.</creatorcontrib><title>On the efficiency of data collection for multiple Naïve Bayes classifiers</title><title>Artificial intelligence</title><description>Many classification problems are solved by aggregating the output of a group of distinct predictors. In this respect, a popular choice is to assume independence and employ a Naïve Bayes classifier. When we have not just one but multiple classification problems at the same time, the question of how to assign the limited pool of available predictors to the individual classification problems arises. Empirical studies show that the policies we use to perform such assignments have a strong impact on the accuracy of the system. However, to date there is little theoretical understanding of this phenomenon. To help rectify this, in this paper we provide the first theoretical explanation of the accuracy gap between the most popular policies: the non-adaptive uniform allocation, and the adaptive allocation schemes based on uncertainty sampling and information gain maximisation. To do so, we propose a novel representation of the data collection process in terms of random walks. Then, we use this tool to derive new lower and upper bounds on the accuracy of the policies. These bounds reveal that the tradeoff between the number of available predictors and the accuracy has a different exponential rate depending on the policy used. By comparing them, we are able to quantify the advantage that the two adaptive policies have over the non-adaptive one for the first time, and prove that the probability of error of the former decays at more than double the exponential rate of the latter. Furthermore, we show in our analysis that this result holds both in the case where we know the accuracy of each individual predictor, and in the case where we only have access to a noisy estimate of it.</description><subject>Accuracy</subject><subject>Active learning</subject><subject>Adaptive sampling</subject><subject>Classification</subject><subject>Classifiers</subject><subject>Crowdsourcing</subject><subject>Data collection</subject><subject>Decay rate</subject><subject>Empirical analysis</subject><subject>Naïve Bayes</subject><subject>Policies</subject><subject>Random walk</subject><subject>Upper bounds</subject><issn>0004-3702</issn><issn>1872-7921</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2019</creationdate><recordtype>article</recordtype><recordid>eNp9kM1KAzEUhYMoWKtv4CLgesbcZH43ghZ_KXaj65BmbjDDdFKTtNCn8iF8MVPr2tXlwDnncj5CLoHlwKC67nPlox1jzhm0OatyBuyITKCpeVa3HI7JhDFWZKJm_JSchdAnKdoWJuRlMdL4gRSNsdriqHfUGdqpqKh2w4A6WjdS4zxdbYZo1wPSV_X9tUV6p3YYqB5UCNZY9OGcnBg1BLz4u1Py_nD_NnvK5ovH59ntPNOiYTErS1BMcVWWbQWKY1N1dVVwUXMh9FIUpemU0aLVpiwUB0yKQdF1S4UcVVOIKbk69K69-9xgiLJ3Gz-ml5ILgFYIKCC5ioNLexeCRyPX3q6U30lgck9N9vJATe6pSVbJRC3Fbg4xTAu2aZYMv1iwsz7BkJ2z_xf8AG59eGE</recordid><startdate>201910</startdate><enddate>201910</enddate><creator>Manino, Edoardo</creator><creator>Tran-Thanh, Long</creator><creator>Jennings, Nicholas R.</creator><general>Elsevier B.V</general><general>Elsevier Science Ltd</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>201910</creationdate><title>On the efficiency of data collection for multiple Naïve Bayes classifiers</title><author>Manino, Edoardo ; Tran-Thanh, Long ; Jennings, Nicholas R.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c380t-551a0a2a55961a2e86d764237233cb345fdafc39cf54a21edaf014ddbae2ea843</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2019</creationdate><topic>Accuracy</topic><topic>Active learning</topic><topic>Adaptive sampling</topic><topic>Classification</topic><topic>Classifiers</topic><topic>Crowdsourcing</topic><topic>Data collection</topic><topic>Decay rate</topic><topic>Empirical analysis</topic><topic>Naïve Bayes</topic><topic>Policies</topic><topic>Random walk</topic><topic>Upper bounds</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Manino, Edoardo</creatorcontrib><creatorcontrib>Tran-Thanh, Long</creatorcontrib><creatorcontrib>Jennings, Nicholas R.</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Artificial intelligence</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Manino, Edoardo</au><au>Tran-Thanh, Long</au><au>Jennings, Nicholas R.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>On the efficiency of data collection for multiple Naïve Bayes classifiers</atitle><jtitle>Artificial intelligence</jtitle><date>2019-10</date><risdate>2019</risdate><volume>275</volume><spage>356</spage><epage>378</epage><pages>356-378</pages><issn>0004-3702</issn><eissn>1872-7921</eissn><abstract>Many classification problems are solved by aggregating the output of a group of distinct predictors. In this respect, a popular choice is to assume independence and employ a Naïve Bayes classifier. When we have not just one but multiple classification problems at the same time, the question of how to assign the limited pool of available predictors to the individual classification problems arises. Empirical studies show that the policies we use to perform such assignments have a strong impact on the accuracy of the system. However, to date there is little theoretical understanding of this phenomenon. To help rectify this, in this paper we provide the first theoretical explanation of the accuracy gap between the most popular policies: the non-adaptive uniform allocation, and the adaptive allocation schemes based on uncertainty sampling and information gain maximisation. To do so, we propose a novel representation of the data collection process in terms of random walks. Then, we use this tool to derive new lower and upper bounds on the accuracy of the policies. These bounds reveal that the tradeoff between the number of available predictors and the accuracy has a different exponential rate depending on the policy used. By comparing them, we are able to quantify the advantage that the two adaptive policies have over the non-adaptive one for the first time, and prove that the probability of error of the former decays at more than double the exponential rate of the latter. Furthermore, we show in our analysis that this result holds both in the case where we know the accuracy of each individual predictor, and in the case where we only have access to a noisy estimate of it.</abstract><cop>Amsterdam</cop><pub>Elsevier B.V</pub><doi>10.1016/j.artint.2019.06.010</doi><tpages>23</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0004-3702
ispartof Artificial intelligence, 2019-10, Vol.275, p.356-378
issn 0004-3702
1872-7921
language eng
recordid cdi_proquest_journals_2311933141
source ScienceDirect Journals
subjects Accuracy
Active learning
Adaptive sampling
Classification
Classifiers
Crowdsourcing
Data collection
Decay rate
Empirical analysis
Naïve Bayes
Policies
Random walk
Upper bounds
title On the efficiency of data collection for multiple Naïve Bayes classifiers
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T14%3A57%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=On%20the%20efficiency%20of%20data%20collection%20for%20multiple%20Na%C3%AFve%20Bayes%20classifiers&rft.jtitle=Artificial%20intelligence&rft.au=Manino,%20Edoardo&rft.date=2019-10&rft.volume=275&rft.spage=356&rft.epage=378&rft.pages=356-378&rft.issn=0004-3702&rft.eissn=1872-7921&rft_id=info:doi/10.1016/j.artint.2019.06.010&rft_dat=%3Cproquest_cross%3E2311933141%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c380t-551a0a2a55961a2e86d764237233cb345fdafc39cf54a21edaf014ddbae2ea843%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2311933141&rft_id=info:pmid/&rfr_iscdi=true