Using score distributions to compare statistical significance tests for information retrieval evaluation

Statistical significance tests can provide evidence that the observed difference in performance between 2 methods is not due to chance. In information retrieval (IR), some studies have examined the validity and suitability of such tests for comparing search systems. We argue here that current method...

Bibliographic Details
Published in:	Journal of the American Society for Information Science and Technology 2020-01, Vol.71 (1), p.98-113
Main Authors:	Parapar, Javier; Losada, David E.; Presedo‐Quindimil, Manuel A.; Barreiro, Alvaro
Format:	Article
Language:	English
Subjects:	Computer simulation; Information retrieval; Measurement Techniques; Null hypothesis; Permutations; Regression analysis; Reliability analysis; Searching; Semiotics; Statistical analysis; Statistical significance; Statistical tests; Test validity and reliability; Truth

cited_by	cdi_FETCH-LOGICAL-c4213-f5f81033eafa9955c7886d5cb89ec0f8bdb57ecaffa42b5b1a8cc7caa65259c43
cites	cdi_FETCH-LOGICAL-c4213-f5f81033eafa9955c7886d5cb89ec0f8bdb57ecaffa42b5b1a8cc7caa65259c43
container_end_page	113
container_issue	1
container_start_page	98
container_title	Journal of the American Society for Information Science and Technology
container_volume	71
creator	Parapar, Javier; Losada, David E.; Presedo‐Quindimil, Manuel A.; Barreiro, Alvaro
description	Statistical significance tests can provide evidence that the observed difference in performance between 2 methods is not due to chance. In information retrieval (IR), some studies have examined the validity and suitability of such tests for comparing search systems. We argue here that current methods for assessing the reliability of statistical tests suffer from some methodological weaknesses, and we propose a novel way to study significance tests for retrieval evaluation. Using Score Distributions, we model the output of multiple search systems, produce simulated search results from such models, and compare them using various significance tests. A key strength of this approach is that we assess statistical tests under perfect knowledge about the truth or falseness of the null hypothesis. This new method for studying the power of significance tests in IR evaluation is formal and innovative. Following this type of analysis, we found that both the sign test and Wilcoxon signed test have more power than the permutation test and the t‐test. The sign test and Wilcoxon signed test also have good behavior in terms of type I errors. The bootstrap test shows few type I errors, but it has less power than the other methods tested.
doi_str_mv	10.1002/asi.24203
format	article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1002_asi_24203</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2321275934</sourcerecordid><originalsourceid>FETCH-LOGICAL-c4213-f5f81033eafa9955c7886d5cb89ec0f8bdb57ecaffa42b5b1a8cc7caa65259c43</originalsourceid><addsrcrecordid>eNp1kE1LAzEQhoMoWGoP_oOAJw_b5mPTzR5L8aMgeNCeQzZNakq7qZms0n9vtivevMwMM8-8M7wI3VIypYSwmQY_ZSUj_AKNGOekoPOSX_7VXFyjCcCOEEJJLQWjI_SxBt9uMZgQLd54SNE3XfKhBZwCNuFw1HkASac880bvMfht610uW2NxspAAuxCxb3M86H4VR5tl7FeG-9Cdmzfoyuk92MlvHqP148P78rl4eX1aLRcvhSkZ5YUTTlLCudVO17UQppJyvhGmkbU1xMlm04jKGu2cLlkjGqqlMZXRei6YqE3Jx-hu0D3G8Nnl99QudLHNJxXjjLJK1Lyn7gfKxAAQrVPH6A86nhQlqvdSZS_V2cvMzgb22-_t6X9QLd5Ww8YPQ3t4rA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2321275934</pqid></control><display><type>article</type><title>Using score distributions to compare statistical significance tests for information retrieval evaluation</title><source>Business Source Ultimate【Trial: -2024/12/31】【Remote access available】</source><source>Library & Information Science Abstracts (LISA)</source><source>Wiley-Blackwell Read & Publish Collection</source><creator>Parapar, Javier ; Losada, David E. ; Presedo‐Quindimil, Manuel A. ; Barreiro, Alvaro</creator><creatorcontrib>Parapar, Javier ; Losada, David E. ; Presedo‐Quindimil, Manuel A. ; Barreiro, Alvaro</creatorcontrib><description>Statistical significance tests can provide evidence that the observed difference in performance between 2 methods is not due to chance. In information retrieval (IR), some studies have examined the validity and suitability of such tests for comparing search systems. We argue here that current methods for assessing the reliability of statistical tests suffer from some methodological weaknesses, and we propose a novel way to study significance tests for retrieval evaluation. 
Using Score Distributions, we model the output of multiple search systems, produce simulated search results from such models, and compare them using various significance tests. A key strength of this approach is that we assess statistical tests under perfect knowledge about the truth or falseness of the null hypothesis. This new method for studying the power of significance tests in IR evaluation is formal and innovative. Following this type of analysis, we found that both the sign test and Wilcoxon signed test have more power than the permutation test and the t‐test. The sign test and Wilcoxon signed test also have good behavior in terms of type I errors. The bootstrap test shows few type I errors, but it has less power than the other methods tested.</description><identifier>ISSN: 2330-1635</identifier><identifier>EISSN: 2330-1643</identifier><identifier>DOI: 10.1002/asi.24203</identifier><language>eng</language><publisher>Hoboken, USA: John Wiley & Sons, Inc</publisher><subject>Computer simulation ; Information retrieval ; Measurement Techniques ; Null hypothesis ; Permutations ; Regression analysis ; Reliability analysis ; Searching ; Semiotics ; Statistical analysis ; Statistical significance ; Statistical tests ; Test validity and reliability ; Truth</subject><ispartof>Journal of the American Society for Information Science and Technology, 2020-01, Vol.71 (1), p.98-113</ispartof><rights>2019 ASIS&T</rights><rights>2020 
ASIS&T</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c4213-f5f81033eafa9955c7886d5cb89ec0f8bdb57ecaffa42b5b1a8cc7caa65259c43</citedby><cites>FETCH-LOGICAL-c4213-f5f81033eafa9955c7886d5cb89ec0f8bdb57ecaffa42b5b1a8cc7caa65259c43</cites><orcidid>0000-0001-8823-7501</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925,34135</link.rule.ids></links><search><creatorcontrib>Parapar, Javier</creatorcontrib><creatorcontrib>Losada, David E.</creatorcontrib><creatorcontrib>Presedo‐Quindimil, Manuel A.</creatorcontrib><creatorcontrib>Barreiro, Alvaro</creatorcontrib><title>Using score distributions to compare statistical significance tests for information retrieval evaluation</title><title>Journal of the American Society for Information Science and Technology</title><description>Statistical significance tests can provide evidence that the observed difference in performance between 2 methods is not due to chance. In information retrieval (IR), some studies have examined the validity and suitability of such tests for comparing search systems. We argue here that current methods for assessing the reliability of statistical tests suffer from some methodological weaknesses, and we propose a novel way to study significance tests for retrieval evaluation. Using Score Distributions, we model the output of multiple search systems, produce simulated search results from such models, and compare them using various significance tests. A key strength of this approach is that we assess statistical tests under perfect knowledge about the truth or falseness of the null hypothesis. This new method for studying the power of significance tests in IR evaluation is formal and innovative. 
Following this type of analysis, we found that both the sign test and Wilcoxon signed test have more power than the permutation test and the t‐test. The sign test and Wilcoxon signed test also have good behavior in terms of type I errors. The bootstrap test shows few type I errors, but it has less power than the other methods tested.</description><subject>Computer simulation</subject><subject>Information retrieval</subject><subject>Measurement Techniques</subject><subject>Null hypothesis</subject><subject>Permutations</subject><subject>Regression analysis</subject><subject>Reliability analysis</subject><subject>Searching</subject><subject>Semiotics</subject><subject>Statistical analysis</subject><subject>Statistical significance</subject><subject>Statistical tests</subject><subject>Test validity and reliability</subject><subject>Truth</subject><issn>2330-1635</issn><issn>2330-1643</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>F2A</sourceid><recordid>eNp1kE1LAzEQhoMoWGoP_oOAJw_b5mPTzR5L8aMgeNCeQzZNakq7qZms0n9vtivevMwMM8-8M7wI3VIypYSwmQY_ZSUj_AKNGOekoPOSX_7VXFyjCcCOEEJJLQWjI_SxBt9uMZgQLd54SNE3XfKhBZwCNuFw1HkASac880bvMfht610uW2NxspAAuxCxb3M86H4VR5tl7FeG-9Cdmzfoyuk92MlvHqP148P78rl4eX1aLRcvhSkZ5YUTTlLCudVO17UQppJyvhGmkbU1xMlm04jKGu2cLlkjGqqlMZXRei6YqE3Jx-hu0D3G8Nnl99QudLHNJxXjjLJK1Lyn7gfKxAAQrVPH6A86nhQlqvdSZS_V2cvMzgb22-_t6X9QLd5Ww8YPQ3t4rA</recordid><startdate>202001</startdate><enddate>202001</enddate><creator>Parapar, Javier</creator><creator>Losada, David E.</creator><creator>Presedo‐Quindimil, Manuel A.</creator><creator>Barreiro, Alvaro</creator><general>John Wiley & Sons, Inc</general><general>Wiley Periodicals 
Inc</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>E3H</scope><scope>F2A</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0001-8823-7501</orcidid></search><sort><creationdate>202001</creationdate><title>Using score distributions to compare statistical significance tests for information retrieval evaluation</title><author>Parapar, Javier ; Losada, David E. ; Presedo‐Quindimil, Manuel A. ; Barreiro, Alvaro</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c4213-f5f81033eafa9955c7886d5cb89ec0f8bdb57ecaffa42b5b1a8cc7caa65259c43</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Computer simulation</topic><topic>Information retrieval</topic><topic>Measurement Techniques</topic><topic>Null hypothesis</topic><topic>Permutations</topic><topic>Regression analysis</topic><topic>Reliability analysis</topic><topic>Searching</topic><topic>Semiotics</topic><topic>Statistical analysis</topic><topic>Statistical significance</topic><topic>Statistical tests</topic><topic>Test validity and reliability</topic><topic>Truth</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Parapar, Javier</creatorcontrib><creatorcontrib>Losada, David E.</creatorcontrib><creatorcontrib>Presedo‐Quindimil, Manuel A.</creatorcontrib><creatorcontrib>Barreiro, Alvaro</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>Library & Information Sciences Abstracts (LISA)</collection><collection>Library & Information Science Abstracts (LISA)</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems 
Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Journal of the American Society for Information Science and Technology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Parapar, Javier</au><au>Losada, David E.</au><au>Presedo‐Quindimil, Manuel A.</au><au>Barreiro, Alvaro</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Using score distributions to compare statistical significance tests for information retrieval evaluation</atitle><jtitle>Journal of the American Society for Information Science and Technology</jtitle><date>2020-01</date><risdate>2020</risdate><volume>71</volume><issue>1</issue><spage>98</spage><epage>113</epage><pages>98-113</pages><issn>2330-1635</issn><eissn>2330-1643</eissn><abstract>Statistical significance tests can provide evidence that the observed difference in performance between 2 methods is not due to chance. In information retrieval (IR), some studies have examined the validity and suitability of such tests for comparing search systems. We argue here that current methods for assessing the reliability of statistical tests suffer from some methodological weaknesses, and we propose a novel way to study significance tests for retrieval evaluation. Using Score Distributions, we model the output of multiple search systems, produce simulated search results from such models, and compare them using various significance tests. A key strength of this approach is that we assess statistical tests under perfect knowledge about the truth or falseness of the null hypothesis. This new method for studying the power of significance tests in IR evaluation is formal and innovative. Following this type of analysis, we found that both the sign test and Wilcoxon signed test have more power than the permutation test and the t‐test. 
The sign test and Wilcoxon signed test also have good behavior in terms of type I errors. The bootstrap test shows few type I errors, but it has less power than the other methods tested.</abstract><cop>Hoboken, USA</cop><pub>John Wiley & Sons, Inc</pub><doi>10.1002/asi.24203</doi><tpages>16</tpages><orcidid>https://orcid.org/0000-0001-8823-7501</orcidid></addata></record>
fulltext	fulltext
identifier	ISSN: 2330-1635
ispartof	Journal of the American Society for Information Science and Technology, 2020-01, Vol.71 (1), p.98-113
issn	2330-1635 2330-1643
language	eng
recordid	cdi_crossref_primary_10_1002_asi_24203
source	Business Source Ultimate; Library & Information Science Abstracts (LISA); Wiley-Blackwell Read & Publish Collection
subjects	Computer simulation; Information retrieval; Measurement Techniques; Null hypothesis; Permutations; Regression analysis; Reliability analysis; Searching; Semiotics; Statistical analysis; Statistical significance; Statistical tests; Test validity and reliability; Truth
title	Using score distributions to compare statistical significance tests for information retrieval evaluation
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T12%3A58%3A49IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Using%20score%20distributions%20to%20compare%20statistical%20significance%20tests%20for%20information%20retrieval%20evaluation&rft.jtitle=Journal%20of%20the%20American%20Society%20for%20Information%20Science%20and%20Technology&rft.au=Parapar,%20Javier&rft.date=2020-01&rft.volume=71&rft.issue=1&rft.spage=98&rft.epage=113&rft.pages=98-113&rft.issn=2330-1635&rft.eissn=2330-1643&rft_id=info:doi/10.1002/asi.24203&rft_dat=%3Cproquest_cross%3E2321275934%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c4213-f5f81033eafa9955c7886d5cb89ec0f8bdb57ecaffa42b5b1a8cc7caa65259c43%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2321275934&rft_id=info:pmid/&rfr_iscdi=true