Quantifying Variance in Evaluation Benchmarks

Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully pretrained models, evaluation benchmarks are now also extensively used to decide between various training choices. Despite this widespread usage, we rarely quantify the variance in our evaluation benchmarks, which dictates whether differences in performance are meaningful. Here, we define and measure a range of metrics geared towards measuring variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training. By studying a large number of models -- both openly available and pretrained from scratch -- we provide empirical estimates for a variety of variance metrics, with considerations and recommendations for practitioners. We also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance. We find that simple changes, such as framing choice tasks (like MMLU) as completion tasks, can often reduce variance for smaller scale (~7B) models, while more involved methods inspired from human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance. Overall, our work provides insights into variance in evaluation benchmarks, suggests LM-specific techniques to reduce variance, and more generally encourages practitioners to carefully factor in variance when comparing models.
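One way to make the abstract's metrics concrete is sketched below. This is a minimal, hypothetical sketch rather than the authors' implementation: the `scores` array, its shape, and the synthetic data are invented for illustration, seed variance is taken as the spread of final-checkpoint scores across initialisations, and monotonicity is approximated by the Spearman rank correlation between training step and score (the paper may define these metrics differently).

```python
# Minimal sketch (not the paper's code): estimating seed variance and a
# monotonicity proxy from a hypothetical grid scores[s, t] = benchmark score
# for seed s at training checkpoint t.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_seeds, n_checkpoints = 5, 20

# Synthetic stand-in for real benchmark results: a noisy upward trend.
trend = np.linspace(0.25, 0.60, n_checkpoints)
scores = trend + rng.normal(scale=0.02, size=(n_seeds, n_checkpoints))

# Seed variance: spread of final-checkpoint scores across initialisations.
seed_std = scores[:, -1].std(ddof=1)

# Monotonicity proxy: Spearman rank correlation between checkpoint index and
# score, averaged over seeds (1.0 means the score never decreases in rank).
steps = np.arange(n_checkpoints)
monotonicity = np.mean([spearmanr(steps, s)[0] for s in scores])

print(f"seed std of final score: {seed_std:.4f}")
print(f"mean monotonicity (Spearman rho): {monotonicity:.3f}")
```

Read this way, a seed standard deviation comparable to the score gap between two candidate models is a warning that the comparison may not be meaningful.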

Bibliographic Details
Published in: arXiv.org, 2024-06
Main Authors: Madaan, Lovish; Singh, Aaditya K; Schaeffer, Rylan; Poulton, Andrew; Koyejo, Sanmi; Stenetorp, Pontus; Narang, Sharan; Hupkes, Dieuwke
Format: Article
Language: English
Subjects: Benchmarks; Large language models
Identifier: EISSN 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)