RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, few evaluations of AI R&D capabilities exist, and none are both highly realistic and directly compared against human performance. We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 eight-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts). Qualitatively, we find that modern AI agents possess significant expertise in many ML topics -- e.g. an agent wrote a faster custom Triton kernel than any of our human experts -- and can generate and test solutions over ten times faster than humans, at much lower cost. We open-source the evaluation environments, human expert data, analysis code and agent trajectories to facilitate future research.
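
The best-of-k comparison described above amounts to: split a fixed total time budget into k independent attempts per environment, keep the best score among those attempts, and average across environments. The sketch below is a minimal illustration of that aggregation under assumed inputs; the environment names, score distributions, and helper names (`best_of_k`, `aggregate`) are hypothetical and are not taken from the paper's released analysis code.

```python
import numpy as np

def best_of_k(scores: np.ndarray, k: int, rng: np.random.Generator) -> float:
    """Estimate the expected best score over k attempts in one environment.

    `scores` holds final scores of independent attempts (e.g. 0 for the
    starting solution, 1 for the reference solution). We repeatedly draw
    k attempts without replacement and keep the maximum, mimicking an
    agent that gets k tries and submits its best run.
    """
    k = min(k, len(scores))  # cannot draw more attempts than we observed
    draws = [rng.choice(scores, size=k, replace=False).max() for _ in range(2000)]
    return float(np.mean(draws))

def aggregate(env_scores: dict[str, np.ndarray], per_attempt_hours: float,
              total_hours: float, rng: np.random.Generator) -> float:
    """Average best-of-k score across environments for a total time budget."""
    k = max(1, int(total_hours // per_attempt_hours))  # attempts that fit in the budget
    return float(np.mean([best_of_k(s, k, rng) for s in env_scores.values()]))

# Hypothetical usage: 30-minute agent attempts under a 32-hour total budget per environment.
rng = np.random.default_rng(0)
envs = {"env_a": rng.beta(2, 5, size=64), "env_b": rng.beta(2, 5, size=64)}
print(aggregate(envs, per_attempt_hours=0.5, total_hours=32, rng=rng))
```

Framing the comparison this way lets many fast, cheap attempts (typical of agents) trade off against fewer, longer attempts (typical of human experts) at the same total number of hours.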

Bibliographic Details
Published in: arXiv.org, 2024-11
Main Authors: Wijk, Hjalmar; Lin, Tao; Becker, Joel; Jawhar, Sami; Parikh, Neev; Broadley, Thomas; Chan, Lawrence; Chen, Michael; Clymer, Josh; Dhyani, Jai; Ericheva, Elena; Garcia, Katharyn; Goodrich, Brian; Jurkovic, Nikola; Kinniment, Megan; Lajko, Aron; Nix, Seraphina; Sato, Lucas; Saunders, William; Taran, Maksym; West, Ben; Barnes, Elizabeth
Format: Article
Language: English
Identifier: EISSN 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Subjects: Budgets; Human performance; Performance evaluation; R&D; Research & development; Source code
Online Access: Get full text