
Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages

Evaluations of Large Language Models (LLMs) on knowledge-intensive tasks and factual accuracy often focus on high-resource languages, primarily because datasets for low-resource languages (LRLs) are scarce. In this paper, we present Uhura -- a new benchmark that focuses on two tasks in six typologically-diverse African languages, created via human translation of existing English benchmarks. The first dataset, Uhura-ARC-Easy, is composed of multiple-choice science questions. The second, Uhura-TruthfulQA, is a safety benchmark testing the truthfulness of models on topics including health, law, finance, and politics. We highlight the challenges of creating benchmarks with highly technical content for LRLs and outline mitigation strategies. Our evaluation reveals a significant performance gap between proprietary models such as GPT-4o, o1-preview, and Claude models, and open-source models like Meta's LLaMA and Google's Gemma. Additionally, all models perform better in English than in African languages. These results indicate that LMs struggle with answering scientific questions and are more prone to generating false claims in low-resource African languages. Our findings underscore the necessity for continuous improvement of multilingual LM capabilities in LRL settings to ensure safe and reliable use in real-world contexts. We open-source the Uhura Benchmark and Uhura Platform to foster further research and development in NLP for LRLs.

Bibliographic Details
Published in: arXiv.org, 2024-12
Main Authors: Bayes, Edward; Israel Abebe Azime; Alabi, Jesujoba O.; Kgomo, Jonas; Eloundou, Tyna; Proehl, Elizabeth; Chen, Kai; Khadir, Imaan; Etori, Naome A.; Shamsuddeen Hassan Muhammad; Mpanza, Choice; Thete, Igneciah Pocia; Klakow, Dietrich; Adelani, David Ifeoluwa
Format: Article
Language: English
Subjects: African languages; Benchmarks; Continuous improvement; Datasets; English language; Large language models; Performance evaluation; R&D; Research & development
Publisher: Ithaca: Cornell University Library, arXiv.org
EISSN: 2331-8422
Source: Publicly Available Content Database (ProQuest)
Rights: 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Online Access: Get full text