ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding

Bibliographic Details
Published in: arXiv.org, 2024-11
Main Authors: Israel Abebe Azime, Atnafu Lambebo Tonja, Tadesse Destaw Belay, Yonas Chanie, Bontu Fufa Balcha, Negasi Haile Abadi, Henok Biadglign Ademtew, Mulubrhan Abebe Nerea, Debela Desalegn Yadeta, Derartu Dagne Geremew, Assefa Atsbiha Tesfau, Philipp Slusallek, Thamar Solorio, Dietrich Klakow
Format: Article
Language: English
Description
Summary: With the rapid development of evaluation datasets for assessing LLMs' understanding across a wide range of subjects and domains, identifying a suitable language understanding benchmark has become increasingly challenging. In this work, we explore LLM evaluation challenges for low-resource language understanding and introduce ProverbEval, an LLM evaluation benchmark for low-resource languages that uses proverbs to probe language understanding in culture-specific scenarios. We benchmark various LLMs and explore factors that introduce variability into the benchmarking process. We observed performance variances of up to 50% depending on the order in which answer choices were presented in multiple-choice tasks, and native-language proverb descriptions significantly improved performance on tasks such as proverb generation. Additionally, monolingual evaluations consistently outperformed their cross-lingual counterparts. We argue that special attention must be given to the order of choices, the choice of prompt language, task variability, and generation tasks when creating LLM evaluation benchmarks.
ISSN: 2331-8422
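
The choice-order sensitivity reported in the summary can be made concrete with a small evaluation harness. The sketch below is illustrative only and is not the authors' code: it assumes a hypothetical query_model function standing in for an LLM call, re-scores the same multiple-choice items under every permutation of the option order, and reports the resulting accuracy spread.

from itertools import permutations

# Hypothetical stand-in for an LLM call. A real harness would prompt the
# model with the question and lettered options and parse its answer; here
# the toy "model" always picks the first option, mimicking positional bias.
def query_model(question, options):
    return 0  # index of the option the model selects

def accuracy_under_ordering(items, order):
    # Score every item with its options rearranged according to `order`.
    correct = 0
    for question, options, gold_idx in items:
        reordered = [options[i] for i in order]
        picked = query_model(question, reordered)
        if reordered[picked] == options[gold_idx]:
            correct += 1
    return correct / len(items)

# Tiny illustrative item set: (question, options, index of the correct option).
items = [
    ("Which proverb warns against haste?",
     ["A stitch in time saves nine",
      "Many hands make light work",
      "Haste makes waste",
      "Still waters run deep"],
     2),
]

scores = [accuracy_under_ordering(items, order) for order in permutations(range(4))]
print(f"accuracy spread across option orderings: {min(scores):.2f} to {max(scores):.2f}")

Under this toy positional bias, accuracy swings between 0 and 1 depending solely on where the correct option lands, which is the kind of ordering effect the benchmark's variance analysis is designed to expose.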