Loading…

Language Models as Science Tutors

NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To...

Full description

Saved in:
Bibliographic Details
Published in:arXiv.org 2024-07
Main Authors: Chevalier, Alexis, Geng, Jiayi, Wettig, Alexander, Chen, Howard, Mizera, Sebastian, Annala, Toni, Max Jameson Aragon, Arturo RodrĂ­guez Fanlo, Frieder, Simon, Machado, Simon, Prabhakar, Akshara, Thieu, Ellie, Wang, Jiachen T, Wang, Zirui, Wu, Xindi, Xia, Mengzhou, Xia, Wenhan, Yu, Jiatong, Jun-Jie, Zhu, Ren, Zhiyong Jason, Arora, Sanjeev, Chen, Danqi
Format: Article
Language:English
Subjects:
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.
ISSN:2331-8422