Loading…

Metadata Might Make Language Models Better

This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al., 2022 and compare different strategies for inserting temporal, political and ge...

Full description

Saved in:

Bibliographic Details
Published in:	arXiv.org 2022-11
Main Authors:	Beelen, Kaspar, Daniel van Strien
Format:	Article
Language:	English
Subjects:	Evaluation Metadata Rosin
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al., 2022 and compare different strategies for inserting temporal, political and geographical information into a Masked Language Model. After fine-tuning several DistilBERT on enhanced input data, we provide a systematic evaluation of these models on a set of evaluation tasks: pseudo-perplexity, metadata mask-filling and supervised classification. We find that showing relevant metadata to a language model has a beneficial impact and may even produce more robust and fairer models.
ISSN:	2331-8422