Loading…

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Efficient methods for storing and querying are critical for scaling high-order -gram language models to large corpora. We propose a language model based on , a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabil...

Full description

Saved in:
Bibliographic Details
Published in:Transactions of the Association for Computational Linguistics 2021-03, Vol.4, p.477-490
Main Authors: Shareghi, Ehsan, Petri, Matthias, Haffari, Gholamreza, Cohn, Trevor
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Efficient methods for storing and querying are critical for scaling high-order -gram language models to large corpora. We propose a language model based on , a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500Ă—, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).
ISSN:2307-387X
2307-387X
DOI:10.1162/tacl_a_00112