Tokenization Stability Index: A Catalyst for Optimizing Transformer Models for Low Resource Languages
Published in: KSII Transactions on Internet and Information Systems, 2024, 18(11), pp. 3109-3128
Format: Article
Language: English
Summary: Texts from low-resource languages, including those from the Dravidian language family, are characterized by complex morphological structures that can substantially challenge large language models. While transformer models have proven effective in numerous applications, these morphological features leave low-resource languages underrepresented. To address this problem, we present the Tokenization Stability Index (TSI), a new metric that objectively captures the differences and similarities between tokenization techniques. TSI assesses token stability, the degree of vocabulary integration, multi-token matching, and the overall rate of all tokens versus unique tokens. We offer a robust mathematical overview, theoretical implications, and case studies to show that TSI creates a reliable framework for improving transformer models for low-resource languages. Custom tokenization techniques were developed and tested on Tamil text inputs. The modified BERT model significantly surpassed the baseline and IndicBERT models, illustrating further potential for refining tokenization frameworks to enhance text processing accuracy on Dravidian and other low-resource languages.
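The record does not give TSI's actual formula, but two of the ingredients the summary names, the unique-token versus all-token rate and the degree of word fragmentation, can be sketched for any tokenizer. The bigram tokenizer and the score names below are purely hypothetical stand-ins for illustration, not the paper's method:

```python
# Illustrative sketch only: computes two tokenization statistics the abstract
# mentions (unique/total token rate, avg. tokens per word). Not the TSI formula.

def token_stats(tokenize, words):
    """Summarize how a tokenizer fragments a list of words."""
    all_tokens = [t for w in words for t in tokenize(w)]
    unique_rate = len(set(all_tokens)) / len(all_tokens)  # unique vs. all tokens
    fragmentation = len(all_tokens) / len(words)          # avg. tokens per word
    return unique_rate, fragmentation

# Hypothetical character-bigram tokenizer standing in for a real subword model.
def bigram_tokenize(word):
    return [word[i:i + 2] for i in range(0, len(word), 2)]

words = ["tokenization", "tokenizers", "stability"]
u, f = token_stats(bigram_tokenize, words)
# Lower unique_rate means tokens are reused across words; lower fragmentation
# means words survive tokenization more intact -- both directions a
# stability-oriented metric would reward.
```

In this spirit, comparing two tokenizers on the same Tamil corpus would reduce to comparing their statistic vectors, which is the kind of objective side-by-side comparison the summary attributes to TSI.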
ISSN: 1976-7277
DOI: 10.3837/tiis.2024.11.001