
Tokenization Stability Index: A Catalyst for Optimizing Transformer Models for Low Resource Languages

Bibliographic Details
Published in: KSII Transactions on Internet and Information Systems, 2024, 18(11), pp. 3109-3128
Main Authors: Venkatesan, N, Arulanand, N
Format: Article
Language: English
Description
Summary: Texts from low-resource languages, including those of the Dravidian language family, exhibit complex morphological structures that substantially challenge large language models. While transformer models have proven effective in numerous applications, these morphological features leave low-resource languages poorly represented by standard tokenizers. To address this problem, we present the Tokenization Stability Index (TSI), a new metric that objectively captures the differences and similarities between tokenization techniques. TSI assesses token stability, the degree of vocabulary integration, multi-token matching, and the ratio of total tokens to unique tokens. We offer a robust mathematical overview, theoretical implications, and case studies showing that TSI provides a reliable framework for improving transformer models on low-resource languages. Custom tokenization techniques were developed and tested on Tamil text inputs. The modified BERT model significantly surpassed the baseline and IndicBERT models, illustrating further potential for refining tokenization frameworks to enhance text processing accuracy for Dravidian and other low-resource languages.
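The abstract names TSI's ingredients (token stability, vocabulary integration, and the total-versus-unique token ratio) but does not give the formula. As a purely hypothetical illustration, ratio-style quantities of that kind could be computed as below; the function names and definitions are assumptions for exposition, not the paper's actual metric.

```python
# Hypothetical sketch of TSI-style component quantities. The paper's exact
# definitions are not stated in this abstract, so these are illustrative
# assumptions, not the published metric.

def token_stability(tokens_a, tokens_b):
    """Jaccard overlap between two tokenizers' token sets for the same text:
    1.0 means identical token inventories, 0.0 means no shared tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)

def unique_token_rate(tokens):
    """Unique tokens divided by total tokens produced for a text; a lower
    value suggests the tokenizer reuses a compact, stable vocabulary."""
    return len(set(tokens)) / len(tokens)

# Two hypothetical segmentations of the same word by different tokenizers:
print(token_stability(["to", "ken", "ize"], ["tok", "en", "ize"]))  # 0.2
print(unique_token_rate(["a", "b", "a", "c"]))                      # 0.75
```

A composite index would then aggregate such components over a corpus; how they are weighted is exactly what the paper's mathematical overview specifies.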
ISSN: 1976-7277
DOI: 10.3837/tiis.2024.11.001