A Comprehensive Understanding of Code-Mixed Language Semantics Using Hierarchical Transformer
Published in: IEEE Transactions on Computational Social Systems, 2024-06, Vol. 11 (3), pp. 4139-4148
Main Authors:
Format: Article
Language: English
Summary: As a popular mode of text-based communication in multilingual communities, code mixing in online social media has become an important subject of study. Learning the semantics and morphology of code-mixed language remains a key challenge due to the scarcity of data and the unavailability of robust, language-invariant representation learning techniques. Any morphologically rich language can benefit from character-, subword-, and word-level embeddings, which aid in learning meaningful correlations. In this article, we explore a hierarchical transformer (HIT)-based architecture to learn the semantics of code-mixed languages. HIT consists of multiheaded self-attention (MSA) and outer product attention components to simultaneously comprehend the semantic and syntactic structures of code-mixed texts. We evaluate the proposed method across six Indian languages (Bengali, Gujarati, Hindi, Tamil, Telugu, and Malayalam) and Spanish on nine tasks over 17 datasets. The HIT model outperforms state-of-the-art code-mixed representation learning and multilingual language models on 13 datasets across eight tasks. We further demonstrate the generalizability of the HIT architecture using masked language modeling (MLM)-based pretraining, zero-shot learning (ZSL), and transfer learning approaches. Our empirical results show that the pretraining objectives significantly improve the performance of downstream tasks.
ISSN: 2329-924X, 2373-7476
DOI: 10.1109/TCSS.2024.3360378
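The summary above describes HIT only at a high level: multiheaded self-attention (MSA) paired with an outer product attention component, applied over character-, subword-, and word-level representations. Below is a minimal PyTorch sketch of that fused-attention idea, not the authors' implementation: the weighted Hadamard scoring in OuterProductAttentionSketch, the averaging fusion in FusedAttentionBlockSketch, and all layer sizes are assumptions made for illustration; consult the paper at the DOI above for the actual HIT design.

```python
# Minimal sketch, assuming PyTorch and an invented formulation of "outer
# product attention". All class names, the scoring rule, the fusion scheme,
# and the dimensions are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn


class OuterProductAttentionSketch(nn.Module):
    """Illustrative stand-in: token-pair scores come from a learned weighting
    of the element-wise (outer-product-style) query-key interaction rather
    than a plain dot product."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.score_w = nn.Parameter(torch.full((dim,), dim ** -0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # score_ij = w · (q_i ⊙ k_j): per-pair element-wise interaction.
        scores = torch.einsum("bid,bjd,d->bij", q, k, self.score_w)
        return scores.softmax(dim=-1) @ v


class FusedAttentionBlockSketch(nn.Module):
    """One encoder block fusing standard MSA with the sketch attention above,
    followed by a feed-forward layer with residual connections and layer norm.
    Per the abstract, HIT applies such blocks hierarchically over character-,
    subword-, and word-level embeddings; only a single block is shown here."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.opa = OuterProductAttentionSketch(dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        msa_out, _ = self.msa(x, x, x)
        fused = 0.5 * (msa_out + self.opa(x))  # simple average fusion (assumption)
        x = self.norm1(x + fused)
        return self.norm2(x + self.ffn(x))


if __name__ == "__main__":
    # Toy usage: a batch of 2 sequences of 16 subword embeddings of size 64.
    block = FusedAttentionBlockSketch(dim=64)
    print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```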