Offensive language identification in Dravidian languages using MPNet and CNN

Bibliographic Details
Published in:	International journal of information management data insights 2023-04, Vol.3 (1), p.100151, Article 100151
Main Authors:	Chakravarthi, Bharathi Raja; Jagadeeshan, Manoj Balaji; Palanikumar, Vasanth; Priyadharshini, Ruba
Format:	Article
Language:	English
Subjects:	CNN; Code-mixing; Deep learning; Dravidian languages; MPNet; Offensive language identification

Description
Summary:	• A thorough review of the techniques, algorithms, datasets, and tasks for offensive language detection in Dravidian languages.
• A novel MPNet and CNN fusion technique for offensive language detection in low-resource Dravidian languages.
• An extensive evaluation on benchmark datasets with positive results.

Social media has effectively replaced traditional forms of communication and marketing. Because these platforms allow the free expression of ideas and facts through text, images, and videos, there is a significant need to screen this content to safeguard people and organisations from objectionable information directed at them. Our work aims to categorise code-mixed social media comments and posts in Tamil, Malayalam, and Kannada as offensive or not offensive at different levels. We present a multilingual MPNet and CNN fusion model for detecting offensive language content directed at an individual (or group) in low-resource Dravidian languages at different levels. Our model can handle code-mixed data, such as text that mixes Tamil and Latin scripts. The model was validated on the datasets, achieving weighted average F1-scores of 0.85, 0.98, and 0.76 and outperforming the baseline models EWDT and EWODT by 0.02, 0.02, and 0.04 for Tamil, Malayalam, and Kannada, respectively.
ISSN:	2667-0968
DOI:	10.1016/j.jjimei.2022.100151
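
Note on the reported metric: the summary gives results as weighted average F1-scores. As a minimal sketch of what that metric usually means (per-class F1 averaged with each class weighted by its number of true instances), the following Python function may help. The function name `weighted_f1` and the labels "off" and "not" are hypothetical, chosen here for illustration; the record does not say how the authors computed their scores.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (illustrative sketch)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0

    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)

        # Each class contributes in proportion to its share of the true labels.
        score += (n / total) * f1

    return score


# Hypothetical labels: "off" = offensive, "not" = not offensive.
y_true = ["off", "off", "not", "not", "not", "not"]
y_pred = ["off", "not", "not", "not", "not", "off"]
example_score = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, a model can reach a high weighted F1 on an imbalanced dataset while doing poorly on a rare class, which is worth keeping in mind when comparing scores across the three languages.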