Advancing Author Gender Identification in Modern Standard Arabic with Innovative Deep Learning and Textual Feature Techniques

Bibliographic Details
Published in: Information (Basel) 2024-12, Vol.15 (12), p.779
Main Authors: Himdi, Hanen, Shaalan, Khaled
Format: Article
Language: English
Description
Summary: Author Gender Identification (AGI) is an extensively studied subject owing to its significance in several domains, such as security and marketing. Recognizing an author’s gender may help marketers segment consumers more effectively and craft tailored content that aligns with a gender’s preferences. In cybersecurity, identifying an author’s gender might aid in detecting phishing attempts in which hackers imitate individuals of a specific gender. Although studies in Arabic have mostly concentrated on written dialects, such as those found in tweets, few studies address Modern Standard Arabic (MSA) in journalistic genres. To address the AGI problem, this work combines natural language processing with deep learning methods. First, we propose a large dataset of 8k MSA articles composed of various columns sourced from news platforms, each labeled with its author’s gender. Second, we extract and analyze textual features, focusing on semantics and syntax, that may help identify gender-related cues in the authors’ writing. Third, we probe several deep learning models, namely Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), and Bidirectional Encoder Representations from Transformers (BERT). Finally, we propose a novel enhanced BERT model that incorporates gender-specific textual features. Across various experiments, the results underscore the potential of both BERT and the textual features: the enhanced BERT model reaches 91% accuracy, while the deep learning models range from 80% to 90% accuracy. We also apply these features to AGI in informal, dialectal text, where the enhanced BERT model reaches 68.7% accuracy. This demonstrates that these gender-specific textual features are conducive to AGI across MSA and dialectal texts.
ISSN: 2078-2489
DOI: 10.3390/info15120779