A dictionary based model for Bengali document classification

Bibliographic Details
Published in: Applied intelligence (Dordrecht, Netherlands), 2023-06, Vol.53 (11), p.14023-14042
Main Authors: Das Dawn, Debapratim; Khan, Abhinandan; Shaikh, Soharab Hossain; Pal, Rajat Kumar
Format: Article
Language:English
Description
Summary: Computer-aided documented content analysis is a prominent research area in natural language processing. A realistic implementation of this task is related to the subjectivity of the quantifiable data. One of the most interesting specialisations of this problem is automated document classification, in which a system identifies the category of a document without human intervention. Document classification has to evaluate the heart of the matter of the textual material. Bengali is one of the most-spoken languages in the world, so a huge number of Bengali documents exist in digital form, and that number is increasing rapidly in the age of the internet. A document classification method is required to organise and categorise this huge volume of documents rapidly and efficiently. In this paper, a decisive dictionary-based model is presented for the classification of documents in Bengali text. We introduce the concepts of lexiconid, lexiconaffinity, lexiconunicity, and lexiconassociation to acquire the features. The feature set is integrated with different levels of threshold. The proposed model is supervised, and the entire dataset has been split into training and testing sets. The proposed model has been validated using the k-fold cross-validation strategy. A significant number of dictionary-based parameter values have been estimated for each token present in the text. The text is classified using a new rule-based classification algorithm, the predictive lexicon inference (PLI) classifier. The proposed model has been evaluated on five datasets: ParadiseLost, Iliad, Odyssey, Ramayana, and Mahabharata. In addition to document classification, this algorithm enables named entity classification and chronology or description classification.
ISSN: 0924-669X
EISSN: 1573-7497
DOI: 10.1007/s10489-022-03955-w