A dictionary based model for Bengali document classification
Published in: | Applied intelligence (Dordrecht, Netherlands), 2023-06, Vol.53 (11), p.14023-14042 |
---|---|
Format: | Article |
Language: | English |
Summary: | Computer-aided documented content analysis is a prominent research area in natural language processing. A realistic implementation of this task is related to the subjectivity of the quantifiable data. One of the most interesting specialisations of this problem is automated document classification, in which a system identifies the category of a document without human intervention. Document classification has to evaluate the heart of the matter of the textual material. Bengali is one of the most widely spoken languages in the world, so a huge number of Bengali documents exist in digital form, and that number is increasing rapidly in the internet age. A document classification method is required to organise and categorise this huge volume of documents rapidly and efficiently. In this paper, a decisive dictionary based model is presented for classifying Bengali text documents. We introduce the concepts of lexiconid, lexiconaffinity, lexiconunicity, and lexiconassociation to acquire the features. The feature set is integrated with different threshold levels. The proposed model is supervised, and the entire dataset has been split into training and testing sets. The model has been validated using the k-fold cross-validation strategy. A significant number of dictionary based parameter values have been estimated for each token present in the text. The text is classified using a new rule based classification algorithm, the predictive lexicon inference (PLI) classifier. The proposed model has been evaluated on five datasets: ParadiseLost, Iliad, Odyssey, Ramayana, and Mahabharata. In addition to document classification, this algorithm enables named entity classification and chronology or description classification. |
---|---|
ISSN: | 0924-669X, 1573-7497 |
DOI: | 10.1007/s10489-022-03955-w |
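
The summary states that the model was validated with k-fold cross-validation. As a generic illustration of that strategy only (not the authors' implementation; the function name `k_fold_indices` and the sample counts are hypothetical), the following Python sketch partitions sample indices into k folds and yields one train/test split per fold:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    Generic k-fold splitting; the function name and parameters are
    illustrative assumptions, not taken from the catalogued paper.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Example: 10 documents split into 5 folds of 2 test documents each.
folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print(f"train={train} test={test}")
```

Each document index appears in exactly one test fold, so every document is used for testing once and for training in the remaining k - 1 rounds. In practice the indices would usually be shuffled first; this sketch keeps them in order for readability.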