Six-Granularity Based Chinese Short Text Classification

Bibliographic Details
Published in: IEEE Access 2023-01, Vol. 11, p. 1-1
Main Authors: Sun, Xinjie, Liu, ZhiFang, Huo, Xingying
Format: Article
Language:English
Description
Summary: Short text classification is an important task in Natural Language Processing (NLP). Classification results for Chinese short text are often unsatisfactory because such text is sparse. Most previous classification models for Chinese short text are based on words or characters; since Chinese radicals can also carry meaning on their own, this paper uses words, characters, and radicals together to build a Chinese short text classification model, which alleviates the data sparsity problem of short text. In addition, when segmenting sentences into words, jieba can lose key information while n-grams can generate noise words, so both jieba and n-grams are used to construct a six-granularity (i.e., word-jieba, word-jieba-radical, word-ngram, word-ngram-radical, character, and character-radical) based Chinese short text classification (SGCSTC) model. Because the six granularities influence the classification result to different degrees, each granularity is assigned its own weight, and these weights are updated automatically during back-propagation using a cross-entropy loss. On the THUCNews-S dataset, SGCSTC achieves classification Accuracy, Precision, Recall, and F1 of 93.36%, 94.47%, 94.15%, and 94.31%, respectively; on the CNT dataset, the corresponding figures are 92.67%, 92.38%, 93.15%, and 92.76%. Multiple comparative experiments on the THUCNews-S and CNT datasets show that SGCSTC outperforms state-of-the-art text classification models.
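The fusion step described in the summary — six granularity encodings combined through per-granularity weights that are learned during back-propagation — can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the feature dimension, the random features, and the use of a softmax to normalize the learnable weights are all assumptions.

```python
import numpy as np

# Hypothetical sketch of the SGCSTC fusion step: one encoded vector per
# granularity is combined via softmax-normalized learnable weights.
rng = np.random.default_rng(0)
d = 8  # feature dimension (illustrative assumption)

granularities = [
    "word-jieba", "word-jieba-radical", "word-ngram",
    "word-ngram-radical", "character", "character-radical",
]

# Stand-in encodings for a single short text (random here; in the model
# these would come from the six granularity-specific encoders).
feats = {g: rng.standard_normal(d) for g in granularities}

# One learnable scalar weight per granularity; the paper updates these by
# back-propagation with a cross-entropy loss. Initialized uniformly here.
w = np.zeros(len(granularities))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

alpha = softmax(w)  # normalized granularity weights, sum to 1
fused = sum(a * feats[g] for a, g in zip(alpha, granularities))
```

With the uniform initialization above, each granularity contributes equally (weight 1/6); training would then shift `alpha` toward the granularities that help classification most.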
ISSN: 2169-3536
DOI:10.1109/ACCESS.2023.3265712