Keyword extraction as sequence labeling with classification algorithms

Bibliographic Details
Published in:	Neural computing & applications 2023-02, Vol.35 (4), p.3413-3422
Main Authors:	Kılıç Ünlü, Hüma; Çetin, Aydın
Format:	Article
Language:	English
Subjects:	Algorithms; Artificial Intelligence; Classification; Clustering; Computational Biology/Bioinformatics; Computational Science and Engineering; Computer Science; Data Mining and Knowledge Discovery; Datasets; Image Processing and Computer Vision; Information retrieval; Keywords; Labeling; Machine learning; Multilayer perceptrons; Multilayers; Original Article; Polynomials; Probability and Statistics in Computer Science; Support vector machines

Description
Summary:	Keyword extraction is one of the main problems in clustering and linking textual content. Several machine learning approaches have been proposed in the literature for keyword and keyphrase extraction. However, state-of-the-art performance is still below expectations. In this paper, we propose a novel hybrid keyword extraction model, HybridKEM. The proposed model addresses the keyword extraction problem as a sequence labelling task. Naive Bayes (NB), Polynomial Regression (PR), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and Random Forest (RF) classification algorithms were trained separately in the Token Classification module of the model. Token classification was performed using text, graphic, embedding, and set features. The performance of the model was evaluated on the Inspec, SemEval-2017, and 500N-KPCrowd datasets, which are widely used in the literature, and on two newly collected datasets, TRDizinEn and DergiParkEn. The model achieved an average F1-score of 0.664 across all datasets. The highest F1-score (0.74) was obtained on the TRDizinEn dataset.
ISSN:	0941-0643, 1433-3058
DOI:	10.1007/s00521-022-07906-x