
Impact of preprocessing and word embedding on extreme multi-label patent classification tasks

Bibliographic Details
Published in: Applied intelligence (Dordrecht, Netherlands), 2023-02, Vol.53 (4), p.4047-4062
Main Authors: Jung, Guik; Shin, Junghoon; Lee, Sangjun
Format: Article
Language: English
Description
Summary: Patent classification is a necessary step in processing patent data efficiently and giving users convenient access to information. To address the present inefficiency of patent classification, many algorithms and deep learning-based techniques have been developed. However, few studies have examined the impact of preprocessing, word embedding, and data fields on patent classification. In this study, we examined three different scenarios to evaluate how generalizing words via stemming affects classification performance, taking the characteristics of patent data into account. We conducted comparative experiments between pre-trained word embedding models and embedding models trained on a newly created patent dataset. Detailed descriptions of the preprocessing and word embedding techniques are provided. We found that the continuous bag-of-words (CBoW) embedding model trained on the patent dataset best reflected the words contained in the patent documents, and that the hierarchical International Patent Classification (IPC), which is used in more than 100 countries, had the biggest impact on classification performance. Furthermore, the relationship between the number of embedded words and the classification performance was investigated. Finally, we performed classification experiments using different data fields and classification models. When the IPC was incorporated, the classification performance was substantially enhanced, and a high classification accuracy was achieved when a classification model that considered the relationship between labels and words was employed. We used the most common evaluation indices, precision at N (P@N) and normalized discounted cumulative gain at N (NDCG@N), to compare the performance of all models. Using the model with the best performance as determined via the aforementioned experiments, scores of P@1 = 71.896%, P@3 = 36.697%, and P@5 = 24.301% were obtained using two simple ensembles of LAHA models.
We provide an in-depth investigation into patent classification methods that elucidates the effects of various parameters on the patent classification process. The results of this study will serve to improve the efficiency of patent research and classification tasks.
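For readers unfamiliar with the two metrics named in the summary, the following is a minimal Python sketch of P@N and NDCG@N under the convention commonly used in extreme multi-label classification: binary relevance, with P@N dividing by N. It is not the authors' evaluation code; the function names, the IPC-style label codes, and the example ranking are hypothetical. Under this convention a document with a single true label can score at most 1/3 on P@3, so P@N values typically fall as N grows.

```python
import math


def precision_at_k(ranked_labels, true_labels, k):
    """Fraction of the top-k predicted labels that are in the true label set."""
    hits = sum(1 for label in ranked_labels[:k] if label in true_labels)
    return hits / k


def ndcg_at_k(ranked_labels, true_labels, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, label in enumerate(ranked_labels[:k])
        if label in true_labels
    )
    # The ideal ranking places all true labels (up to k of them) first.
    ideal_hits = min(k, len(true_labels))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: one document, labels ranked by model score.
ranked = ["G06F", "H04L", "G06N", "A61K", "B60W"]
truth = {"G06F", "G06N"}

p1 = precision_at_k(ranked, truth, 1)
p3 = precision_at_k(ranked, truth, 3)
p5 = precision_at_k(ranked, truth, 5)
n3 = ndcg_at_k(ranked, truth, 3)
print(p1, p3, p5, n3)
```

In practice both metrics are averaged over all test documents; the sketch scores a single document only.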
ISSN: 0924-669X, 1573-7497
DOI: 10.1007/s10489-022-03655-5