Generalization Ability of CNN-Based Morpheme Segmentation

Bibliographic Details
Main Authors:	Garipov, Timur, Morozov, Dmitry, Glazkova, Anna
Format:	Conference Proceeding
Language:	English
Subjects:	Labeling; Machine learning algorithms; Robustness; Training; Training data; Transfer learning; Transformers

Description
Summary:	Determining the morphemic structure of a word is particularly relevant to teaching the Russian language. Automatic evaluation of this structure is complicated by disagreement among linguists on some complex cases. Nevertheless, several papers published in recent years apply various machine learning methods to this problem in practical applications. The authors of [1] propose an architecture based on convolutional neural networks for Russian lemmas, and the proposed algorithm has shown quality sufficient for solving various applied problems. However, the ability of this algorithm to generalize to words with previously unseen morphemes remains unclear. In this paper, we find that the quality of the algorithm drops by 16-18% in terms of word accuracy when it is tested on words whose roots are absent from the training sample. Given the algorithm's significant robustness to a uniform reduction of the training sample, we conclude that the training dataset for the studied model can be small but should be as diverse as possible.
ISSN:	2767-9535
DOI:	10.1109/ISPRAS60948.2023.10508171
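
The evaluation setup described in the summary (testing on words whose roots are absent from the training sample, scored by word accuracy) can be illustrated with a minimal sketch. It is not the authors' code: the names `word_accuracy`, `root_disjoint_split` and `SAMPLES` are hypothetical, the toy data is illustrative only, and each word is assumed to have a single annotated root. Word accuracy is taken here to mean the share of words whose whole segmentation matches the reference exactly.

```python
import random
from collections import defaultdict

# Hypothetical toy data: (word, root, reference segmentation).
# Illustrative only; not taken from the paper's dataset.
SAMPLES = [
    ("водный", "вод", "вод-н-ый"),
    ("подводный", "вод", "под-вод-н-ый"),
    ("лесной", "лес", "лес-н-ой"),
    ("лесник", "лес", "лес-ник"),
    ("домик", "дом", "дом-ик"),
    ("домашний", "дом", "дом-ашн-ий"),
]


def word_accuracy(gold, predicted):
    """Share of words whose full segmentation equals the reference."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def root_disjoint_split(samples, test_share=0.2, seed=0):
    """Split samples so that no root in the test part occurs in training.

    Whole root groups are assigned to one side, so test_share applies to
    the number of distinct roots, not to the number of words.
    """
    by_root = defaultdict(list)
    for sample in samples:
        by_root[sample[1]].append(sample)

    roots = sorted(by_root)
    random.Random(seed).shuffle(roots)

    n_test = max(1, round(len(roots) * test_share))
    test = [s for root in roots[:n_test] for s in by_root[root]]
    train = [s for root in roots[n_test:] for s in by_root[root]]
    return train, test


train, test = root_disjoint_split(SAMPLES, test_share=0.34)
print("train roots:", sorted({root for _, root, _ in train}))
print("test roots:", sorted({root for _, root, _ in test}))
```

Comparing a model's word accuracy on such a root-disjoint test set with its accuracy on an ordinary random split would show the kind of gap the summary reports.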