
PLBR: A Semi-Supervised Document Key Information Extraction via Pseudo-Labeling Bias Rectification

Bibliographic Details
Published in:	IEEE Transactions on Knowledge and Data Engineering, 2024-12, Vol. 36 (12), p. 9025-9036
Main Authors:	Guo, Pengcheng; Song, Yonghong; Wang, Boyu; Liu, Jiaohao; Zhang, Qi
Format:	Article
Language:	English
Subjects:	Accuracy; Adaptation models; Benchmark testing; bias rectification; Contrastive learning; Information extraction; Information retrieval; inter-class variance; intra-class variance; semi-supervised; Semisupervised learning; Task analysis

Description
Summary:	Document key information extraction (DKIE) methods often require a large number of labeled samples, imposing substantial annotation costs in practical scenarios. Fortunately, pseudo-labeling based semi-supervised learning (PSSL) algorithms provide an effective paradigm to alleviate the reliance on labeled data by leveraging unlabeled data. However, PSSL faces two main challenges in DKIE tasks: 1) the context dependency of DKIE results in incorrect pseudo-labels; 2) DKIE data exhibit high intra-class variance and low inter-class variance. To this end, this paper proposes a similarity matrix Pseudo-Label Bias Rectification (PLBR) semi-supervised method for DKIE tasks, which improves the quality of pseudo-labels on DKIE benchmarks with rare labels. More specifically, the Similarity Matrix Bias Rectification (SMBR) module is proposed to improve the quality of pseudo-labels; it utilizes the contextual information of DKIE data by analyzing the similarity between labeled and unlabeled data. Moreover, a dual branch adaptive alignment (DBAA) mechanism, composed of two adaptive alignment branches, is designed to adaptively align intra-class variance and alleviate inter-class variation on DKIE benchmarks. One is the intra-class alignment branch, which is designed to adaptively align intra-class variance. The other is the inter-class alignment branch, which is developed to adaptively alleviate inter-class variance changes at the representation level. Extensive experimental results on two benchmarks demonstrate that PLBR achieves state-of-the-art performance, surpassing the previous SOTA by 2.11%∼2.53% and 2.09%∼2.49% F1-score on FUNSD and CORD with rare labeled samples, respectively. Code will be open to the public.
ISSN:	1041-4347, 1558-2191
DOI:	10.1109/TKDE.2024.3443928
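
Illustration:	The Summary describes improving pseudo-labels by analyzing the similarity between labeled and unlabeled data. The sketch below shows one generic way such a rectification could work; it is not the authors' SMBR module, whose details this record does not give. Everything in it is an assumption made for illustration: the function names (`cosine`, `similarity_matrix`, `rectify_pseudo_labels`), the use of cosine similarity, the per-class mean, the `temperature` parameter, and the rule of multiplying model probabilities by a similarity-based weight.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(unlabeled, labeled):
    """Rows: unlabeled samples. Columns: labeled samples."""
    return [[cosine(u, l) for l in labeled] for u in unlabeled]


def rectify_pseudo_labels(probs, sim, labeled_y, num_classes, temperature=0.1):
    """Hypothetical rectification rule (an assumption, not the paper's method).

    For each unlabeled sample, the model's class probabilities are reweighted
    by a softmax-style weight over its mean similarity to the labeled samples
    of each class; the argmax of the product is the rectified pseudo-label.
    """
    rectified = []
    for p, row in zip(probs, sim):
        sums = [0.0] * num_classes
        counts = [0] * num_classes
        for s, y in zip(row, labeled_y):
            sums[y] += s
            counts[y] += 1
        # Classes with no labeled sample get the lowest possible cosine value.
        means = [sums[c] / counts[c] if counts[c] else -1.0
                 for c in range(num_classes)]
        top = max(means)
        weights = [math.exp((m - top) / temperature) for m in means]
        scores = [pi * w for pi, w in zip(p, weights)]
        rectified.append(max(range(num_classes), key=scores.__getitem__))
    return rectified


# Toy data: two labeled embeddings, one per class.
labeled = [[1.0, 0.0], [0.0, 1.0]]
labeled_y = [0, 1]
# The first unlabeled sample lies close to class 0, but the model narrowly
# prefers class 1; the second is close to class 1 and predicted as class 1.
unlabeled = [[0.9, 0.1], [0.1, 0.9]]
probs = [[0.45, 0.55], [0.2, 0.8]]

sim = similarity_matrix(unlabeled, labeled)
print(rectify_pseudo_labels(probs, sim, labeled_y, num_classes=2))  # [0, 1]
```

In this toy run the first pseudo-label flips from class 1 to class 0 because the sample is far more similar to the class 0 labeled example, while the second is left unchanged. A smaller `temperature` lets similarity dominate the model's prediction; a larger one leaves the prediction closer to the model's original output.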