
VOLTER: Visual Collaboration and Dual-Stream Fusion for Scene Text Recognition

Bibliographic Details
Published in: IEEE Transactions on Multimedia, 2024, Vol. 26, pp. 6437-6448
Main Authors: Li, Jia-Nan; Liu, Xiao-Qian; Luo, Xin; Xu, Xin-Shun
Format: Article
Language: English
Description
Summary: Recently, linguistic-modeling approaches to scene text recognition have become mainstream. They mainly consist of a vision model (VM), a language model (LM), and an optional fusion module, and they typically use the LM and fusion module to iteratively refine the VM's predictions. However, the VM usually consists of a Transformer stacked on top of a ResNet, so the attention mechanism is applied only in the high layers of the VM and ignores the internal image dependencies in the dense features at multiple scales; the VM's results thus become the performance bottleneck. Meanwhile, the visual and language features of these methods reside in separate spaces, so fusing them without prior alignment fails to achieve maximum information interaction. To address these issues, we propose Visual cOllaboration and duaL-stream fusion for scene TExt Recognition, VOLTER for short. First, a multi-stage Local-Global Collaboration Vision Model (LGC-VM) is constructed to capture both local and global features at multiple scales, breaking the vision bottleneck and providing better vision predictions. Second, to explicitly align the feature spaces of the VM and LM, we introduce a Vision-Language Contrastive (VLC) module that encourages positive vision-language pairs to have similar representations. Moreover, a Dual-Stream Feature Enhancement (DSFE) module is proposed for unidirectional interaction between visual and language features, alleviating the cross-modal alignment problem and carrying out fusion. Extensive experiments on benchmark datasets demonstrate that the proposed framework achieves state-of-the-art performance.
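
As a rough illustration of the VLC idea described in the abstract, the following minimal sketch (in PyTorch; not the authors' code, and names such as vlc_loss and the temperature value are illustrative assumptions) shows a symmetric contrastive objective that pulls positive vision-language pairs together in a shared embedding space, with the other pairs in the batch serving as negatives:

    import torch
    import torch.nn.functional as F

    def vlc_loss(vision_feats, language_feats, temperature=0.07):
        # vision_feats, language_feats: (B, D) pooled features; row i of each
        # tensor forms a positive pair, all other rows are in-batch negatives.
        v = F.normalize(vision_feats, dim=-1)    # unit-length vision embeddings
        l = F.normalize(language_feats, dim=-1)  # unit-length language embeddings
        logits = v @ l.t() / temperature         # (B, B) similarity logits
        targets = torch.arange(v.size(0), device=v.device)  # diagonal = positives
        # Symmetric InfoNCE: average the vision-to-language and
        # language-to-vision cross-entropy terms.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Usage with random stand-in features for a batch of 8 images:
    loss = vlc_loss(torch.randn(8, 512), torch.randn(8, 512))
    print(loss.item())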
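
Similarly, one plausible reading of the DSFE module's "unidirectional interaction" is a cross-attention block in which one modality queries the other; the sketch below (again an assumption, not the paper's implementation; the class name and the choice of the language-queries-vision direction are hypothetical) lets language features attend over vision features before fusion:

    import torch
    import torch.nn as nn

    class UnidirectionalFusion(nn.Module):
        # Language queries attend over vision keys/values (a single direction).
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, lang_feats, vision_feats):
            # lang_feats: (B, T, D) LM features; vision_feats: (B, N, D) VM features.
            fused, _ = self.cross_attn(query=lang_feats, key=vision_feats,
                                       value=vision_feats)
            return self.norm(lang_feats + fused)  # residual connection + LayerNorm

    # Usage: 8 images, 25 character positions, 196 visual tokens, width 512.
    fusion = UnidirectionalFusion()
    out = fusion(torch.randn(8, 25, 512), torch.randn(8, 196, 512))
    print(out.shape)  # torch.Size([8, 25, 512])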
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2024.3350916