Relation-Aware Heterogeneous Graph Network for Learning Intermodal Semantics in Textbook Question Answering


Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2024-09, Vol. 35 (9), p. 11872-11883
Main Authors: Zhang, Sai; Wu, Yunjie; Zhang, Xiaowang; Feng, Zhiyong; Wan, Liang; Zhuang, Zhiqiang
Format: Article
Language: English
Description
Summary: The textbook question answering (TQA) task aims to infer answers to given questions from a multimodal context that includes text and diagrams. Existing studies have aggregated intramodal semantics extracted from a single modality but have yet to capture the intermodal semantics between different modalities. A major challenge in learning intermodal semantics is maintaining lossless intramodal semantics while bridging the semantic gap caused by heterogeneity. In this article, we propose an intermodal relation-aware heterogeneous graph network (IMR-HGN) to extract the intermodal semantics for TQA, which aggregates different modalities while learning features rather than representing them independently. First, we design a multidomain consistent representation (MDCR) to eliminate semantic gaps by capturing intermodal features while maintaining lossless intramodal semantics across multiple domains. Furthermore, we present neighbor-based relation inpainting (NRI) to reduce semantic ambiguity by repairing fuzzy relations using correlations among relations. Finally, we propose hierarchical multisemantics aggregation (HMSA) to guarantee the completeness of semantics by aggregating features of nodes and relations with a reconstruction network (RN). Experimental results show that IMR-HGN can extract the intermodal semantics of answers, achieving a 2.16% improvement on the validation set of the TQA dataset and a 3.04% improvement on the test set of the AI2D dataset.
ISSN: 2162-237X
2162-2388
DOI: 10.1109/TNNLS.2024.3385436