Relation-Aware Heterogeneous Graph Network for Learning Intermodal Semantics in Textbook Question Answering
Published in: | IEEE Transactions on Neural Networks and Learning Systems, 2024-09, Vol. 35 (9), p. 11872-11883 |
---|---|
Main Authors: | , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | The textbook question answering (TQA) task aims to infer answers to given questions from a multimodal context that includes text and diagrams. Existing studies aggregate intramodal semantics extracted from a single modality but have yet to capture the intermodal semantics between different modalities. A major challenge in learning intermodal semantics is preserving intramodal semantics without loss while bridging the semantic gap caused by heterogeneity. In this article, we propose an intermodal relation-aware heterogeneous graph network (IMR-HGN) to extract the intermodal semantics for TQA; it aggregates different modalities while learning features rather than representing them independently. First, we design a multidomain consistent representation (MDCR) to eliminate semantic gaps by capturing intermodal features while preserving lossless intramodal semantics across multiple domains. Second, we present neighbor-based relation inpainting (NRI) to reduce semantic ambiguity by repairing fuzzy relations using the correlations among relations. Finally, we propose hierarchical multisemantics aggregation (HMSA) to guarantee the completeness of semantics by aggregating features of nodes and relations with a reconstruction network (RN). Experimental results show that IMR-HGN can extract the intermodal semantics of answers, achieving a 2.16% improvement on the validation set of the TQA dataset and a 3.04% improvement on the test set of the AI2D dataset. |
---|---|
ISSN: | 2162-237X, 2162-2388 |
DOI: | 10.1109/TNNLS.2024.3385436 |