Deep multimodal learning for cross-modal retrieval: One model for all tasks
Published in: Pattern Recognition Letters, 2021-06, Vol. 146, pp. 38-45
Main Authors:
Format: Article
Language: English
Summary:
• Study of VQA systems for cross-modal retrieval tasks.
• Baselines in several classification and cross-modal tasks for well-known databases.
• Evaluation of multiple cross-modal retrieval tasks from an end-to-end system.
• Approaches that improve the search experience by using all data present in articles.
• End-to-end approach to perform specialized unimodal and cross-modal retrieval.

We investigate the effectiveness of a successful model for Visual Question Answering (VQA) problems as the core component of a cross-modal retrieval system that can accept images or text as queries in order to retrieve relevant data from a multimodal document collection. To this end, we adapt the VQA model for deep multimodal learning to combine visual and textual representations for information search, and we call this model “Deep Multimodal Embeddings (DME)”. Instead of training the model to answer questions, we supervise DME to classify semantic topics/concepts previously identified in the document collection of interest. In contrast to previous approaches, we found that this model can handle any multimodal query with a single architecture while producing improved or competitive results in all retrieval tasks. We evaluate the model's performance on three widely known databases for cross-modal retrieval tasks: Wikipedia Retrieval Database, Pascal Sentences, and MIR-Flickr-25k. The results show that the DME model learns effective multimodal representations, resulting in strongly improved retrieval performance, specifically in the top results of the ranked list, which are the most important to users in the most common scenarios of information retrieval. Our work represents a new baseline for a wide set of different methodologies for cross-modal retrieval.
ISSN: 0167-8655, 1872-7344
DOI: 10.1016/j.patrec.2021.02.021
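The record does not include the paper's implementation, but the core idea in the abstract can be illustrated: a single network embeds image-only, text-only, or multimodal inputs into one shared space, is trained to classify semantic topics rather than to answer questions, and its embeddings are then ranked by similarity for retrieval. Below is a minimal sketch under stated assumptions: PyTorch, a simple additive-fusion architecture, and placeholder feature dimensions. The actual DME model is built on a VQA architecture whose details are not given in this record, so every class name and dimension here is hypothetical.

```python
# Hypothetical sketch of a DME-style model; not the authors' architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMESketch(nn.Module):
    """Maps an image feature, a text feature, or both into one shared
    embedding space, supervised by topic/concept classification."""

    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=512, num_topics=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)       # visual branch
        self.txt_proj = nn.Linear(txt_dim, emb_dim)       # textual branch
        self.classifier = nn.Linear(emb_dim, num_topics)  # topic head

    def embed(self, img=None, txt=None):
        # Accept image-only, text-only, or multimodal input: a missing
        # modality simply contributes nothing, so one network serves
        # every query type.
        parts = []
        if img is not None:
            parts.append(self.img_proj(img))
        if txt is not None:
            parts.append(self.txt_proj(txt))
        z = torch.stack(parts).sum(dim=0)                 # additive fusion
        return F.normalize(torch.relu(z), dim=-1)         # unit-norm embedding

    def forward(self, img=None, txt=None):
        return self.classifier(self.embed(img, txt))      # topic logits

model = DMESketch()

# Training signal: cross-entropy against topic labels previously
# identified in the document collection (random tensors stand in for
# real image/text features here).
logits = model(img=torch.randn(4, 2048), txt=torch.randn(4, 300))
loss = F.cross_entropy(logits, torch.randint(0, 10, (4,)))

# Retrieval: rank collection items by cosine similarity to the query
# embedding, regardless of which modality the query or documents use.
query = model.embed(txt=torch.randn(1, 300))              # text query
index = model.embed(img=torch.randn(100, 2048))           # image collection
ranking = (query @ index.T).argsort(descending=True)      # best matches first
```

The design choice this sketch illustrates is the one the title claims: because absent modalities are simply skipped during fusion, the same trained weights handle image-to-text, text-to-image, and multimodal queries, rather than requiring a separate model per retrieval task.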