
Text-Image De-Contextualization Detection Using Vision-Language Models

Bibliographic Details
Main Authors: Huang, Mingzhen, Jia, Shan, Chang, Ming-Ching, Lyu, Siwei
Format: Conference Proceeding
Language: English
Description
Summary: Text-image de-contextualization, which pairs authentic images with inconsistent text, is an emerging form of misinformation that is drawing increasing attention due to the great threat it poses to information authenticity. Because the content in each modality is real but semantically mismatched across modalities, detecting de-contextualization is a challenging problem in media forensics. Inspired by recent advances in vision-language models, which learn powerful relationships between images and texts, we apply such models to the media de-contextualization detection task. Two popular models, CLIP and VinVL, are evaluated and compared on several news and social media datasets to assess their performance in detecting the image-text inconsistency characteristic of de-contextualization. We also summarize interesting observations and shed light on the use of vision-language models in de-contextualization detection.
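The abstract describes scoring image-text consistency with vision-language models such as CLIP, where a low cross-modal similarity signals a possible de-contextualized pair. A minimal sketch of that thresholding step, assuming the image and text embeddings have already been produced by a CLIP-style encoder (the embedding vectors and the threshold value here are illustrative, not the paper's settings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def is_decontextualized(image_emb: np.ndarray,
                        text_emb: np.ndarray,
                        threshold: float = 0.25) -> bool:
    """Flag an image-text pair as inconsistent when the cross-modal
    similarity falls below a chosen threshold (value is illustrative)."""
    return cosine_similarity(image_emb, text_emb) < threshold

# Toy embeddings standing in for encoder outputs: an aligned pair
# points in a similar direction; a mismatched pair does not.
aligned_image, aligned_text = np.array([1.0, 0.0]), np.array([1.0, 0.1])
mismatched_text = np.array([0.0, 1.0])

print(is_decontextualized(aligned_image, aligned_text))    # consistent pair
print(is_decontextualized(aligned_image, mismatched_text)) # inconsistent pair
```

In practice the threshold would be calibrated on a labeled validation set, since raw similarity scales differ between models such as CLIP and VinVL.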
ISSN:2379-190X
DOI:10.1109/ICASSP43922.2022.9746193