
CGFormer: ViT-Based Network for Identifying Computer-Generated Images With Token Labeling

Bibliographic Details
Published in: IEEE Transactions on Information Forensics and Security, 2024, Vol. 19, pp. 235-250
Main Authors: Quan, Weize; Deng, Pengfei; Wang, Kai; Yan, Dong-Ming
Format: Article
Language:English
Description
Summary: Advanced graphics rendering techniques and image generation algorithms have significantly improved the visual quality of computer-generated (CG) images, making it more challenging for a forensic detector to distinguish between CG images and natural images (NIs). To identify a CG image, human beings often need to inspect and evaluate both the entire image and its local regions. In addition, we observe that the distributions of both near and far patch-wise correlations differ between CG images and NIs. Current mainstream methods adopt a CNN-based architecture with the classical cross-entropy loss; however, they have several limitations: 1) weak modeling of long-distance relationships in image content, due to the local receptive field of CNNs; 2) pixel sensitivity, due to the convolutional computation; and 3) insufficient supervision, because the training loss is computed on the whole image only. In this paper, we propose a novel vision transformer (ViT)-based network with token labeling for CG image identification. Our network, called CGFormer, consists of patch embedding, feature modeling, and token prediction. We apply patch embedding to turn the input image into a token sequence and to weaken the pixel sensitivity. Stacked multi-head attention-based transformer blocks are used to model the patch-wise relationships and introduce a certain level of adaptability. Besides the conventional classification loss on the class token of the whole image, we additionally introduce a soft cross-entropy loss on the patch tokens to comprehensively exploit the supervision information from local patches. Extensive experiments demonstrate that our method achieves state-of-the-art forensic performance on six publicly available datasets in terms of classification accuracy, generalization, and robustness. Code is available at https://github.com/feipiefei/CGFormer.
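
To make the two ingredients named in the abstract concrete, the sketch below shows a standard ViT-style patch embedding and a combined training objective: hard cross entropy on the class token plus soft cross entropy on the patch tokens. It is a minimal PyTorch sketch written under generic assumptions; the class and function names, the weighting factor beta, the tensor shapes, and the embedding hyperparameters are illustrative and are not taken from the authors' released code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PatchEmbedding(nn.Module):
        """Splits an image into non-overlapping patches and linearly projects
        each one to an embedding, as in a standard ViT (illustrative sketch)."""
        def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=384):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2
            # A strided convolution is the usual way to patchify and project in one step.
            self.proj = nn.Conv2d(in_chans, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, x):                    # x: (B, 3, H, W)
            x = self.proj(x)                     # (B, D, H/ps, W/ps)
            return x.flatten(2).transpose(1, 2)  # (B, N, D) sequence of patch tokens

    def token_labeling_loss(cls_logits, patch_logits,
                            image_labels, patch_soft_labels, beta=0.5):
        """Classification loss on the class token plus a soft cross entropy
        on every patch token; beta is a hypothetical weighting factor."""
        # cls_logits:        (B, C)    prediction from the class token
        # patch_logits:      (B, N, C) predictions from the N patch tokens
        # image_labels:      (B,)      hard image-level labels (CG vs. NI)
        # patch_soft_labels: (B, N, C) soft targets for each patch
        cls_loss = F.cross_entropy(cls_logits, image_labels)
        log_probs = F.log_softmax(patch_logits, dim=-1)
        # Soft cross entropy: -sum(target * log prob), averaged over all patches.
        patch_loss = -(patch_soft_labels * log_probs).sum(dim=-1).mean()
        return cls_loss + beta * patch_loss

In the token-labeling literature, patch-level soft targets are typically produced by a pretrained annotator network or derived from the image-level label; the sketch above does not assert which variant CGFormer uses, only the shape of the combined objective the abstract describes.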
ISSN: 1556-6013, 1556-6021
DOI: 10.1109/TIFS.2023.3322083