
Usage of compressed domain in fast frameworks

Bibliographic Details
Published in:	Signal, image and video processing, 2022-10, Vol.16 (7), p.1763-1771
Main Authors:	Arslan, Hasan Sait, Archambault, Simon, Bhatt, Prakruti, Watanabe, Keita, Cuevaz, Josue, Le, Phuc, Miller, Denis, Zhumatiy, Viktor
Format:	Article
Language:	English
Subjects:	Artificial neural networks; Benchmarks; Coefficients; Color imagery; Computer Imaging; Computer Science; Computer vision; Datasets; Discrete cosine transform; Image classification; Image compression; Image Processing and Computer Vision; Mathematical models; Multimedia Information Systems; Object recognition; Original Paper; Pattern Recognition and Graphics; Performance evaluation; Representations; Signal, Image and Speech Processing; Vision

Description
Summary:	There has been considerable progress in the applications of Convolutional Neural Networks (CNNs) to computer vision tasks with RGB images. A few studies investigated gaining more performance by replacing the RGB representation with block-wise Discrete Cosine Transform (DCT) coefficients. DCT coefficients that are readily available during JPEG decoding might be competitive with the output of computationally costly initial CNN layers fed by the RGB representation. Despite the attractiveness of the approach, to our knowledge, there is only a single study targeting the use of DCT coefficients with low-latency models. In this paper, we investigate the usage of DCT coefficients first with MnasNet, a mobile image classification model processing thousands of images per second on a single modern GPU, and second with Yolov5, which holds the benchmark performance on Average Precision (AP) and latency. After applying our methods to MnasNet (1.0) and evaluating performance on the ImageNet dataset, we observe competitive accuracy with RGB-based MnasNet (1.0) and significantly higher processing speed compared to RGB-based MnasNet (0.5). After applying our methods to Yolov5, we evaluate performance on three benchmark datasets. The resulting DCT-based object detection model processes up to 519 more images per second, with an AP drop of up to 4.7% on the MSCOCO test-dev set, up to 5.1% on the Pascal VOC 2007 test set, and up to 3.8% on the Crowd Human (Full-Body) validation set.
ISSN:	1863-1703 1863-1711
DOI:	10.1007/s11760-022-02133-2