Usage of compressed domain in fast frameworks

Bibliographic Details
Published in: Signal, Image and Video Processing, 2022-10, Vol. 16 (7), pp. 1763-1771
Main Authors: Arslan, Hasan Sait, Archambault, Simon, Bhatt, Prakruti, Watanabe, Keita, Cuevaz, Josue, Le, Phuc, Miller, Denis, Zhumatiy, Viktor
Format: Article
Language:English
Summary: There has been considerable progress in the applications of Convolutional Neural Networks (CNNs) to computer vision tasks with RGB images. A few studies have investigated gaining more performance by replacing the RGB representation with block-wise Discrete Cosine Transform (DCT) coefficients. DCT coefficients, which are readily available during JPEG decoding, might be competitive with the output of computationally costly initial CNN layers fed by the RGB representation. Despite the attractiveness of the approach, to the best of our knowledge there is only a single study targeting the use of DCT coefficients with low-latency models. In this paper, we investigate the usage of DCT coefficients first with MnasNet, a mobile image classification model processing thousands of images per second on a single modern GPU, and second with Yolov5, which holds benchmark performance on Average Precision (AP) and latency. After applying our methods to MnasNet (1.0) and evaluating performance on the ImageNet dataset, we observe accuracy competitive with RGB-based MnasNet (1.0) and significantly higher processing speed than RGB-based MnasNet (0.5). After applying our methods to Yolov5, we evaluate performance on three benchmark datasets. The resulting DCT-based object detection model processes up to 519 more images per second, while incurring an AP drop of up to 4.7% on the MSCOCO test-dev set, up to 5.1% on the Pascal VOC 2007 test set, and up to 3.8% on the Crowd Human (Full-Body) validation set.
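The block-wise DCT representation the abstract refers to is the standard 8×8 DCT-II used in JPEG: a decoder can hand these coefficients to a network directly, skipping the inverse transform back to RGB. As a minimal illustration (not the authors' implementation; the function names and array layout here are our own assumptions), the following sketch computes per-block orthonormal DCT coefficients for a grayscale image using only NumPy:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix (the 8x8 transform used in JPEG).
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] *= 1 / np.sqrt(n)       # DC row scaling
    M[1:] *= np.sqrt(2 / n)      # AC row scaling
    return M

def blockwise_dct(img, n=8):
    # img: (H, W) grayscale array with H and W multiples of n.
    # Returns an (H//n, W//n, n, n) array of per-block DCT coefficients,
    # i.e. the representation a JPEG decoder has before the inverse DCT.
    H, W = img.shape
    M = dct_matrix(n)
    # Split the image into an (H//n, W//n) grid of n x n blocks.
    blocks = img.reshape(H // n, n, W // n, n).transpose(0, 2, 1, 3)
    # 2-D DCT of every block at once via broadcasting: M @ B @ M^T.
    return M @ blocks @ M.T

img = np.random.rand(32, 32)
coeffs = blockwise_dct(img)
print(coeffs.shape)  # (4, 4, 8, 8)
```

Because the basis matrix is orthonormal, the transform is lossless here (`M.T @ coeffs @ M` reconstructs the blocks exactly); in an actual JPEG pipeline the coefficients would additionally be quantized, which is where both the compression and the information loss come from.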
ISSN:1863-1703
1863-1711
DOI:10.1007/s11760-022-02133-2