
Bridging the gap between object detection in close-up and high-resolution wide shots

Bibliographic Details
Published in: Computer Vision and Image Understanding, 2024-12, Vol. 249, p. 104181, Article 104181
Main Authors: Li, Wenxi, Guo, Yuchen, Zheng, Jilai, Lin, Haozhe, Ma, Chao, Fang, Lu, Yang, Xiaokang
Format: Article
Language:English
Description
Summary: Recent years have seen a significant rise in gigapixel-level image/video capture systems and benchmarks with high-resolution wide (HRW) shots. Different from close-up shots like MS COCO, the higher resolution and wider field of view raise new research and application problems, such as how to perform accurate and efficient object detection with such large inputs on low-power edge devices like UAVs. There are several unique challenges in HRW shots. (1) Sparse information: the objects of interest cover less area. (2) Varying scales: object scale varies by 10× to 100× within a single image. (3) Incomplete objects: the sliding-window strategy used to handle the large input leads to truncated objects at window edges. (4) Multi-scale information: it is unclear how to use multi-scale information in training and inference. Consequently, directly using a close-up detector leads to inaccuracy and inefficiency. In this paper, we systematically investigate this problem and bridge the gap between object detection in close-up and HRW shots by introducing a novel sparse architecture that can be integrated with common networks like ConvNets and Transformers. It leverages alternative sparse learning to complementarily fuse coarse-grained and fine-grained features to (1) adaptively extract valuable information from (2) different object scales. We also propose a novel Cross-window Non-Maximum Suppression (C-NMS) algorithm to (3) improve box merging across windows. Furthermore, we propose a (4) simple yet effective multi-scale training and inference strategy to improve accuracy. Experiments on two benchmarks with HRW shots, PANDA and DOTA-v1.0, demonstrate that our methods significantly improve accuracy (by up to 5.8%) and speed (by up to 3×) over state-of-the-art methods, for both ConvNet- and Transformer-based detectors, on edge devices. Our code is open-sourced and available at https://github.com/liwenxi/SparseFormer.
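The abstract notes that sliding-window detection on HRW shots truncates objects at window edges and requires merging boxes produced in different windows, which C-NMS improves. As a rough illustration of the baseline being improved (not the paper's C-NMS algorithm), the following sketch shifts each window's detections into global image coordinates and applies plain greedy NMS across windows; the function names and the window/detection format are assumptions for this example only:

```python
def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_window_detections(windows, iou_thresh=0.5):
    """Merge per-window detections over a large image.

    `windows` is a list of ((offset_x, offset_y), detections) pairs,
    where each detection is (x1, y1, x2, y2, score) in window-local
    coordinates. Boxes are shifted to global coordinates, then
    overlapping boxes from different windows are greedily suppressed.
    """
    boxes, scores = [], []
    for (ox, oy), dets in windows:
        for x1, y1, x2, y2, s in dets:
            boxes.append([x1 + ox, y1 + oy, x2 + ox, y2 + oy])
            scores.append(s)
    # Standard greedy NMS: keep highest-scoring box, drop overlaps.
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return [(boxes[i], scores[i]) for i in keep]
```

A truncated object detected in two overlapping windows yields two partially overlapping global boxes; plain NMS keeps only the higher-scoring fragment, which is exactly the failure mode a cross-window merge strategy must handle more carefully.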
• Introduces a sparse architecture for HRW shots, optimizing object detection accuracy and efficiency.
• Proposes Cross-window NMS (C-NMS) to improve detection of incomplete objects in HRW imagery.
• Employs multi-scale augmentation, enhancing feature learning across variable object scales.
• Validates on PANDA and DOTA-v1.0, showing significant improvements over state-of-the-art methods.
• Demonstrates potential for UAVs and low-power devices in real-world applications.
ISSN: 1077-3142
DOI: 10.1016/j.cviu.2024.104181