
Context Enhanced Transformer for Single Image Object Detection

Bibliographic Details
Published in:arXiv.org 2023-12
Main Authors: An, Seungjun, Park, Seonghoon, Kim, Gyeongnyeon, Baek, Jeongyeol, Lee, Byeongwon, Kim, Seungryong
Format: Article
Language:English
Subjects:
description With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various strategies to address this challenge, they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-based VOD methods have shown promising results, their reliance on multiple inputs and additional network complexity to incorporate temporal information limits their practical applicability. In this paper, we propose a novel approach to single image object detection, called Context Enhanced TRansformer (CETR), which incorporates temporal context into DETR using a newly designed memory module. To efficiently store temporal information, we construct a class-wise memory that collects contextual information across the data. Additionally, we present a classification-based sampling technique that selectively utilizes the relevant memory for the current image. At test time, we introduce a memory adaptation method that updates individual memory entries by considering the test distribution. Experiments on the CityCam and ImageNet VID datasets demonstrate the efficiency of the framework on various video systems. The project page and code will be made available at: https://ku-cvlab.github.io/CETR.
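The class-wise memory and classification-based sampling described in the abstract can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the class names, momentum-style update rule, and score threshold are all assumptions introduced for the example.

```python
# Hypothetical sketch: one running-average feature slot per class,
# updated with momentum, plus a classification-based sampler that
# returns only the slots for classes predicted in the current image.
import numpy as np

class ClassWiseMemory:
    def __init__(self, num_classes, dim, momentum=0.9):
        self.memory = np.zeros((num_classes, dim))      # one slot per class
        self.filled = np.zeros(num_classes, dtype=bool)  # which slots hold data
        self.momentum = momentum

    def update(self, class_id, feature):
        """Momentum update of the per-class context vector."""
        if self.filled[class_id]:
            self.memory[class_id] = (self.momentum * self.memory[class_id]
                                     + (1 - self.momentum) * feature)
        else:
            self.memory[class_id] = feature
            self.filled[class_id] = True

    def sample(self, class_scores, threshold=0.5):
        """Classification-based sampling: keep memory only for classes
        whose predicted score clears the threshold and that hold data."""
        keep = (class_scores >= threshold) & self.filled
        return self.memory[keep]

mem = ClassWiseMemory(num_classes=3, dim=4)
mem.update(0, np.ones(4))
mem.update(0, np.zeros(4))                    # momentum-averaged with first write
selected = mem.sample(np.array([0.9, 0.2, 0.8]))
print(selected.shape)                         # only class 0 is predicted AND filled
```

The selected memory rows would then be fed to the detector (e.g., as extra context tokens for the DETR decoder); the test-time adaptation the abstract mentions would amount to continuing these `update` calls on the test stream.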
identifier EISSN: 2331-8422
source Publicly Available Content Database
subjects Context; Image enhancement; Object recognition; Sampling methods; Testing time; Transformers; Video data