
DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation

Recent works on Video Object Segmentation have achieved remarkable results by matching dense semantic and instance-level features between the current and previous frames for long-term propagation. Nevertheless, global feature matching ignores scene motion context and fails to ensure temporal consistency. Even though some methods introduce a local matching branch to achieve smooth propagation, they fail to model complex appearance changes due to the constraints of the local window. In this paper, we present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation, resulting in stable long-term modeling and strong temporal consistency. For short-term local propagation, we propose a novel attention mechanism, ADVA (Adaptive Deformable Video Attention), which adapts the similarity search region to query-specific semantic features and thus ensures robust tracking under complex shape and scale changes. DeVOS employs optical flow to obtain scene motion features, which are injected into deformable attention as strong priors for the learnable offsets. Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1%, 83.0%) and YouTube-VOS 2019 val (86.6%) while featuring consistent run-time speed and stable memory consumption.

Bibliographic Details
Main Authors: Fedynyak, Volodymyr, Romanus, Yaroslav, Hlovatskyi, Bohdan, Sydor, Bohdan, Dobosevych, Oles, Babin, Igor, Riazantsev, Roman
Format: Conference Proceeding
Language: English
Subjects:
Online Access: Request full text
container_end_page 248
container_start_page 239
creator Fedynyak, Volodymyr
Romanus, Yaroslav
Hlovatskyi, Bohdan
Sydor, Bohdan
Dobosevych, Oles
Babin, Igor
Riazantsev, Roman
description Recent works on Video Object Segmentation have achieved remarkable results by matching dense semantic and instance-level features between the current and previous frames for long-term propagation. Nevertheless, global feature matching ignores scene motion context and fails to ensure temporal consistency. Even though some methods introduce a local matching branch to achieve smooth propagation, they fail to model complex appearance changes due to the constraints of the local window. In this paper, we present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation, resulting in stable long-term modeling and strong temporal consistency. For short-term local propagation, we propose a novel attention mechanism, ADVA (Adaptive Deformable Video Attention), which adapts the similarity search region to query-specific semantic features and thus ensures robust tracking under complex shape and scale changes. DeVOS employs optical flow to obtain scene motion features, which are injected into deformable attention as strong priors for the learnable offsets. Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1%, 83.0%) and YouTube-VOS 2019 val (86.6%) while featuring consistent run-time speed and stable memory consumption.
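The abstract's key mechanism — deformable attention whose sampling locations are biased by an optical-flow prior — can be illustrated with a minimal single-query sketch. This is not the authors' ADVA implementation; all names (`bilinear_sample`, `deformable_attention`) and the single-scale, single-head setup are hypothetical simplifications, assuming only that each query samples K points at "reference position + flow displacement + learned offsets" and aggregates them with softmax attention:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feature map feat (H, W, C) at fractional location (y, x)."""
    H, W, _ = feat.shape
    y, x = np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def deformable_attention(query, ref_feat, ref_pos, offsets, flow):
    """Single-query deformable attention with a flow prior on the sampling locations.

    query    : (C,)      query feature at the current frame
    ref_feat : (H, W, C) feature map of the reference (previous) frame
    ref_pos  : (2,)      (y, x) position of the query
    offsets  : (K, 2)    learnable per-point offsets
    flow     : (2,)      optical-flow displacement, acting as a strong prior
    """
    # Sampling points = query position shifted by the flow prior plus learned offsets.
    points = ref_pos + flow + offsets                      # (K, 2)
    values = np.stack([bilinear_sample(ref_feat, p[0], p[1]) for p in points])
    logits = values @ query / np.sqrt(query.size)          # scaled dot-product scores
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                               # softmax over the K points
    return weights @ values                                # (C,) aggregated feature
```

In the full architecture this would run per query over multi-scale features with a network predicting the offsets; the sketch only shows how a flow displacement can bias where deformable attention samples, so the search region follows scene motion instead of a fixed local window.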
doi_str_mv 10.1109/WACV57701.2024.00031
format conference_proceeding
fulltext fulltext_linktorsrc
identifier EISSN: 2642-9381
ispartof 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, p.239-248
issn 2642-9381
language eng
recordid cdi_ieee_primary_10484285
source IEEE Xplore All Conference Series
subjects Adaptation models
Algorithms
Applications
Computer vision
Deformable models
Image recognition and understanding
Memory management
Remote Sensing
Semantics
Shape
Tracking
Video recognition and understanding
title DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation