ClickVOS: Click Video Object Segmentation
Video Object Segmentation (VOS) aims to segment objects in videos. However, previous settings either require time-consuming manual masks of target objects at the first frame during inference or lack the flexibility to specify arbitrary objects of interest. To address these limitations, we propose a setting named Click Video Object Segmentation (ClickVOS), which segments objects of interest across the whole video according to a single click per object in the first frame. We also provide the extended datasets DAVIS-P and YouTubeVOS-P with point annotations to support this task. ClickVOS has significant practical value and research implications: indicating an object takes only 1-2 seconds of interaction, compared with the several minutes needed to annotate an object mask. However, ClickVOS also presents increased challenges. To address this task, we propose an end-to-end baseline approach named Attention Before Segmentation (ABS), motivated by the attention process of humans. ABS uses the given point in the first frame to perceive the target object through a concise yet effective segmentation attention. Although the initial object mask may be inaccurate, in ABS the mask can self-heal as the video progresses instead of deteriorating through error accumulation, thanks to our designed improvement memory, which continuously records stable global object memory and updates detailed dense memory. In addition, we conduct various baseline explorations using off-the-shelf algorithms from related fields, which could provide insights for further exploration of ClickVOS. The experimental results demonstrate the superiority of the proposed ABS approach. Extended datasets and code will be available at https://github.com/PinxueGuo/ClickVOS.
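The inference loop the abstract describes — a click seeds an initial, possibly imprecise mask, which is then propagated frame by frame while a stable global memory and a continuously refreshed dense memory let early errors self-heal — can be illustrated with a minimal toy sketch. Everything below is a hypothetical stand-in on grayscale numpy frames, not the authors' implementation: the flood-fill "segmentation attention" and the scalar memories are illustrative assumptions only.

```python
import numpy as np
from collections import deque

def attention_before_segmentation(frame, click, tol=30):
    """Toy stand-in for ABS's segmentation attention: grow a region of
    pixels whose intensity is within `tol` of the clicked pixel."""
    h, w = frame.shape
    seed = int(frame[click])
    mask = np.zeros((h, w), dtype=bool)
    queue, seen = deque([click]), {click}
    while queue:
        y, x = queue.popleft()
        if abs(int(frame[y, x]) - seed) <= tol:
            mask[y, x] = True
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen:
                    seen.add((ny, nx))
                    queue.append((ny, nx))
    return mask

def segment_video(frames, click):
    """Propagate the click-seeded mask through the video. The global
    memory stays fixed (stable object statistic from the first frame),
    while the dense memory is refreshed every frame -- a scalar analogue
    of the 'improvement memory' the abstract describes."""
    masks = [attention_before_segmentation(frames[0], click)]
    global_memory = frames[0][masks[0]].mean()  # stable, never updated
    dense_memory = global_memory                # updated per frame
    for frame in frames[1:]:
        # Blend the stable and fresh cues so an imprecise early mask
        # can recover instead of accumulating errors.
        target = 0.5 * global_memory + 0.5 * dense_memory
        mask = np.abs(frame.astype(int) - target) <= 30
        if mask.any():
            dense_memory = frame[mask].mean()
        masks.append(mask)
    return masks
```

In this sketch the fixed global memory anchors the object's appearance while the dense memory tracks gradual change, mirroring the division of labor attributed to the improvement memory.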
Published in: | arXiv.org 2024-03 |
---|---|
Main Authors: | Guo, Pinxue; Hong, Lingyi; Zhou, Xinyu; Gao, Shuyong; Li, Wanyun; Li, Jinglun; Chen, Zhaoyu; Li, Xiaoqiang; Zhang, Wei; Zhang, Wenqiang |
Format: | Article |
Language: | English |
Subjects: | Algorithms; Annotations; Datasets; Segmentation; Segments; Target masking |
Online Access: | Get full text |
Field | Value |
---|---|
container_title | arXiv.org |
creator | Guo, Pinxue; Hong, Lingyi; Zhou, Xinyu; Gao, Shuyong; Li, Wanyun; Li, Jinglun; Chen, Zhaoyu; Li, Xiaoqiang; Zhang, Wei; Zhang, Wenqiang |
description | Video Object Segmentation (VOS) aims to segment objects in videos. However, previous settings either require time-consuming manual masks of target objects at the first frame during inference or lack the flexibility to specify arbitrary objects of interest. To address these limitations, we propose a setting named Click Video Object Segmentation (ClickVOS), which segments objects of interest across the whole video according to a single click per object in the first frame. We also provide the extended datasets DAVIS-P and YouTubeVOS-P with point annotations to support this task. ClickVOS has significant practical value and research implications: indicating an object takes only 1-2 seconds of interaction, compared with the several minutes needed to annotate an object mask. However, ClickVOS also presents increased challenges. To address this task, we propose an end-to-end baseline approach named Attention Before Segmentation (ABS), motivated by the attention process of humans. ABS uses the given point in the first frame to perceive the target object through a concise yet effective segmentation attention. Although the initial object mask may be inaccurate, in ABS the mask can self-heal as the video progresses instead of deteriorating through error accumulation, thanks to our designed improvement memory, which continuously records stable global object memory and updates detailed dense memory. In addition, we conduct various baseline explorations using off-the-shelf algorithms from related fields, which could provide insights for further exploration of ClickVOS. The experimental results demonstrate the superiority of the proposed ABS approach. Extended datasets and code will be available at https://github.com/PinxueGuo/ClickVOS. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-03 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2955956800 |
source | Publicly Available Content Database |
subjects | Algorithms; Annotations; Datasets; Segmentation; Segments; Target masking |
title | ClickVOS: Click Video Object Segmentation |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T12%3A24%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=ClickVOS:%20Click%20Video%20Object%20Segmentation&rft.jtitle=arXiv.org&rft.au=Guo,%20Pinxue&rft.date=2024-03-10&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2955956800%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_29559568003%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2955956800&rft_id=info:pmid/&rfr_iscdi=true |