
Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions


Bibliographic Details
Published in: arXiv.org, 2023-07
Main Authors: Iioka, Yui; Yoshida, Yu; Wada, Yuiga; Hatanaka, Shumpei; Sugiura, Komei
Format: Article
Language: English
EISSN: 2331-8422
Subjects: Datasets; Feature extraction; Image segmentation; Indoor environments; Language instruction; Masks; Pixels; Probabilistic models; Sentences; Target masking
Source: Publicly Available Content Database (ProQuest)
Description: In this study, we aim to develop a model that comprehends a natural language instruction (e.g., "Go to the living room and get the nearest pillow to the radio art on the wall") and generates a segmentation mask for the target everyday object. The task is challenging because it requires (1) the understanding of the referring expressions for multiple objects in the instruction, (2) the prediction of the target phrase of the sentence among the multiple phrases, and (3) the generation of pixel-wise segmentation masks rather than bounding boxes. Studies have been conducted on language-based segmentation methods; however, they sometimes mask irrelevant regions for complex sentences. In this paper, we propose the Multimodal Diffusion Segmentation Model (MDSM), which generates a mask in the first stage and refines it in the second stage. We introduce a cross-modal parallel feature extraction mechanism and extend diffusion probabilistic models to handle cross-modal features. To validate our model, we built a new dataset based on the well-known Matterport3D and REVERIE datasets. This dataset consists of instructions with complex referring expressions accompanied by real indoor environmental images that feature various target objects, in addition to pixel-wise segmentation masks. The performance of MDSM surpassed that of the baseline method by a large margin of +10.13 points in mean IoU.
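The abstract describes a two-stage pipeline: a mask is generated from fused image and language features, then refined with a diffusion-style process. Below is a minimal PyTorch sketch of that generate-then-refine structure; every module name, feature dimension, and the simplified noise-and-denoise loop are illustrative assumptions, since this record does not specify the actual MDSM architecture.

# Minimal sketch of a generate-then-refine segmentation pipeline in the
# spirit of the abstract. All names, sizes, and the simplified refinement
# loop are assumptions for illustration, not the paper's actual MDSM design.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Projects image and text features in parallel, then fuses them additively.
    def __init__(self, img_dim=512, txt_dim=512, dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)

    def forward(self, img_feat, txt_feat):
        return self.img_proj(img_feat) + self.txt_proj(txt_feat)

class TwoStageSegmenter(nn.Module):
    def __init__(self, dim=256, mask_pixels=32 * 32):
        super().__init__()
        self.fuse = CrossModalFusion(dim=dim)
        self.stage1 = nn.Linear(dim, mask_pixels)                 # coarse mask
        self.stage2 = nn.Linear(mask_pixels + dim, mask_pixels)   # refinement

    def forward(self, img_feat, txt_feat, steps=4):
        fused = self.fuse(img_feat, txt_feat)
        mask = torch.sigmoid(self.stage1(fused))          # stage 1: initial mask
        for _ in range(steps):                            # stage 2: iterative,
            noisy = mask + 0.1 * torch.randn_like(mask)   # loosely diffusion-like
            mask = torch.sigmoid(self.stage2(torch.cat([noisy, fused], dim=-1)))
        return mask  # (batch, mask_pixels); reshape to (32, 32) for display

# Usage: random features stand in for real image/text encoders.
mask = TwoStageSegmenter()(torch.randn(2, 512), torch.randn(2, 512))
print(mask.shape)  # torch.Size([2, 1024])

The reported +10.13 gain is in mean IoU. For reference, the following computes that metric in its standard form (per-mask intersection over union, averaged over samples); the paper's exact evaluation protocol is an assumption.

# Mean IoU for binary segmentation masks, following the standard definition.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    # Intersection over Union of two boolean masks of the same shape.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0  # both empty: count as perfect

def mean_iou(preds, gts) -> float:
    # Average IoU over paired prediction / ground-truth masks.
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

# Example: masks of 4 pixels each, overlapping in 2; union is 6, so IoU = 2/6.
pred = np.zeros((4, 4), dtype=bool); pred[0, :4] = True
gt = np.zeros((4, 4), dtype=bool); gt[0, 2:] = True; gt[1, 2:] = True
print(iou(pred, gt))  # 0.333...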