
CoD: Coherent Detection of Entities from Images with Multiple Modalities

Object detection is a fundamental problem in computer vision, whose research has primarily focused on unimodal models, solely operating on visual data. However, in many real-world applications, data from multiple modalities may be available, such as text accompanying the visual data. Leveraging traditional models on these multi-modal data sources may lead to difficulties in accurately delineating object boundaries. For example, in a document containing a combination of text and images, the model must encompass the images and texts pertaining to the same object in a single bounding box. To address this, we propose a model that takes in multi-scale image features, text extracted through OCR, and 2D positional embeddings of words as inputs, and returns bounding boxes that incorporate the image and associated description as single entities. Furthermore, to address the challenge posed by the irregular arrangement of images and their corresponding textual descriptions, we propose the concept of a "Negative Product Bounding Box" (PBB). This box encapsulates instances where the model faces confusion and tends to predict incorrect bounding boxes. To enhance the model's performance, we incorporate these negative boxes into the loss function governing matching and classification. Additionally, a domain adaptation model is proposed to handle scenarios involving a domain gap between training and test samples. In order to assess the effectiveness of our model, we construct a multimodal dataset comprising product descriptions from online retailers' catalogs. On this dataset, our proposed model demonstrates significant improvements of 27.2%, 4.3%, and 1.7% in handling hard negative samples, multi-modal input, and domain shift scenarios, respectively.


Bibliographic Details
Main Authors: Verma, Vinay; Sanny, Dween; Singh, Abhishek; Gupta, Deepak
Format: Conference Proceeding
Language: English
Subjects: Adaptation models; Algorithms; Annotations; Applications; Commercial / retail; Computational modeling; Computer vision; Robotics; Soft sensors; Training; Vision + language and/or other modalities; Visualization
container_end_page 8009
container_start_page 8000
creator Verma, Vinay
Sanny, Dween
Singh, Abhishek
Gupta, Deepak
description Object detection is a fundamental problem in computer vision, whose research has primarily focused on unimodal models, solely operating on visual data. However, in many real-world applications, data from multiple modalities may be available, such as text accompanying the visual data. Leveraging traditional models on these multi-modal data sources may lead to difficulties in accurately delineating object boundaries. For example, in a document containing a combination of text and images, the model must encompass the images and texts pertaining to the same object in a single bounding box. To address this, we propose a model that takes in multi-scale image features, text extracted through OCR, and 2D positional embeddings of words as inputs, and returns bounding boxes that incorporate the image and associated description as single entities. Furthermore, to address the challenge posed by the irregular arrangement of images and their corresponding textual descriptions, we propose the concept of a "Negative Product Bounding Box" (PBB). This box encapsulates instances where the model faces confusion and tends to predict incorrect bounding boxes. To enhance the model's performance, we incorporate these negative boxes into the loss function governing matching and classification. Additionally, a domain adaptation model is proposed to handle scenarios involving a domain gap between training and test samples. In order to assess the effectiveness of our model, we construct a multimodal dataset comprising product descriptions from online retailers' catalogs. On this dataset, our proposed model demonstrates significant improvements of 27.2%, 4.3%, and 1.7% in handling hard negative samples, multi-modal input, and domain shift scenarios, respectively.
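The description above names three input streams — multi-scale image features, OCR-extracted words, and 2D positional embeddings of those words — that are fused before detection. The paper's actual architecture is not reproduced in this record; the following is a minimal sketch of that fusion under stated assumptions (the function name `embed_multimodal_inputs` and the fixed random projection standing in for a learned positional layer are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_multimodal_inputs(image_feats, ocr_tokens, ocr_boxes, d_model=64):
    """Fuse the three input streams from the abstract into one token sequence.

    image_feats : list of (H_i * W_i, d_model) arrays, one per feature scale
    ocr_tokens  : (N, d_model) word embeddings from an OCR engine
    ocr_boxes   : (N, 4) normalized [x0, y0, x1, y1] word boxes on the page
    """
    # 2D positional embedding: project each word's 4 box coordinates to
    # d_model dimensions (a learned layer in practice; random here).
    pos_proj = rng.standard_normal((4, d_model)) / np.sqrt(4)
    pos_emb = ocr_boxes @ pos_proj                 # (N, d_model)
    text_tokens = ocr_tokens + pos_emb             # position-aware words
    visual_tokens = np.concatenate(image_feats, axis=0)
    # A detector head would attend over this joint sequence.
    return np.concatenate([visual_tokens, text_tokens], axis=0)

# Toy shapes: two feature scales (4x4 and 2x2 grids), three OCR words.
feats = [rng.standard_normal((16, 64)), rng.standard_normal((4, 64))]
words = rng.standard_normal((3, 64))
boxes = np.array([[0.1, 0.1, 0.3, 0.15],
                  [0.1, 0.2, 0.4, 0.25],
                  [0.5, 0.1, 0.9, 0.6]])
tokens = embed_multimodal_inputs(feats, words, boxes)
print(tokens.shape)  # (23, 64)
```

The 2D (rather than 1D) positional embedding is what lets the model relate a word's location on the page to nearby image regions, which is the prerequisite for grouping an image and its description into one box.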
doi_str_mv 10.1109/WACV57701.2024.00783
format conference_proceeding
date 2024-01-03
eisbn 9798350318920
coden IEEPAD
publisher IEEE
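The abstract states that negative product bounding boxes are folded into the loss governing matching and classification, but does not give the formula. A minimal sketch of the idea, under the assumption that predictions overlapping an annotated negative box are penalized in proportion to that overlap (the function names and the L1 regression term are illustrative, not the paper's exact loss):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x0, y0, x1, y1] boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def detection_loss_with_negatives(pred_boxes, gt_boxes, neg_boxes, lam=1.0):
    # Standard matched-box regression term (plain L1 for brevity;
    # pred_boxes[i] is assumed already matched to gt_boxes[i]).
    match_loss = sum(np.abs(np.subtract(p, g)).sum()
                     for p, g in zip(pred_boxes, gt_boxes))
    # Extra penalty: any prediction overlapping a negative product
    # bounding box is pushed away from it.
    neg_loss = sum(iou(p, n) for p in pred_boxes for n in neg_boxes)
    return match_loss + lam * neg_loss

pred = [np.array([0.0, 0.0, 1.0, 1.0])]
gt   = [np.array([0.0, 0.0, 1.0, 1.0])]
neg  = [np.array([0.0, 0.0, 0.5, 0.5])]
print(detection_loss_with_negatives(pred, gt, neg))  # 0.25
```

The weight `lam` trades off fitting the ground-truth boxes against avoiding the confusing regions the negative boxes mark, which is the mechanism behind the 27.2% hard-negative improvement reported above.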
fulltext fulltext_linktorsrc
identifier EISSN: 2642-9381
ispartof 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, p.8000-8009
issn 2642-9381
language eng
recordid cdi_ieee_primary_10484189
source IEEE Xplore All Conference Series
subjects Adaptation models
Algorithms
Annotations
Applications
Commercial / retail
Computational modeling
Computer vision
Robotics
Soft sensors
Training
Vision + language and/or other modalities
Visualization
title CoD: Coherent Detection of Entities from Images with Multiple Modalities