
CoD: Coherent Detection of Entities from Images with Multiple Modalities

Object detection is a fundamental problem in computer vision, whose research has primarily focused on unimodal models, solely operating on visual data. However, in many real-world applications, data from multiple modalities may be available, such as text accompanying the visual data. Leveraging traditional models on these multi-modal data sources may lead to difficulties in accurately delineating object boundaries. For example, in a document containing a combination of text and images, the model must encompass the images and texts pertaining to the same object in a single bounding box. To address this, we propose a model that takes in multi-scale image features, text extracted through OCR, and 2D positional embeddings of words as inputs, and returns bounding boxes that incorporate the image and associated description as single entities. Furthermore, to address the challenge posed by the irregular arrangement of images and their corresponding textual descriptions, we propose the concept of a "Negative Product Bounding Box" (PBB). This box encapsulates instances where the model faces confusion and tends to predict incorrect bounding boxes. To enhance the model's performance, we incorporate these negative boxes into the loss function governing matching and classification. Additionally, a domain adaptation model is proposed to handle scenarios involving a domain gap between training and test samples. In order to assess the effectiveness of our model, we construct a multimodal dataset comprising product descriptions from online retailers' catalogs. On this dataset, our proposed model demonstrates significant improvements of 27.2%, 4.3%, and 1.7% in handling hard negative samples, multi-modal input, and domain shift scenarios, respectively.


Bibliographic Details
Main Authors: Verma, Vinay; Sanny, Dween; Singh, Abhishek; Gupta, Deepak
Format: Conference Proceeding
Language: English
Subjects: Adaptation models; Algorithms; Annotations; Applications; Commercial / retail; Computational modeling; Computer vision; Robotics; Soft sensors; Training; Vision + language and/or other modalities; Visualization
container_end_page 8009
container_start_page 8000
creator Verma, Vinay
Sanny, Dween
Singh, Abhishek
Gupta, Deepak
description Object detection is a fundamental problem in computer vision, whose research has primarily focused on unimodal models, solely operating on visual data. However, in many real-world applications, data from multiple modalities may be available, such as text accompanying the visual data. Leveraging traditional models on these multi-modal data sources may lead to difficulties in accurately delineating object boundaries. For example, in a document containing a combination of text and images, the model must encompass the images and texts pertaining to the same object in a single bounding box. To address this, we propose a model that takes in multi-scale image features, text extracted through OCR, and 2D positional embeddings of words as inputs, and returns bounding boxes that incorporate the image and associated description as single entities. Furthermore, to address the challenge posed by the irregular arrangement of images and their corresponding textual descriptions, we propose the concept of a "Negative Product Bounding Box" (PBB). This box encapsulates instances where the model faces confusion and tends to predict incorrect bounding boxes. To enhance the model's performance, we incorporate these negative boxes into the loss function governing matching and classification. Additionally, a domain adaptation model is proposed to handle scenarios involving a domain gap between training and test samples. In order to assess the effectiveness of our model, we construct a multimodal dataset comprising product descriptions from online retailers' catalogs. On this dataset, our proposed model demonstrates significant improvements of 27.2%, 4.3%, and 1.7% in handling hard negative samples, multi-modal input, and domain shift scenarios, respectively.
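The description above names three input streams — multi-scale image features, OCR-extracted words, and 2D positional embeddings of those words — that are fused before detection. The paper's actual architecture is not reproduced in this record; the following is a minimal sketch of that fusion under stated assumptions (the function name `embed_multimodal_inputs` and the fixed random projection standing in for a learned positional layer are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_multimodal_inputs(image_feats, ocr_tokens, ocr_boxes, d_model=64):
    """Fuse the three input streams from the abstract into one token sequence.

    image_feats : list of (H_i * W_i, d_model) arrays, one per feature scale
    ocr_tokens  : (N, d_model) word embeddings from an OCR engine
    ocr_boxes   : (N, 4) normalized [x0, y0, x1, y1] word boxes on the page
    """
    # 2D positional embedding: project each word's 4 box coordinates to
    # d_model dimensions (a learned layer in practice; random here).
    pos_proj = rng.standard_normal((4, d_model)) / np.sqrt(4)
    pos_emb = ocr_boxes @ pos_proj                 # (N, d_model)
    text_tokens = ocr_tokens + pos_emb             # position-aware words
    visual_tokens = np.concatenate(image_feats, axis=0)
    # A detector head would attend over this joint sequence.
    return np.concatenate([visual_tokens, text_tokens], axis=0)

# Toy shapes: two feature scales (4x4 and 2x2 grids), three OCR words.
feats = [rng.standard_normal((16, 64)), rng.standard_normal((4, 64))]
words = rng.standard_normal((3, 64))
boxes = np.array([[0.1, 0.1, 0.3, 0.15],
                  [0.1, 0.2, 0.4, 0.25],
                  [0.5, 0.1, 0.9, 0.6]])
tokens = embed_multimodal_inputs(feats, words, boxes)
print(tokens.shape)  # (23, 64)
```

The 2D (rather than 1D) positional embedding is what lets the model relate a word's location on the page to nearby image regions, which is the prerequisite for grouping an image and its description into one box.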
doi_str_mv 10.1109/WACV57701.2024.00783
format conference_proceeding
date 2024-01-03
eisbn 9798350318920
coden IEEPAD
publisher IEEE
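The abstract states that negative product bounding boxes are folded into the loss governing matching and classification, but does not give the formula. A minimal sketch of the idea, under the assumption that predictions overlapping an annotated negative box are penalized in proportion to that overlap (the function names and the L1 regression term are illustrative, not the paper's exact loss):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x0, y0, x1, y1] boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def detection_loss_with_negatives(pred_boxes, gt_boxes, neg_boxes, lam=1.0):
    # Standard matched-box regression term (plain L1 for brevity;
    # pred_boxes[i] is assumed already matched to gt_boxes[i]).
    match_loss = sum(np.abs(np.subtract(p, g)).sum()
                     for p, g in zip(pred_boxes, gt_boxes))
    # Extra penalty: any prediction overlapping a negative product
    # bounding box is pushed away from it.
    neg_loss = sum(iou(p, n) for p in pred_boxes for n in neg_boxes)
    return match_loss + lam * neg_loss

pred = [np.array([0.0, 0.0, 1.0, 1.0])]
gt   = [np.array([0.0, 0.0, 1.0, 1.0])]
neg  = [np.array([0.0, 0.0, 0.5, 0.5])]
print(detection_loss_with_negatives(pred, gt, neg))  # 0.25
```

The weight `lam` trades off fitting the ground-truth boxes against avoiding the confusing regions the negative boxes mark, which is the mechanism behind the 27.2% hard-negative improvement reported above.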
fulltext fulltext_linktorsrc
identifier EISSN: 2642-9381
ispartof 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, p.8000-8009
issn 2642-9381
language eng
recordid cdi_ieee_primary_10484189
source IEEE Xplore All Conference Series
subjects Adaptation models
Algorithms
Annotations
Applications
Commercial / retail
Computational modeling
Computer vision
Robotics
Soft sensors
Training
Vision + language and/or other modalities
Visualization
title CoD: Coherent Detection of Entities from Images with Multiple Modalities