CoD: Coherent Detection of Entities from Images with Multiple Modalities
Object detection is a fundamental problem in computer vision, whose research has primarily focused on unimodal models, solely operating on visual data. However, in many real-world applications, data from multiple modalities may be available, such as text accompanying the visual data. Leveraging traditional models on these multi-modal data sources may lead to difficulties in accurately delineating object boundaries. For example, in a document containing a combination of text and images, the model must encompass the images and texts pertaining to the same object in a single bounding box. To address this, we propose a model that takes in multi-scale image features, text extracted through OCR, and 2D positional embeddings of words as inputs, and returns bounding boxes that incorporate the image and associated description as single entities. Furthermore, to address the challenge posed by the irregular arrangement of images and their corresponding textual descriptions, we propose the concept of a "Negative Product Bounding Box" (PBB). This box encapsulates instances where the model faces confusion and tends to predict incorrect bounding boxes. To enhance the model's performance, we incorporate these negative boxes into the loss function governing matching and classification. Additionally, a domain adaptation model is proposed to handle scenarios involving a domain gap between training and test samples. In order to assess the effectiveness of our model, we construct a multimodal dataset comprising product descriptions from online retailers' catalogs. On this dataset, our proposed model demonstrates significant improvements of 27.2%, 4.3%, and 1.7% in handling hard negative samples, multi-modal input, and domain shift scenarios, respectively.
Main Authors: | Verma, Vinay; Sanny, Dween; Singh, Abhishek; Gupta, Deepak |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | Adaptation models; Algorithms; Annotations; Applications; Commercial / retail; Computational modeling; Computer vision; Robotics; Soft sensors; Training; Vision + language and/or other modalities; Visualization |
container_end_page | 8009 |
container_start_page | 8000 |
creator | Verma, Vinay; Sanny, Dween; Singh, Abhishek; Gupta, Deepak |
description | Object detection is a fundamental problem in computer vision, whose research has primarily focused on unimodal models, solely operating on visual data. However, in many real-world applications, data from multiple modalities may be available, such as text accompanying the visual data. Leveraging traditional models on these multi-modal data sources may lead to difficulties in accurately delineating object boundaries. For example, in a document containing a combination of text and images, the model must encompass the images and texts pertaining to the same object in a single bounding box. To address this, we propose a model that takes in multi-scale image features, text extracted through OCR, and 2D positional embeddings of words as inputs, and returns bounding boxes that incorporate the image and associated description as single entities. Furthermore, to address the challenge posed by the irregular arrangement of images and their corresponding textual descriptions, we propose the concept of a "Negative Product Bounding Box" (PBB). This box encapsulates instances where the model faces confusion and tends to predict incorrect bounding boxes. To enhance the model's performance, we incorporate these negative boxes into the loss function governing matching and classification. Additionally, a domain adaptation model is proposed to handle scenarios involving a domain gap between training and test samples. In order to assess the effectiveness of our model, we construct a multimodal dataset comprising product descriptions from online retailers' catalogs. On this dataset, our proposed model demonstrates significant improvements of 27.2%, 4.3%, and 1.7% in handling hard negative samples, multi-modal input, and domain shift scenarios, respectively. |
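The description above names three input streams that the model consumes: multi-scale image features, OCR-extracted words, and 2D positional embeddings of those words. As a minimal illustrative sketch of how such inputs could be assembled into one token sequence (this is an assumption for illustration, not the authors' implementation; `embed_2d_position`, `text_embed`, and `build_multimodal_tokens` are hypothetical names):

```python
import numpy as np

def embed_2d_position(box, d=16):
    """Hypothetical 2D positional embedding: sinusoidal features of the
    normalized (x0, y0, x1, y1) word box, concatenated per coordinate."""
    freqs = 1.0 / (10000 ** (np.arange(d // 8) * 8.0 / d))
    parts = []
    for coord in box:  # each coordinate normalized to [0, 1]
        angles = coord * freqs
        parts.append(np.concatenate([np.sin(angles), np.cos(angles)]))
    return np.concatenate(parts)  # shape (d,)

def build_multimodal_tokens(image_feats, ocr_words, word_boxes, d=16):
    """Concatenate visual tokens (multi-scale features flattened into a
    token sequence) with OCR word tokens (text embedding plus 2D position)."""
    def text_embed(word):  # toy stand-in for any word-embedding lookup
        rng = np.random.default_rng(abs(hash(word)) % (2**32))
        return rng.standard_normal(d)
    visual_tokens = [f for scale in image_feats for f in scale]
    text_tokens = [text_embed(w) + embed_2d_position(b, d)
                   for w, b in zip(ocr_words, word_boxes)]
    return np.stack(visual_tokens + text_tokens)  # (n_tokens, d)
```

Any detection head operating on this joint sequence can then attend across both modalities, which is what lets a predicted box cover an image and its accompanying text as one entity.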
doi_str_mv | 10.1109/WACV57701.2024.00783 |
format | conference_proceeding |
fullrecord | 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024-01-03, pp. 8000-8009; publisher: IEEE; EISSN: 2642-9381; EISBN: 9798350318920; CODEN: IEEPAD; DOI: 10.1109/WACV57701.2024.00783 |
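The abstract describes folding "negative product bounding boxes" into the loss governing matching and classification. A simplified sketch of that idea follows, assuming IoU-based matching and a binary objectness loss (the function names, threshold, and loss form are illustrative assumptions, not the paper's actual formulation):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x0, y0, x1, y1) form."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def classification_loss_with_negatives(preds, positives, negatives,
                                       neg_weight=1.0, thr=0.5):
    """Toy objectness loss: predictions matched (by IoU) to a positive box
    should score high, while predictions overlapping an annotated negative
    box are explicitly pushed toward low confidence.
    `preds` is a list of (box, objectness_prob) pairs."""
    eps = 1e-9
    loss = 0.0
    for box, p in preds:
        if any(iou(box, g) >= thr for g in positives):
            loss += -np.log(p + eps)                    # reward confident match
        elif any(iou(box, n) >= thr for n in negatives):
            loss += -neg_weight * np.log(1 - p + eps)   # suppress confusing box
    return loss
```

The point of the extra term is that confusing regions, instead of being ignored as unmatched background, contribute gradient that actively discourages the incorrect boxes the model tends to predict there.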
fulltext | fulltext_linktorsrc |
identifier | EISSN: 2642-9381 |
ispartof | 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, p.8000-8009 |
issn | 2642-9381 |
language | eng |
recordid | cdi_ieee_primary_10484189 |
source | IEEE Xplore All Conference Series |
subjects | Adaptation models; Algorithms; Annotations; Applications; Commercial / retail; Computational modeling; Computer vision; Robotics; Soft sensors; Training; Vision + language and/or other modalities; Visualization |
title | CoD: Coherent Detection of Entities from Images with Multiple Modalities |