
Vision-language integration using constrained local semantic features

• Vision and language integration at two levels, including the semantic level.
• A semantic signature that adapts its sparsity to the actual visual content of images.
• CNN-based mid-level features boosting semantic signatures.
• Top performances on publicly available benchmarks for several tasks.

Bibliographic Details
Published in: Computer vision and image understanding, 2017-10, Vol. 163, p. 41-57
Main Authors: Tamaazousti, Youssef; Le Borgne, Hervé; Popescu, Adrian; Gadeski, Etienne; Ginsca, Alexandru; Hudelot, Céline
Format: Article
Language: English
Subjects: Bi-modal classification; Common latent space; Computer Science; Computer Vision and Pattern Recognition; Concept-based sparsification; Constrained local regions; Image classification; Image retrieval; Pure concept space; Semantic features; Vision-language integration
ISSN: 1077-3142
EISSN: 1090-235X
DOI: 10.1016/j.cviu.2017.05.017
Publisher: Elsevier Inc.
Description: This paper tackles two recent promising issues in the field of computer vision, namely "the integration of linguistic and visual information" and "the use of semantic features to represent the image content". Semantic features represent images according to a set of visual concepts detected in the image by base classifiers, and recent works exhibit competitive performance in image classification and retrieval using such features. We propose to rely on this type of image description to facilitate its integration with linguistic data. More precisely, the contribution of this paper is threefold. First, we propose to automatically determine the most useful dimensions of a semantic representation according to the actual image content; this yields a level of sparsity for the semantic features that is adapted to each image independently. Our model takes into account both the confidence of each base classifier and the global amount of information in the semantic signature, defined in the Shannon sense. This contribution is further extended to better reflect the detection of a visual concept at a local scale. Second, we introduce a new strategy to learn, with CNNs, an efficient mid-level representation that boosts the performance of semantic signatures. Third, we propose several schemes to integrate a visual representation based on semantic features with linguistic information, leading to the nesting of linguistic information at two levels of the visual features. Experimental validation is conducted on four benchmarks (VOC 2007, VOC 2012, NUS-WIDE and MIT Indoor) for classification, three of them for retrieval and two of them for bi-modal classification. The proposed semantic feature achieves state-of-the-art performance on three classification benchmarks and on all retrieval ones. Our vision-language integration method likewise achieves state-of-the-art performance in bi-modal classification.
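The abstract describes an entropy-driven, image-adaptive sparsification of the semantic signature. The record does not include the authors' actual formulation, so the following is only a minimal sketch of the general idea under stated assumptions: normalize the base-classifier scores into a distribution, derive an "effective number of concepts" from its Shannon entropy (via perplexity), and keep only that many of the highest-scoring dimensions. The perplexity cutoff, function names, and dimensions are illustrative assumptions, not the method of Tamaazousti et al.

```python
import numpy as np

def adaptive_sparsify(scores: np.ndarray) -> np.ndarray:
    """Sparsify a semantic signature with an image-adaptive support size.

    `scores` holds the detection scores of the base concept classifiers
    for one image. The retained dimensionality is derived from the
    Shannon entropy of the normalized signature: a peaked signature
    (few confident concepts) keeps few dimensions, a flat one keeps many.
    Hypothetical stand-in for the paper's formulation.
    """
    # Normalize the scores into a distribution over concepts.
    p = np.clip(scores, 1e-12, None)
    p = p / p.sum()

    # Shannon entropy (in bits) and its perplexity: the "effective
    # number" of concepts carrying the signature's information.
    entropy = -np.sum(p * np.log2(p))
    k = max(1, int(np.ceil(2.0 ** entropy)))
    k = min(k, scores.size)

    # Keep only the k highest-confidence dimensions, zero out the rest.
    sparse = np.zeros_like(scores)
    top_k = np.argsort(scores)[-k:]
    sparse[top_k] = scores[top_k]
    return sparse

# Example: a peaked signature keeps far fewer concepts than a flat one.
rng = np.random.default_rng(0)
peaked = rng.dirichlet(np.full(1000, 0.01))  # few dominant concepts
flat = rng.dirichlet(np.full(1000, 10.0))    # many weak concepts
print(np.count_nonzero(adaptive_sparsify(peaked)))  # small support
print(np.count_nonzero(adaptive_sparsify(flat)))    # large support
```

With this stand-in rule, an image dominated by a few confident concepts yields a small support while an ambiguous one keeps many dimensions, which matches the abstract's claim that the level of sparsity adapts to each image independently.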