
Vision-language integration using constrained local semantic features

• Vision and language integration at two levels, including the semantic level.
• A semantic signature that adapts its sparsity to the actual visual content of images.
• CNN-based mid-level features boosting semantic signatures.
• Top performances on publicly available benchmarks for several tasks.

Bibliographic Details
Published in: Computer vision and image understanding, 2017-10, Vol. 163, p. 41-57
Main Authors: Tamaazousti, Youssef; Le Borgne, Hervé; Popescu, Adrian; Gadeski, Etienne; Ginsca, Alexandru; Hudelot, Céline
Format: Article
Language: English
Subjects: Bi-modal classification; Common latent space; Computer Science; Computer Vision and Pattern Recognition; Concept-based sparsification; Constrained local regions; Image classification; Image retrieval; Pure concept space; Semantic features; Vision-language integration
ISSN: 1077-3142
EISSN: 1090-235X
DOI: 10.1016/j.cviu.2017.05.017
Publisher: Elsevier Inc.
Description: This paper tackles two recent promising issues in the field of computer vision, namely "the integration of linguistic and visual information" and "the use of semantic features to represent the image content". Semantic features represent images according to a set of visual concepts detected in the image by base classifiers, and recent works exhibit competitive performance in image classification and retrieval using such features. We propose to rely on this type of image description to facilitate its integration with linguistic data. More precisely, the contribution of this paper is threefold. First, we propose to automatically determine the most useful dimensions of a semantic representation according to the actual image content; this yields a level of sparsity for the semantic features that is adapted to each image independently. Our model takes into account both the confidence of each base classifier and the global amount of information in the semantic signature, defined in the Shannon sense. This contribution is further extended to better reflect the detection of a visual concept at a local scale. Second, we introduce a new strategy to learn, with CNNs, an efficient mid-level representation that boosts the performance of semantic signatures. Third, we propose several schemes to integrate a visual representation based on semantic features with linguistic information, leading to the nesting of linguistic information at two levels of the visual features. Experimental validation is conducted on four benchmarks (VOC 2007, VOC 2012, NUS-WIDE and MIT Indoor) for classification, three of them for retrieval and two of them for bi-modal classification. The proposed semantic feature achieves state-of-the-art performance on three classification benchmarks and on all retrieval ones. Our vision-language integration method likewise achieves state-of-the-art performance in bi-modal classification.
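The abstract describes an entropy-driven, image-adaptive sparsification of the semantic signature. The record does not include the authors' actual formulation, so the following is only a minimal sketch of the general idea under stated assumptions: normalize the base-classifier scores into a distribution, derive an "effective number of concepts" from its Shannon entropy (via perplexity), and keep only that many of the highest-scoring dimensions. The perplexity cutoff, function names, and dimensions are illustrative assumptions, not the method of Tamaazousti et al.

```python
import numpy as np

def adaptive_sparsify(scores: np.ndarray) -> np.ndarray:
    """Sparsify a semantic signature with an image-adaptive support size.

    `scores` holds the detection scores of the base concept classifiers
    for one image. The retained dimensionality is derived from the
    Shannon entropy of the normalized signature: a peaked signature
    (few confident concepts) keeps few dimensions, a flat one keeps many.
    Hypothetical stand-in for the paper's formulation.
    """
    # Normalize the scores into a distribution over concepts.
    p = np.clip(scores, 1e-12, None)
    p = p / p.sum()

    # Shannon entropy (in bits) and its perplexity: the "effective
    # number" of concepts carrying the signature's information.
    entropy = -np.sum(p * np.log2(p))
    k = max(1, int(np.ceil(2.0 ** entropy)))
    k = min(k, scores.size)

    # Keep only the k highest-confidence dimensions, zero out the rest.
    sparse = np.zeros_like(scores)
    top_k = np.argsort(scores)[-k:]
    sparse[top_k] = scores[top_k]
    return sparse

# Example: a peaked signature keeps far fewer concepts than a flat one.
rng = np.random.default_rng(0)
peaked = rng.dirichlet(np.full(1000, 0.01))  # few dominant concepts
flat = rng.dirichlet(np.full(1000, 10.0))    # many weak concepts
print(np.count_nonzero(adaptive_sparsify(peaked)))  # small support
print(np.count_nonzero(adaptive_sparsify(flat)))    # large support
```

With this stand-in rule, an image dominated by a few confident concepts yields a small support while an ambiguous one keeps many dimensions, which matches the abstract's claim that the level of sparsity adapts to each image independently.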