
Locally controllable network based on visual–linguistic relation alignment for text-to-image generation

Since locally controllable text-to-image generation cannot achieve satisfactory results in detail, a novel locally controllable text-to-image generation network based on visual–linguistic relation alignment is proposed. The goal of the method is to complete image processing and generation semantically through text guidance. The proposed method explores the relationship between text and image to achieve local control of text-to-image generation. The visual–linguistic matching learns the similarity weights between image and text through semantic features to achieve the fine-grained correspondence between local images and words. The instance-level optimization function is introduced into the generation process to accurately control the weight with low similarity and combine with text features to generate new visual attributes. In addition, a local control loss is proposed to preserve the details of the text and local regions of the image. Extensive experiments demonstrate the superior performance of the proposed method and enable more accurate control of the original image.
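The abstract above describes learning fine-grained similarity weights between individual words and local image regions. As a rough illustration only (the authors' actual implementation is not given in this record), the minimal Python/PyTorch sketch below shows one common way such word–region weights can be computed: cosine similarity between word features and region features, followed by a temperature-scaled softmax. The function name, feature shapes, and the gamma temperature are hypothetical assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F

def word_region_similarity(words: torch.Tensor, regions: torch.Tensor, gamma: float = 4.0) -> torch.Tensor:
    # Hypothetical sketch, not the authors' code.
    # words:   (T, D) word-level text features from some text encoder
    # regions: (N, D) local image-region features from some image encoder
    # gamma:   assumed temperature that sharpens the per-word soft assignment
    words = F.normalize(words, dim=-1)        # unit-norm features so the dot product is cosine similarity
    regions = F.normalize(regions, dim=-1)
    sim = words @ regions.t()                 # (T, N) word-to-region similarities
    return F.softmax(gamma * sim, dim=-1)     # per-word weight distribution over regions

if __name__ == "__main__":
    T, N, D = 8, 49, 256                      # e.g. 8 words, a 7x7 grid of regions, 256-d features
    weights = word_region_similarity(torch.randn(T, D), torch.randn(N, D))
    print(weights.shape)                      # torch.Size([8, 49])

Weights of this kind are a plausible input for the instance-level optimization and local control loss mentioned in the abstract, for example by identifying low-similarity regions whose visual attributes should be regenerated from the text features; that mapping is an assumption, not a detail stated in this record.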

Bibliographic Details
Published in: Multimedia systems 2024-02, Vol.30 (1), Article 34
Main Authors: Li, Zaike, Liu, Li, Zhang, Huaxiang, Liu, Dongmei, Song, Yu, Li, Boqun
Format: Article
Language: English
Subjects: Alignment; Computer Communication Networks; Computer Graphics; Computer Science; Controllability; Cryptology; Data Storage Representation; Image processing; Linguistics; Multimedia Information Systems; Operating Systems; Regular Paper; Similarity
ISSN: 0942-4962
EISSN: 1432-1882
DOI: 10.1007/s00530-023-01222-7
Publisher: Springer Berlin Heidelberg (Springer Nature)