
Locally controllable network based on visual–linguistic relation alignment for text-to-image generation

Since locally controllable text-to-image generation cannot achieve satisfactory results in detail, a novel locally controllable text-to-image generation network based on visual–linguistic relation alignment is proposed. The goal of the method is to complete image processing and generation semantically through text guidance. The proposed method explores the relationship between text and image to achieve local control of text-to-image generation. The visual–linguistic matching learns the similarity weights between image and text through semantic features to achieve the fine-grained correspondence between local images and words. The instance-level optimization function is introduced into the generation process to accurately control the weight with low similarity and combine with text features to generate new visual attributes. In addition, a local control loss is proposed to preserve the details of the text and local regions of the image. Extensive experiments demonstrate the superior performance of the proposed method and enable more accurate control of the original image.
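The abstract above describes learning fine-grained similarity weights between individual words and local image regions. As a rough illustration only (the authors' actual implementation is not given in this record), the minimal Python/PyTorch sketch below shows one common way such word–region weights can be computed: cosine similarity between word features and region features, followed by a temperature-scaled softmax. The function name, feature shapes, and the gamma temperature are hypothetical assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F

def word_region_similarity(words: torch.Tensor, regions: torch.Tensor, gamma: float = 4.0) -> torch.Tensor:
    # Hypothetical sketch, not the authors' code.
    # words:   (T, D) word-level text features from some text encoder
    # regions: (N, D) local image-region features from some image encoder
    # gamma:   assumed temperature that sharpens the per-word soft assignment
    words = F.normalize(words, dim=-1)        # unit-norm features so the dot product is cosine similarity
    regions = F.normalize(regions, dim=-1)
    sim = words @ regions.t()                 # (T, N) word-to-region similarities
    return F.softmax(gamma * sim, dim=-1)     # per-word weight distribution over regions

if __name__ == "__main__":
    T, N, D = 8, 49, 256                      # e.g. 8 words, a 7x7 grid of regions, 256-d features
    weights = word_region_similarity(torch.randn(T, D), torch.randn(N, D))
    print(weights.shape)                      # torch.Size([8, 49])

Weights of this kind are a plausible input for the instance-level optimization and local control loss mentioned in the abstract, for example by identifying low-similarity regions whose visual attributes should be regenerated from the text features; that mapping is an assumption, not a detail stated in this record.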

Bibliographic Details
Published in: Multimedia systems 2024-02, Vol.30 (1), Article 34
Main Authors: Li, Zaike, Liu, Li, Zhang, Huaxiang, Liu, Dongmei, Song, Yu, Li, Boqun
Format: Article
Language: English
Subjects: Alignment; Computer Communication Networks; Computer Graphics; Computer Science; Controllability; Cryptology; Data Storage Representation; Image processing; Linguistics; Multimedia Information Systems; Operating Systems; Regular Paper; Similarity
ISSN: 0942-4962
EISSN: 1432-1882
DOI: 10.1007/s00530-023-01222-7
Publisher: Springer Berlin Heidelberg (Springer Nature)