Locally controllable network based on visual–linguistic relation alignment for text-to-image generation
Since locally controllable text-to-image generation cannot achieve satisfactory results in detail, a novel locally controllable text-to-image generation network based on visual–linguistic relation alignment is proposed. The goal of the method is to complete image processing and generation semantically through text guidance. The proposed method explores the relationship between text and image to achieve local control of text-to-image generation. The visual–linguistic matching learns the similarity weights between image and text through semantic features to achieve the fine-grained correspondence between local images and words. The instance-level optimization function is introduced into the generation process to accurately control the weight with low similarity and combine with text features to generate new visual attributes. In addition, a local control loss is proposed to preserve the details of the text and local regions of the image. Extensive experiments demonstrate the superior performance of the proposed method and enable more accurate control of the original image.
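The abstract's core mechanism, learning similarity weights between words and image regions and then modifying only the weakly matched regions from text, can be illustrated with a minimal sketch. The code below is not the authors' released implementation: the function names, the 256-dimensional shared feature space, and the `threshold` hyper-parameter are illustrative assumptions, and the paper's instance-level optimization function and local control loss are not reproduced here.

```python
# A minimal sketch (not the authors' code) of word-region similarity
# weighting: each word is compared against image-region features, and
# regions whose best word similarity is low are treated as candidates
# for text-guided regeneration.
import torch
import torch.nn.functional as F

def word_region_similarity(word_feats, region_feats):
    """word_feats: (T, D) word embeddings; region_feats: (R, D) region features.
    Returns a (T, R) matrix of cosine similarities."""
    w = F.normalize(word_feats, dim=-1)
    r = F.normalize(region_feats, dim=-1)
    return w @ r.t()  # fine-grained word-region scores

def local_edit_mask(sim, threshold=0.3):
    """Regions whose strongest word match falls below `threshold`
    (a hypothetical hyper-parameter) are marked for modification."""
    best_per_region, _ = sim.max(dim=0)  # (R,) strongest word match per region
    return (best_per_region < threshold).float()

# Toy usage: 5 words, 16 image regions, assumed 256-d shared semantic space.
words = torch.randn(5, 256)
regions = torch.randn(16, 256)
sim = word_region_similarity(words, regions)
mask = local_edit_mask(sim)
print(sim.shape, int(mask.sum().item()), "regions selected for modification")
```

In the paper itself these weights would presumably be learned jointly with the generator rather than computed from fixed features as in this toy example.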
Published in: Multimedia Systems, 2024-02, Vol. 30 (1), Article 34
Main Authors: Li, Zaike; Liu, Li; Zhang, Huaxiang; Liu, Dongmei; Song, Yu; Li, Boqun
Format: Article
Language: English
Publisher: Springer Berlin Heidelberg
DOI: 10.1007/s00530-023-01222-7
ISSN: 0942-4962
EISSN: 1432-1882
Subjects: Alignment; Computer Communication Networks; Computer Graphics; Computer Science; Controllability; Cryptology; Data Storage Representation; Image processing; Linguistics; Multimedia Information Systems; Operating Systems; Regular Paper; Similarity