
Text-based person search via cross-modal alignment learning

Text-based person search aims to use text descriptions to retrieve corresponding person images. However, because of the pronounced pattern differences between the image and text modalities, aligning the two modalities remains a challenging problem. Most existing approaches consider semantic alignment only within a global context or over local parts, with little attention to how image and text should be matched given their differences in modal information. Therefore, in this paper, we propose an efficient Modality-Aligned Person Search network (MAPS) to address this problem. First, we suppress image-specific information through image feature style normalization, aligning modality knowledge and reducing the information gap between text and image. Second, we design a multi-granularity modal feature fusion and optimization method to enrich the modal features. To address the useless and redundant information in the multi-granularity fused features, we propose a Multi-granularity Feature Self-optimization Module (MFSM) that adaptively adjusts the contribution of each granularity in the fused features of the two modalities. Finally, to address the inconsistency of information between the training and inference stages, we propose Cross-instance Feature Alignment (CFA), which helps the network strengthen category-level generalization and improve retrieval performance. Extensive experiments demonstrate that MAPS achieves state-of-the-art performance on all text-based person search datasets and significantly outperforms existing methods.

Highlights:
• A novel text-based person search network that reduces modal differences while learning sufficient modal features.
• A multi-granularity feature self-optimization module that optimizes multiscale image features and multi-level semantic text features, learning more discriminative features while suppressing useless and redundant information.
• A cross-instance feature alignment that constructs image–text feature pairs with category-level information participating in training.
• Extensive experiments on the CUHK-PEDES and ICFG-PEDES datasets show that MAPS achieves state-of-the-art performance, significantly outperforming existing methods.
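This record does not include implementation details, but the style-normalization step described above is conceptually close to instance normalization, which removes the per-image channel statistics that carry appearance "style". A minimal PyTorch sketch, assuming channel-wise instance normalization is the intended mechanism (the function name and exact formulation are illustrative, not taken from the paper):

    import torch

    def style_normalize(feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        # feat: (batch, channels, height, width) CNN feature map.
        # Removing per-instance, per-channel mean/std suppresses
        # image-specific "style" while preserving spatial structure.
        mu = feat.mean(dim=(2, 3), keepdim=True)
        sigma = feat.var(dim=(2, 3), keepdim=True).add(eps).sqrt()
        return (feat - mu) / sigma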

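The MFSM is likewise described only at the level of adaptively adjusting each granularity's contribution to the fused features. One plausible reading is a learned gate that scores every granularity level and fuses the levels with softmax weights; the class below is a hypothetical illustration of that idea, not the module from the paper:

    import torch
    import torch.nn as nn

    class GranularityGate(nn.Module):
        # Hypothetical stand-in for the MFSM: score each granularity
        # level, then fuse levels with data-dependent softmax weights.
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, num_levels, dim) stacked granularity features.
            weights = torch.softmax(self.score(feats), dim=1)  # (B, K, 1)
            return (weights * feats).sum(dim=1)                # (B, dim)

    # Example: fuse three granularity levels of 512-d features.
    fused = GranularityGate(dim=512)(torch.randn(8, 3, 512))  # (8, 512)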

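Finally, CFA is said to construct image–text feature pairs that carry category-level (person-identity) information during training. How the paper does this is not recoverable from the abstract; the sketch below shows one common way such category-level alignment is realized, pulling each text embedding toward the centroid of image embeddings that share its identity (the loss form and all names are assumptions):

    import torch
    import torch.nn.functional as F

    def cross_instance_alignment_loss(img_feats: torch.Tensor,
                                      txt_feats: torch.Tensor,
                                      labels: torch.Tensor) -> torch.Tensor:
        # img_feats, txt_feats: (batch, dim); labels: (batch,) person IDs.
        # Each text embedding is pulled toward the image centroid of its
        # identity, a target that is stable across individual instances.
        img_feats = F.normalize(img_feats, dim=1)
        txt_feats = F.normalize(txt_feats, dim=1)
        pids = labels.unique()
        loss = img_feats.new_zeros(())
        for pid in pids:
            mask = labels == pid
            centroid = F.normalize(img_feats[mask].mean(dim=0), dim=0)
            loss = loss + (1.0 - txt_feats[mask] @ centroid).mean()
        return loss / pids.numel()

Centroid targets of this kind inject identity-level information that individual image–text pairs lack, which matches the stated goal of reducing the information gap between training and inference.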
Bibliographic Details
Published in: Pattern Recognition, 2024-08, Vol. 152, Article 110481
Main Authors: Ke, Xiao; Liu, Hao; Xu, Peirong; Lin, Xinru; Guo, Wenzhong
Format: Article
Language: English
Publisher: Elsevier Ltd
Subjects: CNN; Cross-modality; Image–text retrieval; Person re-identification
ISSN: 0031-3203
EISSN: 1873-5142
DOI: 10.1016/j.patcog.2024.110481