Pseudo-labeling with keyword refining for few-supervised video captioning

Bibliographic Details
Published in:Pattern recognition 2025-03, Vol.159, p.111176, Article 111176
Main Authors: Li, Ping, Wang, Tao, Zhao, Xinkui, Xu, Xianghua, Song, Mingli
Format: Article
Language:English
Description: Video captioning generates a sentence that describes the content of a video. Existing methods typically require many captions (e.g., 10 or 20) per video to train the model, which is quite costly. In this work, we explore the possibility of using only one or very few ground-truth sentences, and introduce a new task named few-supervised video captioning. Specifically, we propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module. Unlike random sampling in natural language processing, which may cause invalid modifications (i.e., edited words), the former module guides the model to edit words using actions (e.g., copy, replace, insert, and delete) predicted by a pretrained token-level classifier, and then fine-tunes the candidate sentences with a pretrained language model. Meanwhile, it employs repetition-penalized sampling to encourage the model to yield concise pseudo-labeled sentences with less repetition, and selects the most relevant sentences using a pretrained video-text model. Moreover, to keep semantic consistency between the pseudo-labeled sentences and the video content, we develop a transformer-based keyword refiner with a video-keyword gated fusion strategy to place greater emphasis on relevant words. Extensive experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios.
• A new task named few-supervised video captioning, which uses only one human-written sentence per video, is introduced.
• A pseudo-labeling strategy with lexical constraints is proposed to augment knowledge.
• A keyword-refined captioning module with video-text gated fusion is designed to generate high-quality sentences by modeling global context.
• Empirical studies demonstrate the satisfactory quality of the captions generated by the proposed method.
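The repetition-penalized sampling mentioned in the abstract follows a widely used recipe: the logits of tokens that were already generated are scaled down before sampling, so repeating a token becomes less likely. The paper's exact formulation is not given in this record, so the NumPy sketch below is illustrative only, using the common logit-division scheme.

```python
import numpy as np

def repetition_penalized_sample(logits, generated_ids, penalty=1.2, rng=None):
    """Sample the next token, down-weighting tokens already generated.

    Illustrative sketch of repetition-penalized sampling; the paper's
    exact formulation may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float).copy()
    for tok in set(generated_ids):
        # Dividing a positive logit (or multiplying a negative one) by the
        # penalty lowers the probability of emitting that token again.
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    # Softmax over the penalized logits, then sample one token id.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

With `penalty=1.0` this reduces to plain sampling; larger penalties push the model toward the more concise, less repetitive pseudo-labels the abstract describes.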
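The video-keyword gated fusion can be pictured as a per-dimension sigmoid gate that interpolates between the video features and the keyword features. The weight names (`W_g`, `b_g`) and the choice of gating inputs below are assumptions for illustration, not the paper's definition:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(video_feat, keyword_feat, W_g, b_g):
    """Fuse video and keyword features with a learned sigmoid gate.

    Hypothetical sketch: the gate is computed from the concatenated
    features, then each output dimension is a convex combination of the
    two streams. W_g has shape (2d, d) and b_g shape (d,) for d-dim
    features; both would be learned in practice.
    """
    gate = sigmoid(np.concatenate([video_feat, keyword_feat]) @ W_g + b_g)
    # gate ≈ 1 keeps the video feature; gate ≈ 0 keeps the keyword feature.
    return gate * video_feat + (1.0 - gate) * keyword_feat
```

Because the gate lies in (0, 1) per dimension, the fused vector always stays between the two input streams, which is one simple way to let relevant keywords modulate, rather than overwrite, the visual evidence.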
DOI: 10.1016/j.patcog.2024.111176
ISSN: 0031-3203
Source: ScienceDirect Journals
Subjects: Few supervision; Gated fusion; Keyword refiner; Pseudo-labeling; Video captioning