Show, Adapt and Tell: Adversarial Training of Cross-Domain Image Captioner
Impressive image captioning results are achieved in domains with plenty of training image and sentence pairs (e.g., MSCOCO). However, transferring to a target domain with significant domain shifts but no paired training data (referred to as cross-domain image captioning) remains largely unexplored....
Main Authors: | Tseng-Hung Chen, Yuan-Hong Liao, Ching-Yao Chuang, Wan-Ting Hsu, Jianlong Fu, Min Sun |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | Birds; Measurement; Planning; Testing; Training; Training data |
Online Access: | Request full text |
cited_by | |
---|---|
cites | |
container_end_page | 530 |
container_issue | |
container_start_page | 521 |
container_title | 2017 IEEE International Conference on Computer Vision (ICCV) |
container_volume | |
creator | Tseng-Hung Chen; Yuan-Hong Liao; Ching-Yao Chuang; Wan-Ting Hsu; Jianlong Fu; Min Sun |
description | Impressive image captioning results are achieved in domains with plenty of training image and sentence pairs (e.g., MSCOCO). However, transferring to a target domain with significant domain shifts but no paired training data (referred to as cross-domain image captioning) remains largely unexplored. We propose a novel adversarial training procedure to leverage unpaired data in the target domain. Two critic networks are introduced to guide the captioner, namely a domain critic and a multi-modal critic. The domain critic assesses whether the generated sentences are indistinguishable from sentences in the target domain. The multi-modal critic assesses whether an image and its generated sentence are a valid pair. During training, the critics and the captioner act as adversaries: the captioner aims to generate indistinguishable sentences, whereas the critics aim to distinguish them. The assessment improves the captioner through policy-gradient updates. During inference, we further propose a novel critic-based planning method to select high-quality sentences without additional supervision (e.g., tags). To evaluate, we use MSCOCO as the source domain and four other datasets (CUB-200-2011, Oxford-102, TGIF, and Flickr30k) as the target domains. Our method consistently performs well on all datasets. In particular, on CUB-200-2011, we achieve a 21.8% CIDEr-D improvement after adaptation. Utilizing critics during inference further gives another 4.5% boost. |
doi_str_mv | 10.1109/ICCV.2017.64 |
format | conference_proceeding |
fulltext | fulltext_linktorsrc |
identifier | EISSN: 2380-7504 |
ispartof | 2017 IEEE International Conference on Computer Vision (ICCV), 2017, p.521-530 |
issn | 2380-7504 |
language | eng |
recordid | cdi_ieee_primary_8237326 |
source | IEEE Xplore All Conference Series |
subjects | Birds; Measurement; Planning; Testing; Training; Training data |
title | Show, Adapt and Tell: Adversarial Training of Cross-Domain Image Captioner |
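The description above outlines the paper's adversarial training idea: the captioner samples a sentence, the two critics score it, and their assessment is fed back as a reward through policy-gradient (REINFORCE) updates. The following is a minimal PyTorch-style sketch of that loop; the interfaces (`captioner.sample`, `domain_critic`, `mm_critic`) and the multiplicative reward are illustrative assumptions, not the authors' released implementation.

```python
import torch

def captioner_policy_gradient_step(captioner, domain_critic, mm_critic,
                                    optimizer, images):
    """One hypothetical adversarial update of the captioner (REINFORCE-style).

    The captioner samples a caption per image; the domain critic scores how
    target-domain-like each sentence is, and the multi-modal critic scores
    whether (image, sentence) looks like a valid pair. The combined score is
    used as the reward weighting the log-likelihood of the sampled caption.
    """
    # Sample captions and keep their summed per-sentence log-probabilities.
    captions, log_probs = captioner.sample(images)      # log_probs: (batch,)

    # The critics are treated as fixed when updating the captioner.
    with torch.no_grad():
        r_domain = domain_critic(captions)              # in [0, 1], (batch,)
        r_paired = mm_critic(images, captions)          # in [0, 1], (batch,)
        reward = r_domain * r_paired                    # combined assessment

    # Policy gradient: raise the likelihood of captions the critics rate highly.
    loss = -(reward * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the same spirit, the critic-based planning used at inference could amount to scoring several candidate sentences with the critics and keeping the highest-scoring one; that re-ranking step is not shown in the sketch.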