
Learning to Reason: End-to-End Module Networks for Visual Question Answering

Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture [3, 2] implements this approach to question answering by parsing questions into linguistic substructures and assembling question-specific deep networks from smaller modules that each solve one subtask. However, existing NMN implementations rely on brittle off-the-shelf parsers, and are restricted to the module configurations proposed by these parsers rather than learning them from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting instance-specific network layouts without the aid of a parser. Our model learns to generate network structures (by imitating expert demonstrations) while simultaneously learning network parameters (using the downstream task loss). Experimental results on the new CLEVR dataset targeted at compositional question answering show that N2NMNs achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches, while discovering interpretable network architectures specialized for each question.
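The abstract describes assembling a question-specific network from reusable modules, e.g. a layout like compare(count(find[balls]), count(find[boxes])). The snippet below is a minimal illustrative sketch of that execution idea only; it is not the authors' code, the module functions are toy, non-neural stand-ins, and the layout is hard-coded rather than predicted per question as in N2NMNs.

```python
# Illustrative sketch: a toy stand-in for modular execution of the layout
# compare(count(find[balls]), count(find[boxes])). In the actual N2NMN model
# each module is a small neural network operating on image features and
# attention maps, and the layout is predicted per question.

from typing import List


def find(scene: List[str], label: str) -> List[str]:
    """Return the objects in the scene matching the queried label."""
    return [obj for obj in scene if obj == label]


def count(objects: List[str]) -> int:
    """Count the objects passed along from an upstream module."""
    return len(objects)


def compare(a: int, b: int) -> str:
    """Answer whether two counts are equal."""
    return "yes" if a == b else "no"


def answer_equal_number(scene: List[str], label_a: str, label_b: str) -> str:
    # Fixed layout for this one question type; N2NMNs would predict it.
    return compare(count(find(scene, label_a)), count(find(scene, label_b)))


if __name__ == "__main__":
    toy_scene = ["ball", "ball", "box", "box"]
    # "Is there an equal number of balls and boxes?" -> "yes"
    print(answer_equal_number(toy_scene, "ball", "box"))
```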

Bibliographic Details
Main Authors: Hu, Ronghang; Andreas, Jacob; Rohrbach, Marcus; Darrell, Trevor; Saenko, Kate
Format: Conference Proceeding
Language: English
Subjects: Cognition; Knowledge discovery; Layout; Neural networks; Pragmatics; Predictive models; Visualization
DOI: 10.1109/ICCV.2017.93
EISSN: 2380-7504
Published in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, p. 804-813
Source: IEEE Xplore All Conference Series