
Unifying (Machine) Vision via Counterfactual World Modeling

Leading approaches in machine vision employ different architectures for different tasks, trained on costly task-specific labeled datasets. This complexity has held back progress in areas such as robotics, where robust task-general perception remains a bottleneck. In contrast, "foundation models" of natural language have shown how large pre-trained neural networks can provide zero-shot solutions to a broad spectrum of apparently distinct tasks. Here we introduce Counterfactual World Modeling (CWM), a framework for constructing a visual foundation model: a unified, unsupervised network that can be prompted to perform a wide variety of visual computations. CWM has two key components, which resolve the core issues that have hindered application of the foundation model concept to vision. The first is structured masking, a generalization of masked prediction methods that encourages a prediction model to capture the low-dimensional structure in visual data. The model thereby factors the key physical components of a scene and exposes an interface to them via small sets of visual tokens. This in turn enables CWM's second main idea -- counterfactual prompting -- the observation that many apparently distinct visual representations can be computed, in a zero-shot manner, by comparing the prediction model's output on real inputs versus slightly modified ("counterfactual") inputs. We show that CWM generates high-quality readouts on real-world images and videos for a diversity of tasks, including estimation of keypoints, optical flow, occlusions, object segments, and relative depth. Taken together, our results show that CWM is a promising path to unifying the manifold strands of machine vision in a conceptually simple foundation.
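The counterfactual-prompting idea in the abstract can be illustrated with a small sketch. The snippet below is not the authors' code: the function name `counterfactual_flow_probe`, the predictor interface (a callable mapping a frame plus a sparse next-frame prompt to a predicted next frame), and the toy predictor are illustrative assumptions. It shows only the comparison step the abstract describes: query the same prediction model on a real input and on a slightly perturbed ("counterfactual") copy, then read a flow-like correspondence out of where the two predictions differ.

```python
# Illustrative sketch of counterfactual prompting; not the paper's implementation.
import numpy as np

def counterfactual_flow_probe(predictor, frame_t, next_frame_prompt, probe_yx, delta=3.0):
    """Estimate where the content at probe_yx in frame t appears in frame t+1.

    The predictor is queried twice: once with the real frame and once with a
    counterfactually perturbed copy. The predicted next frame changes most
    where the perturbed content "lands", giving a crude optical-flow readout.
    """
    # Prediction from the real (unmodified) input.
    clean = predictor(frame_t, next_frame_prompt)

    # Counterfactual input: the same frame with a small perturbation injected.
    perturbed = frame_t.copy()
    y, x = probe_yx
    perturbed[y, x] += delta
    counterfactual = predictor(perturbed, next_frame_prompt)

    # Locate the largest difference between the two predictions.
    diff = np.abs(counterfactual - clean)
    if diff.ndim == 3:                      # collapse channels if frames are RGB
        diff = diff.sum(axis=-1)
    return np.unravel_index(np.argmax(diff), diff.shape)


if __name__ == "__main__":
    # Toy stand-in for a trained predictor: shifts frame t two pixels to the
    # right, so the probe should recover that shift.
    def toy_predictor(frame_t, _prompt):
        return np.roll(frame_t, shift=2, axis=1)

    frame = np.random.rand(16, 16)
    dest = counterfactual_flow_probe(toy_predictor, frame, None, probe_yx=(8, 5))
    print("probe (8, 5) maps to", dest)     # expected: (8, 7)
```

Per the abstract, the actual system exposes an interface through small sets of visual tokens produced by a structured-masking predictor; the pixel-level perturbation and the toy shift predictor above are simplifications made so the example runs on its own, and analogous comparisons are described as yielding keypoints, occlusions, segments, and relative depth.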


Bibliographic Details
Published in: arXiv.org, 2023-06
Main Authors: Bear, Daniel M; Feigelis, Kevin; Chen, Honglin; Lee, Wanhee; Venkatesh, Rahul; Kotar, Klemen; Durango, Alex; Yamins, Daniel L K
Format: Article
Language: English
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Subjects: Image quality; Machine vision; Modelling; Neural networks; Optical flow (image analysis); Prediction models; Robotics; Task complexity; Vision systems; Visual observation
Online Access: Get full text