Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations


Bibliographic Details
Published in: arXiv.org, 2024-12
Main Authors: Xie, Yudi; Huang, Weichen; Alter, Esther; Schwartz, Jeremy; Tenenbaum, Joshua B; DiCarlo, James J
Format: Article
Language: English
EISSN: 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Subjects: Alignment; Artificial neural networks; Classification; Estimation; Graphical representations; Questions; Synthetic data
Online Access: https://www.proquest.com/docview/3144196982
Description:
Studies of the functional role of the primate ventral visual stream have traditionally focused on object categorization, often ignoring -- despite much prior evidence -- its role in estimating "spatial" latents such as object position and pose. Most leading ventral stream models are derived by optimizing networks for object categorization, which seems to imply that the ventral stream is also derived under such an objective. Here, we explore an alternative hypothesis: Might the ventral stream be optimized for estimating spatial latents? And a closely related question: How different -- if at all -- are representations learned from spatial latent estimation compared to categorization? To ask these questions, we leveraged synthetic image datasets generated by a 3D graphic engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories, and the spatial latent performance of models strongly correlates with their neural alignment. Spatial latent and category-trained models have very similar -- but not identical -- internal representations, especially in their early and middle layers. We provide evidence that this convergence is partly driven by non-target latent variability in the training data, which facilitates the implicit learning of representations of those non-target latents. Taken together, these results suggest that many training objectives, such as spatial latents, can lead to similar models aligned neurally with the ventral stream. Thus, one should not assume that the ventral stream is optimized for object categorization only. As a field, we need to continue to sharpen our measures of comparing models to brains to better understand the functional roles of the ventral stream.
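
The training objective described in the abstract can be sketched in a few lines. This is an illustrative example only, not the authors' code: the ResNet-18 backbone, the number of spatial latents, and the loss and optimizer settings are all assumptions made for the sketch.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    NUM_SPATIAL_LATENTS = 4  # hypothetical: e.g., x/y position plus two pose angles

    # Swap the usual N-way classification head for a small regression head,
    # so the same backbone is optimized to estimate spatial latents instead.
    model = resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, NUM_SPATIAL_LATENTS)

    criterion = nn.MSELoss()  # regression loss on the latent targets
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def train_step(images: torch.Tensor, latents: torch.Tensor) -> float:
        """One step: images -> predicted latents -> MSE loss -> update."""
        optimizer.zero_grad()
        preds = model(images)             # (batch, NUM_SPATIAL_LATENTS)
        loss = criterion(preds, latents)  # targets come from the 3D graphics engine
        loss.backward()
        optimizer.step()
        return loss.item()

The categorization counterpart replaces the regression head with a cross-entropy head over category labels, which is what makes the representation comparison in the abstract well posed: the two models differ only in the latents they are trained to estimate.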
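Comparing the internal representations of two trained models, as the abstract does for early and middle layers, requires a layer-wise similarity measure. Linear centered kernel alignment (CKA) is one standard choice; the abstract does not state which measure the authors used, so this sketch is only an assumed illustration.

    import numpy as np

    def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
        """Linear CKA between activation matrices of shape (stimuli, features)."""
        X = X - X.mean(axis=0, keepdims=True)  # center each feature
        Y = Y - Y.mean(axis=0, keepdims=True)
        cross = np.linalg.norm(Y.T @ X, "fro") ** 2  # cross-model similarity
        self_x = np.linalg.norm(X.T @ X, "fro")      # within-model norms
        self_y = np.linalg.norm(Y.T @ Y, "fro")
        return cross / (self_x * self_y)

    # Hypothetical usage: activations from matched layers of the two models
    # on the same stimuli (placeholder random data here).
    acts_spatial = np.random.randn(500, 256)
    acts_category = np.random.randn(500, 256)
    print(linear_cka(acts_spatial, acts_category))  # 1.0 = identical up to rotation/scale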