
State-space models can learn in-context by gradient descent

Deep state-space models (Deep SSMs) have shown capabilities for in-context learning on autoregressive tasks, similar to transformers. However, the architectural requirements and mechanisms enabling this in recurrent networks remain unclear. This study demonstrates that state-space model architectures can perform gradient-based learning and use it for in-context learning. We prove that a single structured state-space model layer, augmented with local self-attention, can reproduce the outputs of an implicit linear model with least squares loss after one step of gradient descent. Our key insight is that the diagonal linear recurrent layer can act as a gradient accumulator, which can be 'applied' to the parameters of the implicit regression model. We validate our construction by training randomly initialized augmented SSMs on simple linear regression tasks. The empirically optimized parameters match the theoretical ones, obtained analytically from the implicit model construction. Extensions to multi-step linear and non-linear regression yield consistent results. The constructed SSM encompasses features of modern deep state-space models, with the potential for scalable training and effectiveness even in general tasks. The theoretical construction elucidates the role of local self-attention and multiplicative interactions in recurrent architectures as the key ingredients for enabling the expressive power typical of foundation models.
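
The gradient-accumulator claim in the abstract can be illustrated with a minimal NumPy sketch. This is our own toy illustration, not the authors' construction: the function names, shapes, learning rate, and the choice of an identity diagonal state transition are assumptions. It only shows that a diagonal linear recurrence fed with per-token multiplicative products y_i x_i^T reproduces the prediction of an implicit linear model after one gradient-descent step from zero initialization on the least-squares loss.

```python
import numpy as np

def one_step_gd_prediction(xs, ys, x_query, lr=0.1):
    # Implicit linear model y = W x with least-squares loss, one gradient
    # step from W = 0: W_1 = lr * sum_i y_i x_i^T, and the in-context
    # prediction for the query token is W_1 @ x_query.
    W1 = lr * sum(np.outer(y, x) for x, y in zip(xs, ys))
    return W1 @ x_query

def diagonal_ssm_accumulator(xs, ys, x_query, lr=0.1):
    # Toy "gradient accumulator" view: a diagonal linear recurrence with an
    # identity state transition sums the per-token products y_i x_i^T (the
    # negative gradient of the implicit loss at W = 0); the final state is
    # then applied multiplicatively to the query token.
    d_out, d_in = ys[0].shape[0], xs[0].shape[0]
    h = np.zeros(d_out * d_in)            # flattened gradient-accumulator state
    A = np.ones_like(h)                   # diagonal recurrence weights (identity here)
    for x, y in zip(xs, ys):
        u = np.outer(y, x).ravel()        # multiplicative (x, y) interaction per token
        h = A * h + u                     # diagonal linear recurrence update
    W1 = lr * h.reshape(d_out, d_in)      # read the state out as the implicit weights
    return W1 @ x_query

# The two computations agree, illustrating the accumulator reading of the recurrence.
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(8)]   # in-context inputs
ys = [rng.normal(size=2) for _ in range(8)]   # in-context targets
x_query = rng.normal(size=3)
assert np.allclose(one_step_gd_prediction(xs, ys, x_query),
                   diagonal_ssm_accumulator(xs, ys, x_query))
```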

Bibliographic Details
Published in: arXiv.org, 2024-10
Main Authors: Sushma, Neeraj Mohan; Tian, Yudou; Mestha, Harshvardhan; Colombo, Nicolo; Kappel, David; Subramoney, Anand
Format: Article
Language: English
Subjects: Attention; Context; Learning; Parameters; Regression analysis; Regression models; State space models
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Source: Publicly Available Content (ProQuest)
Online Access: Get full text