Scalable iterative pruning of large language and vision models using block coordinate descent

Pruning neural networks, which involves removing a fraction of their weights, can often maintain high accuracy while significantly reducing model complexity, at least up to a certain limit. We present a neural network pruning technique that builds upon the Combinatorial Brain Surgeon, but solves an optimization problem over a subset of the network weights in an iterative, block-wise manner using block coordinate descent. The iterative, block-based nature of this pruning technique, which we dub "iterative Combinatorial Brain Surgeon" (iCBS), allows for scalability to very large models, including large language models (LLMs), that may not be feasible with a one-shot combinatorial optimization approach. When applied to large models like Mistral and DeiT, iCBS achieves higher performance metrics at the same density levels compared to existing pruning methods such as Wanda. This demonstrates the effectiveness of this iterative, block-wise pruning method in compressing and optimizing the performance of large deep learning models, even while optimizing over only a small fraction of the weights. Moreover, our approach allows for a quality-time (or cost) tradeoff that is not available when using a one-shot pruning technique alone. The block-wise formulation of the optimization problem enables the use of hardware accelerators, potentially offsetting the increased computational costs compared to one-shot pruning methods like Wanda. In particular, the optimization problem solved for each block is quantum-amenable in that it could, in principle, be solved by a quantum computer.
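
The core loop the abstract describes, re-solving a small combinatorial subproblem over one block of weights at a time while all other blocks stay frozen, can be illustrated with a short sketch. To be clear, this is not the authors' iCBS implementation: the OBS-style quadratic surrogate loss, the greedy per-block solver, and the parameter names (block_size, density, n_sweeps) are all illustrative assumptions; the paper formulates each block's subproblem as a quantum-amenable combinatorial optimization and leaves the solver pluggable.

```python
# Minimal sketch of block coordinate descent over a pruning mask, in the
# spirit of the iCBS idea summarized above. NOT the paper's implementation:
# the quadratic surrogate, the greedy per-block solver, and all parameter
# names below are illustrative assumptions.
import numpy as np

def surrogate_loss(w, mask, H):
    """Quadratic proxy for the loss increase caused by zeroing weights:
    dL ~= dw^T H dw, where dw is the change the mask induces."""
    dw = w * (1.0 - mask)
    return float(dw @ H @ dw)

def solve_block(w, mask, H, idx, k_keep):
    """Re-solve one block's keep/prune bits with all other blocks frozen.
    A greedy heuristic stands in for the combinatorial (QUBO-like) solver."""
    mask = mask.copy()
    mask[idx] = 1.0                       # temporarily restore the block
    block = list(np.arange(len(w))[idx])
    for _ in range(len(block) - k_keep):  # prune down to the target density
        best_j, best_cost = None, np.inf
        for j in block:
            if mask[j] == 0.0:
                continue
            mask[j] = 0.0                 # trial prune
            cost = surrogate_loss(w, mask, H)
            mask[j] = 1.0                 # undo trial
            if cost < best_cost:
                best_j, best_cost = j, cost
        mask[best_j] = 0.0                # commit the cheapest removal
    return mask

def block_cd_prune(w, H, block_size=8, density=0.5, n_sweeps=2):
    """Sweep over fixed-size blocks, re-optimizing each block's mask while
    the others are held fixed; repeat for a few sweeps."""
    mask = np.ones_like(w)
    k_keep = int(round(block_size * density))
    for _ in range(n_sweeps):
        for start in range(0, len(w), block_size):
            idx = slice(start, min(start + block_size, len(w)))
            mask = solve_block(w, mask, H, idx, min(k_keep, mask[idx].size))
    return w * mask, mask

# Toy usage: prune a random 32-weight "layer" to 50% density.
rng = np.random.default_rng(0)
w = rng.normal(size=32)
X = rng.normal(size=(64, 32))
H = X.T @ X / 64.0                        # empirical curvature estimate
w_pruned, mask = block_cd_prune(w, H)
print(f"kept {int(mask.sum())}/{mask.size} weights")
```

Because each block's subproblem is a small binary decision over keep/prune bits coupled through the curvature matrix, it maps naturally onto the quadratic binary formulations that quantum annealers and other hardware accelerators target, which is the "quantum-amenable" property the abstract mentions; the quality-time tradeoff arises from how many blocks and sweeps one chooses to spend.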

Bibliographic Details
Published in: arXiv.org, 2024-11
Main Authors: Rosenberg, Gili; Brubaker, J Kyle; Schuetz, Martin J A; Zhu, Elton Yechao; Kadıoğlu, Serdar; Borujeni, Sima E; Katzgraber, Helmut G
Format: Article
Language: English
Subjects: Brain; Combinatorial analysis; Large language models; Machine learning; Neural networks; Optimization; Performance measurement; Pruning; Quantum computers; Surgeons
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Online Access: Get full text