Marconi: Prefix Caching for the Era of Hybrid LLMs
Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4× higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.
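The abstract describes Marconi's core cache-management idea: rank prefix-cache entries not only by recency but also by the compute a hit would save relative to the memory the entry occupies. The snippet below is a minimal illustrative sketch of that kind of utility-based eviction; the names (`CacheEntry`, `flop_savings`, `state_bytes`, `alpha`) and the scoring formula are assumptions for illustration, not Marconi's actual policy or code.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    """A cached prefix state (e.g., recurrent/SSM state plus KV blocks)."""
    flop_savings: float                  # compute skipped on a hit (proxy for TTFT reduction)
    state_bytes: int                     # memory footprint of the cached states
    last_hit: float = field(default_factory=time.monotonic)


def utility(entry: CacheEntry, alpha: float = 0.5) -> float:
    """Blend recency with compute-savings-per-byte; higher means more worth keeping.

    Illustrative heuristic only, not the paper's exact scoring function.
    """
    recency = 1.0 / (1.0 + time.monotonic() - entry.last_hit)
    efficiency = entry.flop_savings / max(entry.state_bytes, 1)
    return alpha * recency + (1.0 - alpha) * efficiency


def evict_one(cache: dict[str, CacheEntry]) -> str | None:
    """Evict the lowest-utility entry instead of simply the least recently used one."""
    if not cache:
        return None
    victim = min(cache, key=lambda key: utility(cache[key]))
    del cache[victim]
    return victim
```

The paper additionally folds in forecasts of reuse likelihood across a taxonomy of hit scenarios, which this toy score omits.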
Published in: | arXiv.org 2024-12 |
---|---|
Main Authors: | Pan, Rui; Wang, Zhuang; Jia, Zhen; Karakus, Can; Zancato, Luca; Dao, Tri; Wang, Yida; Netravali, Ravi |
Format: | Article |
Language: | English |
Subjects: | Caching; Large language models; State space models; Taxonomy |
Online Access: | Get full text |
cited_by | |
---|---|
cites | |
container_end_page | |
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Pan, Rui; Wang, Zhuang; Jia, Zhen; Karakus, Can; Zancato, Luca; Dao, Tri; Wang, Yida; Netravali, Ravi |
description | Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4× higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3134987133 |
source | Publicly Available Content Database |
subjects | Caching; Large language models; State space models; Taxonomy |
title | Marconi: Prefix Caching for the Era of Hybrid LLMs |