
Marconi: Prefix Caching for the Era of Hybrid LLMs

Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4× higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.
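The abstract only sketches the idea of weighing compute savings against memory footprint when admitting and evicting cache entries; the paper's actual policies are not reproduced here. As a rough, hypothetical illustration of that general flavor, the Python sketch below scores cached prefix states by estimated FLOPs saved per byte of state, blended with recency, and evicts the lowest-scoring entries first. All names (CacheEntry, eviction_score, flops_per_token, the weighting scheme) are assumptions for illustration, not Marconi's API or formulas.

```python
from dataclasses import dataclass, field
import time


@dataclass
class CacheEntry:
    """Hypothetical cached prefix state for a Hybrid LLM (recurrent state + KV blocks)."""
    prefix_len: int      # number of tokens this entry lets a future request skip
    state_bytes: int     # memory footprint of the stored state
    last_hit: float = field(default_factory=time.monotonic)

    def flop_savings(self, flops_per_token: float) -> float:
        # A hit avoids recomputing the whole prefix, so savings scale with its length.
        return self.prefix_len * flops_per_token


def eviction_score(entry: CacheEntry, flops_per_token: float,
                   recency_weight: float = 0.5) -> float:
    """Lower score -> evict first. Blends compute savings per byte with recency.
    In practice the two terms would need normalization onto comparable scales;
    this sketch ignores that for brevity."""
    savings_per_byte = entry.flop_savings(flops_per_token) / max(entry.state_bytes, 1)
    age = time.monotonic() - entry.last_hit
    recency = 1.0 / (1.0 + age)  # decays toward 0 for stale entries
    return (1.0 - recency_weight) * savings_per_byte + recency_weight * recency


def evict_until_fits(entries: list[CacheEntry], needed_bytes: int, budget_bytes: int,
                     flops_per_token: float) -> list[CacheEntry]:
    """Drop the least valuable entries until a new state of needed_bytes fits."""
    kept = sorted(entries, key=lambda e: eviction_score(e, flops_per_token))
    used = sum(e.state_bytes for e in kept)
    while kept and used + needed_bytes > budget_bytes:
        victim = kept.pop(0)  # smallest score = least compute saved per byte, stalest
        used -= victim.state_bytes
    return kept
```

The point of the sketch is the contrast with plain LRU: because recurrent-state entries are large and only hit on exact prefix matches, a policy that ranks them purely by recency can keep many bulky entries that will never be reused, whereas a savings-per-byte term penalizes exactly those entries.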

Bibliographic Details
Published in: arXiv.org, 2024-12
Main Authors: Pan, Rui; Wang, Zhuang; Jia, Zhen; Karakus, Can; Zancato, Luca; Dao, Tri; Wang, Yida; Netravali, Ravi
Format: Article
Language: English
Subjects: Caching; Large language models; State space models; Taxonomy
Online Access: Get full text
Identifier: EISSN 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Source: Publicly Available Content Database