Marconi: Prefix Caching for the Era of Hybrid LLMs
Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4× higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.
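The abstract describes Marconi's core cache-management idea: rank prefix-cache entries not only by recency but also by the compute a hit would save relative to the memory the entry occupies. The snippet below is a minimal illustrative sketch of that kind of utility-based eviction; the names (`CacheEntry`, `flop_savings`, `state_bytes`, `alpha`) and the scoring formula are assumptions for illustration, not Marconi's actual policy or code.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    """A cached prefix state (e.g., recurrent/SSM state plus KV blocks)."""
    flop_savings: float                  # compute skipped on a hit (proxy for TTFT reduction)
    state_bytes: int                     # memory footprint of the cached states
    last_hit: float = field(default_factory=time.monotonic)


def utility(entry: CacheEntry, alpha: float = 0.5) -> float:
    """Blend recency with compute-savings-per-byte; higher means more worth keeping.

    Illustrative heuristic only, not the paper's exact scoring function.
    """
    recency = 1.0 / (1.0 + time.monotonic() - entry.last_hit)
    efficiency = entry.flop_savings / max(entry.state_bytes, 1)
    return alpha * recency + (1.0 - alpha) * efficiency


def evict_one(cache: dict[str, CacheEntry]) -> str | None:
    """Evict the lowest-utility entry instead of simply the least recently used one."""
    if not cache:
        return None
    victim = min(cache, key=lambda key: utility(cache[key]))
    del cache[victim]
    return victim
```

The paper additionally folds in forecasts of reuse likelihood across a taxonomy of hit scenarios, which this toy score omits.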
Published in: | arXiv.org 2024-12 |
---|---|
Main Authors: | Pan, Rui; Wang, Zhuang; Jia, Zhen; Karakus, Can; Zancato, Luca; Dao, Tri; Wang, Yida; Netravali, Ravi |
Format: | Article |
Language: | English |
Subjects: | Caching; Large language models; State space models; Taxonomy |
Online Access: | Get full text |
cited_by | |
---|---|
cites | |
container_end_page | |
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Pan, Rui; Wang, Zhuang; Jia, Zhen; Karakus, Can; Zancato, Luca; Dao, Tri; Wang, Yida; Netravali, Ravi |
description | Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4× higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems. |
format | article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3134987133 |
source | Publicly Available Content Database |
subjects | Caching; Large language models; State space models; Taxonomy |
title | Marconi: Prefix Caching for the Era of Hybrid LLMs |