
Optimizing Cache Memory Usage Methods for Chat LLM-models in PaaS Installations

Recently, LLMs have become widespread in industry. They are used as the basis for voice assistants, troubleshooting systems, chatbots and much more. An LLM is built on the transformer neural-network architecture: text (for example, a text chat) is supplied as input, and the model, estimating the probability of each next character, returns its result also in the form of text. The model itself is not retrained during this process, while the chat context keeps accumulating. This paper proposes two ways to reduce the memory allocated for storing the text chat context. The first method is to periodically launch additional training in order to embed the chat context into the core of the model itself; the article discusses the pros and cons of this approach. The second method is to keep context in the text chat cache only for those users for whom that context has already been formed. The article describes the layout used for the experiment, presents the results of the experimental study, and describes a method for assessing the "maturity" of the chat correspondence context.
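As a rough illustration only (not the authors' implementation), the sketch below shows how the first method could be scheduled in a PaaS setting: a periodic job selects chats whose accumulated context has grown large enough, hands them to an additional-training step that embeds that context into the model core, and then frees their cached text. The finetune_fn hook, the min_messages threshold and the data layout are hypothetical.

    # Sketch of the first method: periodic additional training that "bakes"
    # long chat contexts into the model weights so their text cache can be freed.
    # finetune_fn and min_messages are illustrative assumptions, not from the paper.
    from typing import Callable, Dict, List

    def run_embedding_cycle(
        chat_contexts: Dict[str, List[str]],       # chat_id -> accumulated messages
        finetune_fn: Callable[[List[str]], None],  # hypothetical hook that updates the model
        min_messages: int = 50,
    ) -> Dict[str, List[str]]:
        """One periodic pass: fine-tune on long chats, then drop them from the cache."""
        training_samples: List[str] = []
        remaining_cache: Dict[str, List[str]] = {}

        for chat_id, messages in chat_contexts.items():
            if len(messages) >= min_messages:
                # Long enough to be worth embedding into the model core.
                training_samples.append("\n".join(messages))
            else:
                # Too little context: keep it cached as plain text for now.
                remaining_cache[chat_id] = messages

        if training_samples:
            finetune_fn(training_samples)  # e.g. a LoRA-style additional-training job

        return remaining_cache  # the new, smaller cache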

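The second method and the "maturity" assessment could look roughly like the following sketch, again under assumed details: a bounded cache stores accumulated context only for chats whose maturity score (here, a simple turn-count heuristic; the paper's actual metric may differ) exceeds a threshold, so memory is not spent on conversations whose context has not yet formed.

    # Sketch of the second method: cache chat context only for "mature" chats.
    # The maturity heuristic and the thresholds are illustrative assumptions.
    from collections import OrderedDict
    from typing import Optional

    class MaturityAwareChatCache:
        """Bounded LRU-style store that keeps context only for mature chats."""

        def __init__(self, max_entries: int = 1024, maturity_threshold: float = 0.7):
            self.max_entries = max_entries
            self.maturity_threshold = maturity_threshold
            self._store: "OrderedDict[str, str]" = OrderedDict()  # chat_id -> context

        @staticmethod
        def maturity(context: str, target_turns: int = 20) -> float:
            # Toy maturity estimate: how close the dialogue is to a target
            # number of turns, clipped to [0, 1].
            turns = context.count("\n") + 1
            return min(1.0, turns / target_turns)

        def put(self, chat_id: str, context: str) -> bool:
            """Store the context only if the chat is mature enough."""
            if self.maturity(context) < self.maturity_threshold:
                self._store.pop(chat_id, None)    # do not spend memory on it yet
                return False
            self._store[chat_id] = context
            self._store.move_to_end(chat_id)
            if len(self._store) > self.max_entries:
                self._store.popitem(last=False)   # evict the least recently used chat
            return True

        def get(self, chat_id: str) -> Optional[str]:
            context = self._store.get(chat_id)
            if context is not None:
                self._store.move_to_end(chat_id)
            return context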

Bibliographic Details
Main Authors: Rovnyagin, Mikhail M., Sinelnikov, Dmitry M., Eroshev, Artem A., Rovnyagina, Tatyana A., Tikhomirov, Alexander V.
Format: Conference Proceeding
Language: English
Published in: 2024 Conference of Young Researchers in Electrical and Electronic Engineering (ElCon), 2024, p. 277-280
Publisher: IEEE
DOI: 10.1109/ElCon61730.2024.10468250
EISSN: 2376-6565
EISBN: 9798350360639, 9798350360646
Subjects: ChatGPT; Context Embedding; Large Language Model; Layout; Memory Allocation; Personal voice assistants; Predictive models; Resource management; Training; Transformer cores; Transformer-type Neural Network; Transformers
Source: IEEE Xplore All Conference Series
Online Access: Request full text