Optimizing Cache Memory Usage Methods for Chat LLM-models in PaaS Installations
Main Authors: | Rovnyagin, Mikhail M.; Sinelnikov, Dmitry M.; Eroshev, Artem A.; Rovnyagina, Tatyana A.; Tikhomirov, Alexander V. |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | ChatGPT; Context Embedding; Large Language Model; Layout; Memory Allocation; Personal voice assistants; Predictive models; Resource management; Training; Transformer cores; Transformer-type Neural Network; Transformers |
Online Access: | Request full text |
container_end_page | 280 |
---|---|
container_start_page | 277 |
container_title | 2024 Conference of Young Researchers in Electrical and Electronic Engineering (ElCon) |
creator | Rovnyagin, Mikhail M.; Sinelnikov, Dmitry M.; Eroshev, Artem A.; Rovnyagina, Tatyana A.; Tikhomirov, Alexander V. |
description | Recently, LLM models have become widespread in industry. They are used as the basis for voice assistants, troubleshooting systems, chatbots and much more. LLMs are built on the transformer neural-network architecture: text (for example, a text chat) is supplied as input, and the model, estimating the character-by-character probability, returns its result also in the form of text. In this case the model is not retrained, while the chat context accumulates continuously. This paper proposes two ways to reduce the memory allocated for storing the text chat context. The first method is to periodically launch additional training in order to embed the chat context into the core of the model itself; the article discusses the pros and cons of this approach. The second method is to keep the chat context in the cache only for those users for whom the context has already been formed. The article describes the layout used for the experiment, presents the results of the experimental study, and describes the method for assessing the "maturity" of the chat correspondence context. (An illustrative sketch of the second method follows this record.) |
doi_str_mv | 10.1109/ElCon61730.2024.10468250 |
format | conference_proceeding |
identifier | EISSN: 2376-6565; EISBN: 9798350360639; EISBN: 9798350360646 |
ispartof | 2024 Conference of Young Researchers in Electrical and Electronic Engineering (ElCon), 2024, p.277-280 |
issn | 2376-6565 |
language | eng |
recordid | cdi_ieee_primary_10468250 |
source | IEEE Xplore All Conference Series |
subjects | ChatGPT; Context Embedding; Large Language Model; Layout; Memory Allocation; Personal voice assistants; Predictive models; Resource management; Training; Transformer cores; Transformer-type Neural Network; Transformers |
title | Optimizing Cache Memory Usage Methods for Chat LLM-models in PaaS Installations |
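The abstract's second method keeps chat context in cache memory only for users whose context has already been formed, gated by a "maturity" assessment. Below is a minimal, illustrative sketch of that idea in Python. The class and method names, the message-count maturity heuristic, and the 0.5 threshold are assumptions made for illustration only; the paper's actual maturity metric and experimental layout are described in the full text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChatContext:
    """Accumulated chat history for a single user."""
    messages: List[str] = field(default_factory=list)

    def maturity(self) -> float:
        """Hypothetical maturity score: the fraction of an illustrative
        target message count accumulated so far, capped at 1.0.
        (The paper's actual maturity metric is not reproduced here.)"""
        target_messages = 20  # illustrative constant, not from the paper
        return min(len(self.messages) / target_messages, 1.0)


class MaturityGatedCache:
    """Keeps chat context in fast cache memory only for users whose context
    is considered 'mature'; immature contexts live in a cheaper cold store
    (modelled here as a plain dict)."""

    def __init__(self, maturity_threshold: float = 0.5) -> None:
        self.maturity_threshold = maturity_threshold
        self.hot_cache: Dict[str, ChatContext] = {}   # scarce, fast memory
        self.cold_store: Dict[str, ChatContext] = {}  # cheap backing storage

    def append_message(self, user_id: str, message: str) -> None:
        # Find the user's context wherever it currently lives, or create one.
        ctx = (self.hot_cache.pop(user_id, None)
               or self.cold_store.pop(user_id, None)
               or ChatContext())
        ctx.messages.append(message)
        # Re-place the context according to its current maturity.
        if ctx.maturity() >= self.maturity_threshold:
            self.hot_cache[user_id] = ctx    # formed context: keep cached
        else:
            self.cold_store[user_id] = ctx   # immature context: spill out


if __name__ == "__main__":
    cache = MaturityGatedCache(maturity_threshold=0.5)
    for i in range(12):
        cache.append_message("user-42", f"message {i}")
    print("hot users: ", list(cache.hot_cache))   # -> ['user-42'] after 10+ messages
    print("cold users:", list(cache.cold_store))  # -> []
```

Running the snippet shows "user-42" moving from the cold store to the hot cache once ten messages have accumulated; in a real PaaS installation the cold store would be a cheaper tier such as disk or a remote key-value store rather than a Python dict.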