Optimizing Cache Memory Usage Methods for Chat LLM-models in PaaS Installations
Main Authors: | Rovnyagin, Mikhail M.; Sinelnikov, Dmitry M.; Eroshev, Artem A.; Rovnyagina, Tatyana A.; Tikhomirov, Alexander V. |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | ChatGPT; Context Embedding; Large Language Model; Layout; Memory Allocation; Personal voice assistants; Predictive models; Resource management; Training; Transformer cores; Transformer-type Neural Network; Transformers |
Online Access: | Request full text |
container_end_page | 280 |
---|---|
container_start_page | 277 |
container_title | 2024 Conference of Young Researchers in Electrical and Electronic Engineering (ElCon) |
creator | Rovnyagin, Mikhail M.; Sinelnikov, Dmitry M.; Eroshev, Artem A.; Rovnyagina, Tatyana A.; Tikhomirov, Alexander V. |
description | Recently, LLM models have become widespread in industry. They are used as the basis for voice assistants, troubleshooting systems, chatbots and much more. LLMs are built on the transformer neural-network architecture: text (for example, a text chat) is supplied as input, and the model, estimating the character-by-character probability, returns its result also in the form of text. In this case the model is not retrained, while the chat context accumulates continuously. This paper proposes two ways to reduce the memory allocated for storing the text chat context. The first method is to periodically launch additional training in order to embed the chat context into the core of the model itself; the article discusses the pros and cons of this approach. The second method is to keep the chat context in the cache only for those users for whom the context has already been formed. The article describes the layout used for the experiment, presents the results of the experimental study, and describes the method for assessing the "maturity" of the chat correspondence context. (An illustrative sketch of the second method follows this record.) |
doi_str_mv | 10.1109/ElCon61730.2024.10468250 |
format | conference_proceeding |
identifier | EISSN: 2376-6565; EISBN: 9798350360639; EISBN: 9798350360646 |
ispartof | 2024 Conference of Young Researchers in Electrical and Electronic Engineering (ElCon), 2024, p.277-280 |
issn | 2376-6565 |
language | eng |
recordid | cdi_ieee_primary_10468250 |
source | IEEE Xplore All Conference Series |
subjects | ChatGPT; Context Embedding; Large Language Model; Layout; Memory Allocation; Personal voice assistants; Predictive models; Resource management; Training; Transformer cores; Transformer-type Neural Network; Transformers |
title | Optimizing Cache Memory Usage Methods for Chat LLM-models in PaaS Installations |
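The abstract's second method keeps chat context in cache memory only for users whose context has already been formed, gated by a "maturity" assessment. Below is a minimal, illustrative sketch of that idea in Python. The class and method names, the message-count maturity heuristic, and the 0.5 threshold are assumptions made for illustration only; the paper's actual maturity metric and experimental layout are described in the full text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChatContext:
    """Accumulated chat history for a single user."""
    messages: List[str] = field(default_factory=list)

    def maturity(self) -> float:
        """Hypothetical maturity score: the fraction of an illustrative
        target message count accumulated so far, capped at 1.0.
        (The paper's actual maturity metric is not reproduced here.)"""
        target_messages = 20  # illustrative constant, not from the paper
        return min(len(self.messages) / target_messages, 1.0)


class MaturityGatedCache:
    """Keeps chat context in fast cache memory only for users whose context
    is considered 'mature'; immature contexts live in a cheaper cold store
    (modelled here as a plain dict)."""

    def __init__(self, maturity_threshold: float = 0.5) -> None:
        self.maturity_threshold = maturity_threshold
        self.hot_cache: Dict[str, ChatContext] = {}   # scarce, fast memory
        self.cold_store: Dict[str, ChatContext] = {}  # cheap backing storage

    def append_message(self, user_id: str, message: str) -> None:
        # Find the user's context wherever it currently lives, or create one.
        ctx = (self.hot_cache.pop(user_id, None)
               or self.cold_store.pop(user_id, None)
               or ChatContext())
        ctx.messages.append(message)
        # Re-place the context according to its current maturity.
        if ctx.maturity() >= self.maturity_threshold:
            self.hot_cache[user_id] = ctx    # formed context: keep cached
        else:
            self.cold_store[user_id] = ctx   # immature context: spill out


if __name__ == "__main__":
    cache = MaturityGatedCache(maturity_threshold=0.5)
    for i in range(12):
        cache.append_message("user-42", f"message {i}")
    print("hot users: ", list(cache.hot_cache))   # -> ['user-42'] after 10+ messages
    print("cold users:", list(cache.cold_store))  # -> []
```

Running the snippet shows "user-42" moving from the cold store to the hot cache once ten messages have accumulated; in a real PaaS installation the cold store would be a cheaper tier such as disk or a remote key-value store rather than a Python dict.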