
Optimizing Cache Memory Usage Methods for Chat LLM-models in PaaS Installations

Bibliographic Details
Main Authors: Rovnyagin, Mikhail M., Sinelnikov, Dmitry M., Eroshev, Artem A., Rovnyagina, Tatyana A., Tikhomirov, Alexander V.
Format: Conference Proceeding
Language: English
Description
Summary: Recently, LLMs have become widespread in industry. They are used as the basis for voice assistants, troubleshooting systems, chatbots and much more. LLMs are built on the transformer neural network architecture: text (for example, a chat message) is supplied as input, and the model, estimating the character-by-character probability, returns a result that is also in the form of text. In this setting the model is not retrained, while the chat context accumulates continuously. This paper proposes two ways to reduce the memory allocated for storing the text chat context. The first method is to periodically launch additional training in order to embed the chat context into the core of the model itself; the article discusses the pros and cons of this approach. The second method is to keep the text chat context in the cache only for those users whose context has already been formed. The article describes the experimental setup, presents the results of the experimental study, and describes a method for assessing the "maturity" of the chat correspondence context.
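The second method (caching the context only once it is "mature") can be illustrated with a minimal sketch. The paper does not publish code, so everything below is an assumption: ContextCache, maturity_score, MIN_TURNS and MAX_ENTRIES are hypothetical names, and the number of accumulated messages stands in for whatever maturity measure the article actually uses.

# Hypothetical sketch: keep chat context in the cache only for "mature" conversations.
# MIN_TURNS, MAX_ENTRIES, ContextCache and maturity_score are illustrative names,
# not taken from the paper; the maturity criterion here is simply the number of turns.
from collections import OrderedDict
from typing import List, Optional

MIN_TURNS = 8          # assumed threshold: a context with fewer turns is not cached
MAX_ENTRIES = 10_000   # assumed cache capacity, with LRU eviction

class ContextCache:
    def __init__(self) -> None:
        self._store: "OrderedDict[str, List[str]]" = OrderedDict()

    @staticmethod
    def maturity_score(messages: List[str]) -> int:
        # Simplest proxy for context "maturity": how many messages have accumulated.
        return len(messages)

    def put(self, user_id: str, messages: List[str]) -> None:
        # Persist the context only once it is considered mature; immature
        # contexts are simply not stored, saving cache memory.
        if self.maturity_score(messages) < MIN_TURNS:
            return
        self._store[user_id] = messages
        self._store.move_to_end(user_id)
        if len(self._store) > MAX_ENTRIES:
            self._store.popitem(last=False)  # evict the least recently used entry

    def get(self, user_id: str) -> Optional[List[str]]:
        messages = self._store.get(user_id)
        if messages is not None:
            self._store.move_to_end(user_id)
        return messages

Under such a policy, only users whose conversation has already formed a stable context occupy cache memory; for everyone else the context is rebuilt on demand.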
ISSN: 2376-6565
DOI: 10.1109/ElCon61730.2024.10468250