
Implementation of language models within an infrastructure designed for Natural Language Processing

Bibliographic Details
Published in: International Journal of Electronics and Telecommunications, 2024-03, Vol. 70, No. 1
Main Authors: Bartosz Walkowiak, Tomasz Walkowiak
Format: Article
Language: English
Description
Summary: This paper explores cost-effective alternatives for resource-constrained environments in the context of language models by investigating methods such as quantization and CPU-based model implementations. The study addresses the computational efficiency of language models during inference and the development of infrastructure for text document processing. The paper discusses related technologies, the CLARIN-PL infrastructure architecture, and implementations of small and large language models, with emphasis on model formats, data precision, and runtime environments (GPU and CPU); optimal solutions are identified through extensive experimentation. In addition, the paper advocates a more comprehensive approach to performance evaluation: instead of reporting only average token throughput, it suggests considering the shape of the throughput curve, which can range from constant to monotonically increasing or decreasing. Evaluating token throughput at various points on this curve, especially for different output token counts, provides a more informative perspective.
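To illustrate the evaluation approach described in the summary, the following is a minimal sketch (not the authors' benchmark code) that times generation at several output token counts with the Hugging Face transformers library; the model id "gpt2", the prompt, and the chosen output lengths are placeholder assumptions, and quantization is omitted for brevity.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # placeholder model id; any causal LM could be substituted

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)  # CPU, default fp32 weights
model.eval()

prompt = "Language model inference on a CPU"
inputs = tokenizer(prompt, return_tensors="pt")

# Measure throughput at several output lengths rather than one overall average,
# so the shape of the throughput curve becomes visible.
for max_new in (32, 128, 512):
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new,
            min_new_tokens=max_new,           # force exactly max_new generated tokens
            do_sample=False,                  # greedy decoding for repeatability
            pad_token_id=tokenizer.eos_token_id,
        )
    elapsed = time.perf_counter() - start
    generated = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{generated:4d} output tokens: {generated / elapsed:6.1f} tokens/s")

Reporting the per-length figures rather than a single average makes it apparent whether throughput stays constant, rises, or falls as the number of generated tokens grows.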
ISSN: 2081-8491; 2300-1933
DOI: 10.24425/ijet.2024.149525