openEuler Support for LLMs
Introduction to LLMs
This blog describes how to install and use a Large Language Model (LLM) on openEuler. LLMs are artificial intelligence models designed to understand and generate human language. They are trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, and sentiment analysis.
This article is intended for community developers, open source enthusiasts, and partners who use the openEuler OS and want to learn about and use LLMs. Readers must have basic knowledge of the Linux OS.
chatglm-cpp Usage Guide
Introduction
chatglm-cpp is a C/C++ implementation of the ChatGLM model interface. It allows users to deploy and run open source models on CPU-only machines.
It supports the deployment of multiple open source Chinese LLMs, such as ChatGLM-6B, ChatGLM2-6B, and Baichuan-13B.
Software Architecture
The chatglm-cpp core architecture consists of two layers:
- Quantization layer: quantizes open source models to reduce the model size (see the sketch after this list).
- Startup layer: starts a quantized model.
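To give an intuition for what the quantization layer does, the following Python sketch shows block-wise 4-bit quantization in a simplified, conceptual form. The block size, scale formula, and function names are illustrative only; the actual GGML q4_0/q5_1 formats used by chatglm-cpp pack the integers and per-block scales differently.
import numpy as np
def quantize_q4_block(block):
    # One float scale per block plus small integers is why the quantized
    # model file shrinks to roughly a quarter of the fp16 size.
    max_abs = np.abs(block).max()
    scale = max_abs / 7.0 if max_abs > 0 else 1.0   # signed 4-bit range is about -8..7
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return scale, q
def dequantize_q4_block(scale, q):
    # Approximate float weights are recovered at inference time.
    return q.astype(np.float32) * scale
weights = np.random.randn(32).astype(np.float32)     # one block of 32 weights
scale, q = quantize_q4_block(weights)
approx = dequantize_q4_block(scale, q)
print("max quantization error:", np.abs(weights - approx).max())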
Features:
- Implemented in C/C++ based on GGML.
- Accelerated CPU inference with int4/int8 quantization, an optimized KV cache, and parallel computing.
- Streaming generation with a typewriter effect.
- Runs entirely on CPUs; no GPU is required.
Installation Guide
Software and Hardware Requirements
Processor architecture: AArch64 and x86_64
OS: openEuler 23.09
Memory: at least 4 GB; the actual requirement depends on the size of the open source model.
Installing Components
To deploy an LLM with chatglm-cpp, first install the chatglm-cpp software package. Ensure that the openEuler yum repository has been configured before the installation.
- Install the package.
yum install chatglm-cpp
- Check whether the package is installed.
chatglm_cpp_main -h
If the help information is displayed, the installation is successful.
Instructions
No Containers
- Install the chatglm-cpp software package.
yum install chatglm-cpp
- Download an open source LLM, such as ChatGLM-6B or ChatGLM2-6B, and use chatglm_convert.py to quantize the downloaded model (an end-to-end example follows this list).
python3 /usr/bin/chatglm_convert.py -i model_path/ -t q4_0 -o chatglm-ggml_1.bin
In this command, model_path is the directory containing the downloaded open source LLM, q4_0 is the quantization precision, and chatglm-ggml_1.bin is the name of the output model file.
- Start the model and begin a dialog.
chatglm_cpp_main -m model_path -i
model_path indicates the path for storing the quantized model.
Run the following command to view the available command-line options:
chatglm_cpp_main -h
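For reference, the commands below show a complete, hypothetical walk-through. They assume the ChatGLM2-6B weights are downloaded from Hugging Face (for example, the THUDM/chatglm2-6b repository, which requires Git LFS for the weight files) into a local chatglm2-6b/ directory; the directory name is illustrative only.
git clone https://huggingface.co/THUDM/chatglm2-6b
python3 /usr/bin/chatglm_convert.py -i chatglm2-6b/ -t q4_0 -o chatglm-ggml_1.bin
chatglm_cpp_main -m chatglm-ggml_1.bin -i
The q4_0 precision used here is the one that gives the smallest model size and the fastest inference among the precisions compared in Table 1 below.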
Using a Container
- Pull a container image.
docker pull hub.oepkgs.net/openeuler/chatglm_image
- Run the container image and start a dialog.
docker run -it --security-opt seccomp=unconfined hub.oepkgs.net/openeuler/chatglm_image
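If you want the container to use a model that you quantized yourself, you can mount a host directory into the container and point chatglm_cpp_main at it. The command below is a sketch only: the host path is hypothetical, and it assumes that the chatglm_cpp_main binary is available inside the image and that the image does not define a conflicting entrypoint.
docker run -it --security-opt seccomp=unconfined -v /path/to/models:/models hub.oepkgs.net/openeuler/chatglm_image chatglm_cpp_main -m /models/chatglm-ggml_1.bin -i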
Normal Startup Page
Figure 1 shows the interface after the ChatGLM2-6B model is started.
Figure 1 ChatGLM2-6B model startup interface
Specifications
This project allows you to deploy and run LLM inference on CPU-only machines. However, inference speed depends on the hardware: on low-end configurations, inference may be slow, which lowers usage efficiency.
Table 1 provides reference inference speeds for different quantization precisions.
In the table, Q4_0, Q4_1, Q5_0, and Q5_1 denote the quantization precision, and ms/token is the time in milliseconds needed to infer one token, which is the measure of inference speed. A smaller ms/token value means faster inference.
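For example, an inference speed of 74 ms/token corresponds to roughly 1000 / 74 ≈ 13.5 tokens generated per second.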
Table 1 Test data of the ChatGLM-6B model inference speed
ChatGLM-6B | Q4_0 | Q4_1 | Q5_0 | Q5_1 |
---|---|---|---|---|
ms/token (CPU @ Platinum 8260) | 74 | 77 | 86 | 89 |
Model size | 3.3 GB | 3.7 GB | 4.0 GB | 4.4 GB |
Memory usage | 4.0 GB | 4.4 GB | 4.7 GB | 5.1 GB |
llama.cpp Usage Guide
Introduction
llama.cpp is a C/C++ implementation of the LLaMA model interface. It allows users to deploy and run open source models on CPU-only machines.
llama.cpp supports the deployment of multiple open source English LLMs, such as LLaMA, LLaMA2, and Vicuna.
Software Architecture
The llama.cpp core architecture consists of two layers:
- Quantization layer: quantizes open source models to reduce the model size.
- Startup layer: starts a quantized model.
Features:
- Implemented in C/C++ based on GGML.
- Accelerated CPU inference with int4/int8 quantization, an optimized KV cache, and parallel computing.
- Streaming generation with a typewriter effect (see the sketch after this list).
- Runs entirely on CPUs; no GPU is required.
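The typewriter effect means that tokens are printed as soon as they are generated instead of after the full reply is ready. The minimal Python sketch below illustrates the idea only; it is not llama.cpp code, and generate_tokens() is a hypothetical stand-in for the real model.
import sys
import time
def generate_tokens():
    # Hypothetical stand-in for the model: yields tokens one at a time.
    for token in ["Hello", ",", " this", " is", " streaming", " output", ".", "\n"]:
        time.sleep(0.1)          # simulate per-token inference latency
        yield token
for token in generate_tokens():
    sys.stdout.write(token)      # print each token immediately...
    sys.stdout.flush()           # ...instead of waiting for the whole response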
Installation Guide
Software and Hardware Requirements
Processor architecture: AArch64 and x86_64
OS: openEuler 23.09
Memory: at least 4 GB; the actual requirement depends on the size of the open source model.
Installing Components
To deploy an LLM with llama.cpp, first install the llama.cpp software package. Ensure that the openEuler yum repository has been configured before the installation.
- Install the package.
yum install llama.cpp
- Check whether the package is installed.
llama_cpp_main -h
If the help information is displayed, the installation is successful.
Instructions
No Containers
- Install the llama.cpp software package.
yum install llama.cpp
- Download an open source LLM, such as LLaMA or LLaMA2, and use llama_convert.py to quantize the downloaded model.
python3 /usr/bin/llama_convert.py model_path/
model_path indicates the path for storing the open source LLM.
- Start the model and begin a dialog.
llama_cpp_main -m model_path --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8
model_path indicates the path for storing the quantized model. (The --temp and --top_k sampling options used in the command above are illustrated after this list.)
Run the following command to view the available command-line options:
llama_cpp_main -h
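The --temp and --top_k options control how the next token is chosen from the model's output. The sketch below is a generic illustration of temperature plus top-k sampling, not llama.cpp's actual sampler; fake_logits stands in for the scores the model assigns to each vocabulary token.
import numpy as np
def sample_next_token(logits, temperature=0.2, top_k=10000):
    # Keep only the top_k highest-scoring tokens.
    k = min(top_k, logits.size)
    top_indices = np.argpartition(logits, -k)[-k:]
    filtered = logits[top_indices]
    # Lower temperature sharpens the distribution, making output more deterministic.
    probs = np.exp((filtered - filtered.max()) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(top_indices, p=probs))
fake_logits = np.random.randn(32000)   # hypothetical scores over a 32,000-token vocabulary
print("sampled token id:", sample_next_token(fake_logits))
In general, lowering the temperature or top-k value makes the output more focused and repeatable, while higher values make it more varied.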
Using a Container
- Pull a container image.
docker pull hub.oepkgs.net/openeuler/llama_image
- Run the container image and start a dialog.
docker run -it --security-opt seccomp=unconfined hub.oepkgs.net/openeuler/llama_image
Normal Startup Page
Figure 2 shows the interface after the LLaMA2-7B model is started.
Figure 2 LLaMA2-7B model startup interface
Specifications
This project allows you to deploy and run LLM inference on CPU-only machines. However, inference speed depends on the hardware: on low-end configurations, inference may be slow, which lowers usage efficiency.
Table 2 provides reference inference speeds for different quantization precisions.
In the table, Q4_0, Q4_1, Q5_0, and Q5_1 denote the quantization precision, and ms/token is the time in milliseconds needed to infer one token, which is the measure of inference speed. A smaller ms/token value means faster inference.
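For example, generating a 200-token reply at 55 ms/token takes about 200 × 55 = 11,000 ms, or roughly 11 seconds.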
Table 2 Test data of the LLaMA-7B model inference speed
LLaMA-7B | Q4_0 | Q4_1 | Q5_0 | Q5_1 |
---|---|---|---|---|
ms/token (CPU @ Platinum 8260) | 55 | 54 | 76 | 83 |
Model size | 3.5 GB | 3.9 GB | 4.3 GB | 6.7 GB |
Memory usage | 3.9 GB | 4.2 GB | 4.5 GB | 5.0 GB |