_
 _   _ _ __ (_)_   _ _ __ ___
| | | | '_ \| | | | | '_ ` _ \
| |_| | | | | | |_| | | | | | |
 \__,_|_| |_|_|\__,_|_| |_| |_|
            projects

LLMpp

Source Code (Codeberg)

LLMpp is a lightweight LLM inference engine built entirely from scratch in C++. It has absolutely no AI/ML dependencies (no PyTorch, no GGML, no external tensor libraries, etc.). It features a custom tensor library with AVX2 SIMD magic, int8 quantization, a custom BPE tokenizer, and an interactive chat interface.

Stuff it has

Tensor Library: Built from the ground up with a custom strided buffer engine.
AVX2 SIMD: Custom matvec multiplication that often beats Intel MKL. (NOTE: This was tested on AMD, where MKL is a petty bastard and a good bit slower. God damn you, Intel!)
Int8 Weight Quantization (--q8): Compresses models for lower memory usage.
BPE Tokenizer: Loads native HuggingFace tokenizer.json files directly.
Multi-Threaded Inference: Uses a custom thread pool for parallel processing.
Interactive Chat Interface: Supports ChatML prompt formatting (<|im_start|> / <|im_end|>).
Sampling Capabilities: Full support for temperature, top-k, top-p, and repetition penalty.
Safetensors Support: Loading of standard safetensors model weights.
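
To give a rough idea of what int8 weight quantization means, here is a minimal sketch of symmetric per-row absmax quantization. This is not LLMpp's actual code: the `QuantizedRow` struct and the `quantize_row` / `dot_q8` functions are hypothetical names made up for illustration, and the real `--q8` scheme may group or scale weights differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one row of int8 weights plus a single float scale.
struct QuantizedRow {
    std::vector<std::int8_t> values;
    float scale = 1.0f;  // dequantized weight = values[i] * scale
};

// Symmetric absmax quantization: map [-max|w|, +max|w|] onto [-127, 127].
QuantizedRow quantize_row(const std::vector<float>& weights) {
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));

    QuantizedRow row;
    row.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    row.values.reserve(weights.size());
    for (float w : weights) {
        long q = std::lround(w / row.scale);  // round to nearest integer
        q = std::clamp(q, -127L, 127L);       // guard against rounding overshoot
        row.values.push_back(static_cast<std::int8_t>(q));
    }
    return row;
}

// Dot product of a quantized row with a float activation vector.
// The scale is applied once at the end instead of per element.
float dot_q8(const QuantizedRow& row, const std::vector<float>& x) {
    assert(row.values.size() == x.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc += static_cast<float>(row.values[i]) * x[i];
    }
    return acc * row.scale;
}
```

Each weight shrinks from 4 bytes to 1 (plus one float per row), which is where the memory savings come from. The cost is a small rounding error per weight.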

Performance & Benchmarks

Matrix Math (MatVec)

The SIMD matvec operation for a 4096x4096 matrix averages 1.8-1.9ms on dual-channel DDR4-3200 memory. This is very close to the memory bandwidth floor of 1.6ms (64MB / 40GB/s). Thing is, CPU inference is very memory bound, and this implementation gets extremely close to the limit. There is still some room to make it faster, but, well, it's good enough™ for now...
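
The floor figure is just "bytes that must be read, divided by how fast memory can deliver them". A small sketch of that arithmetic, assuming float32 weights (4 bytes each, which matches the 64MB figure) and the 40GB/s sustained bandwidth quoted above; the function names are made up for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Lower bound, in milliseconds, on the time to stream `bytes` from memory once
// at a sustained bandwidth of `bytes_per_second`.
double bandwidth_floor_ms(double bytes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return bytes / bytes_per_second * 1e3;
}

// Size in bytes of a dense rows x cols float32 matrix.
double matrix_bytes_f32(std::size_t rows, std::size_t cols) {
    return static_cast<double>(rows) * static_cast<double>(cols) * sizeof(float);
}
```

`bandwidth_floor_ms(64e6, 40e9)` gives the 1.6ms quoted above. Counting the matrix as 4096 * 4096 * 4 = 67,108,864 bytes instead of a round 64MB pushes the floor to about 1.68ms, so the measured 1.8-1.9ms is even closer to the limit than it looks.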

[Image: matvec benchmark]

Tokenizer Speed

When it comes to the BPE tokenizer, LLMpp is pretty damn good. It actually outperforms OpenAI's tiktoken at encoding: in a 5,000-character encode test, LLMpp is roughly 53.6% faster. It only loses when decoding 2,000 tokens, by a 0.7x margin.

However...

Intel's MKL is slower on AMD CPUs. These benchmarks were run on an AMD Ryzen 7 5800HS.
Tiktoken is a more "general purpose" tool and has to account for edge cases that LLMpp's tokenizer skips/doesn't care about.
Tiktoken was not multi-threaded during this specific benchmark.

[Image: tokenizer benchmark]

installation

stuff you need

g++/clang++ with c++20 support
make
x86_64 cpu with avx2 (for simd)

build

1) clone the repo:

git clone https://github.com/TheUnium/llmpp.git
cd llmpp

2) build:

make

3) run:

./llm <model_dir> <tokenizer.json> [options]

running tests

make tests
./llm_tests # this runs all tests
./llm_tests --tests tensor/simd # this runs the specified tests

available test modules: tensor/tensor, tensor/ops, tensor/simd, tensor/qnt8, thread/thread, tokenizer/bpe

usage

usage:
  ./llm <model_dir> <tokenizer.json> [options]

options:
  --q8                  quantize weights to int8
  --max-tokens <int>    max generation tokens (default: 512)
  --temp <float>        temperature (default: 0.7)
  --top-k <int>         top-k (default: 40)
  --top-p <float>       top-p (default: 0.9)
  --rep-penalty <float> repetition penalty (default: 1.2)
  --system <string>     system prompt (default: built-in)
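
For anyone wondering how the sampling options fit together, here is a minimal sketch of one common way to chain them: repetition penalty, then top-k, then temperature softmax, then top-p. This is an assumption about a typical pipeline, not LLMpp's actual implementation, and the real order of operations may differ. `SamplingParams`, `filtered_probs`, and `sample_token` are hypothetical names; only the default values come from the option list above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical parameter bundle mirroring the CLI defaults.
struct SamplingParams {
    float temperature = 0.7f;
    int top_k = 40;          // <= 0 disables top-k in this sketch
    float top_p = 0.9f;
    float rep_penalty = 1.2f;
};

// Returns one probability per token id; filtered-out tokens get probability 0.
std::vector<float> filtered_probs(std::vector<float> logits,
                                  const std::vector<int>& recent_tokens,
                                  const SamplingParams& p) {
    assert(!logits.empty() && p.temperature > 0.0f && p.top_p > 0.0f);

    // Repetition penalty: push down logits of tokens already generated
    // (applied once per distinct token).
    std::vector<bool> seen(logits.size(), false);
    for (int t : recent_tokens) {
        if (t < 0 || static_cast<std::size_t>(t) >= logits.size() || seen[t]) continue;
        seen[t] = true;
        logits[t] = logits[t] > 0.0f ? logits[t] / p.rep_penalty
                                     : logits[t] * p.rep_penalty;
    }

    // Token ids sorted by descending logit.
    std::vector<std::size_t> order(logits.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return logits[a] > logits[b]; });

    // Top-k: keep only the k highest-scoring tokens.
    std::size_t keep = order.size();
    if (p.top_k > 0) keep = std::min(keep, static_cast<std::size_t>(p.top_k));

    // Temperature-scaled softmax over the kept tokens.
    std::vector<float> probs(logits.size(), 0.0f);
    const float max_logit = logits[order[0]];
    float sum = 0.0f;
    for (std::size_t i = 0; i < keep; ++i) {
        probs[order[i]] = std::exp((logits[order[i]] - max_logit) / p.temperature);
        sum += probs[order[i]];
    }

    // Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    float cumulative = 0.0f;
    for (std::size_t i = 0; i < keep; ++i) {
        float& pr = probs[order[i]];
        pr /= sum;
        if (cumulative >= p.top_p) { pr = 0.0f; continue; }
        cumulative += pr;
    }
    for (float& pr : probs) pr /= cumulative;  // renormalize the survivors
    return probs;
}

// Draw a token id from the filtered distribution.
int sample_token(const std::vector<float>& probs, std::mt19937& rng) {
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```

Lower temperature sharpens the distribution, top-k and top-p cut off the unlikely tail, and the repetition penalty discourages the model from looping on the same tokens.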

chat commands: /quit, /clear, /help
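
Under the hood, a chat turn has to be wrapped in ChatML markers before it is tokenized. A minimal sketch of that formatting, using the standard ChatML layout with the `<|im_start|>` / `<|im_end|>` tokens mentioned earlier. The `ChatMessage` struct and `format_chatml` function are hypothetical and not taken from the LLMpp source:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical message type: role is "system", "user", or "assistant".
struct ChatMessage {
    std::string role;
    std::string content;
};

// Build a ChatML prompt from the conversation so far.
std::string format_chatml(const std::vector<ChatMessage>& messages) {
    std::string prompt;
    for (const ChatMessage& m : messages) {
        assert(!m.role.empty());
        prompt += "<|im_start|>" + m.role + "\n" + m.content + "<|im_end|>\n";
    }
    // Leave an open assistant turn for the model to complete.
    prompt += "<|im_start|>assistant\n";
    return prompt;
}
```

Generation would then stop once the model emits `<|im_end|>`, and a command like /clear would presumably just reset the message list.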

Contact

GitHub: @TheUnium
Email: