Tiny-LLM

CUDA-native C++ inference engine for focused Transformer workloads.

Overview

Tiny-LLM keeps the repository surface deliberately small: CUDA/C++17 kernels, W8A16 quantization, explicit KV cache management, and a narrow runtime path that is easier to audit and maintain.
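As a rough illustration of what W8A16 means in practice, the sketch below quantizes one weight row to INT8 with a symmetric scale and applies that scale inside a dot product. It is a host-side C++ sketch, not Tiny-LLM's kernel code: `QuantizedRow`, `quantize_row`, `dot_w8`, and the per-row symmetric scheme are assumptions made for illustration, and `float` stands in for the FP16 activations.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one INT8 weight row plus its per-row scale.
// Each original weight is approximately q[i] * scale.
struct QuantizedRow {
    std::vector<std::int8_t> q;  // quantized weights in [-127, 127]
    float scale = 1.0f;
};

// Symmetric per-row quantization: scale = max|w| / 127.
inline QuantizedRow quantize_row(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));

    QuantizedRow out;
    out.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    out.q.reserve(w.size());
    for (float v : w) {
        long r = std::lround(v / out.scale);
        r = std::clamp(r, -127L, 127L);  // guard against rounding overshoot
        out.q.push_back(static_cast<std::int8_t>(r));
    }
    return out;
}

// Dot product of INT8 weights with higher-precision activations.
// The scale is applied once per row instead of once per element.
inline float dot_w8(const QuantizedRow& row, const std::vector<float>& x) {
    assert(row.q.size() == x.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc += static_cast<float>(row.q[i]) * x[i];
    }
    return acc * row.scale;
}
```

Calling `quantize_row` on a weight row and passing the result to `dot_w8` with an activation vector of the same length gives a value close to the full-precision dot product, with an error bounded by the quantization step.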

What is implemented

W8A16 inference path with INT8 weights and FP16 activations
Explicit KV cache management for autoregressive decoding
CUDA-native kernels with shared-memory and warp-level optimization patterns
Result<T>-based fallible APIs for host-side error propagation
GoogleTest + RapidCheck coverage for core runtime paths
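To make "explicit KV cache management" more concrete, here is a minimal host-side sketch of a preallocated cache for autoregressive decoding. The class name `KVCacheSketch`, its methods, and the `[layer][position][kv_dim]` layout are assumptions for illustration only; Tiny-LLM's actual cache lives on the GPU and may be organized differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical KV cache: one contiguous key buffer and one value buffer,
// both laid out as [layer][position][kv_dim] and allocated up front.
class KVCacheSketch {
public:
    KVCacheSketch(std::size_t num_layers, std::size_t max_seq_len, std::size_t kv_dim)
        : num_layers_(num_layers), max_seq_len_(max_seq_len), kv_dim_(kv_dim),
          keys_(num_layers * max_seq_len * kv_dim, 0.0f),
          values_(num_layers * max_seq_len * kv_dim, 0.0f) {}

    // Stores the K/V vectors of `layer` at the current position.
    // Returns false if the cache is full or the arguments do not fit.
    bool write(std::size_t layer, const std::vector<float>& k, const std::vector<float>& v) {
        if (layer >= num_layers_ || length_ >= max_seq_len_) return false;
        if (k.size() != kv_dim_ || v.size() != kv_dim_) return false;
        const std::size_t off = offset(layer, length_);
        std::copy(k.begin(), k.end(), keys_.data() + off);
        std::copy(v.begin(), v.end(), values_.data() + off);
        return true;
    }

    // Called once per decoded token, after every layer has written.
    bool advance() {
        if (length_ >= max_seq_len_) return false;
        ++length_;
        return true;
    }

    // Starts a new sequence without reallocating.
    void reset() { length_ = 0; }

    std::size_t length() const { return length_; }
    const float* key_at(std::size_t layer, std::size_t pos) const {
        return keys_.data() + offset(layer, pos);
    }
    const float* value_at(std::size_t layer, std::size_t pos) const {
        return values_.data() + offset(layer, pos);
    }

private:
    std::size_t offset(std::size_t layer, std::size_t pos) const {
        return (layer * max_seq_len_ + pos) * kv_dim_;
    }

    std::size_t num_layers_;
    std::size_t max_seq_len_;
    std::size_t kv_dim_;
    std::size_t length_ = 0;
    std::vector<float> keys_;
    std::vector<float> values_;
};
```

In a decode loop, each layer would call `write` for the new token, attention would read positions `0..length()` through `key_at` and `value_at`, and `advance` would run once all layers are done. Keeping the position counter explicit is what lets the caller decide when a sequence ends and when to `reset`.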

Model-loading surfaces

InferenceEngine::load() currently accepts the repository's supported binary runtime format.
GGUFParser is available for GGUF parsing, metadata extraction, and tensor inspection.
Direct GGUF runtime loading is not part of the current inference path.
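For readers unfamiliar with GGUF, the sketch below shows the kind of inspection a parser performs first: validating the file header. It does not use `GGUFParser` (whose API is not shown here). `GGUFHeaderSketch`, `read_le`, and `read_gguf_header` are hypothetical names, and the layout assumed is the little-endian GGUF v2+ header (4-byte magic `GGUF`, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical struct holding the fixed-size fields of a GGUF v2+ header.
struct GGUFHeaderSketch {
    std::uint32_t version = 0;
    std::uint64_t tensor_count = 0;
    std::uint64_t metadata_kv_count = 0;
};

// Reads `n` little-endian bytes (n <= 8) starting at offset `off`.
// The caller is responsible for bounds checking.
inline std::uint64_t read_le(const std::vector<std::uint8_t>& b, std::size_t off, std::size_t n) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i) {
        v |= static_cast<std::uint64_t>(b[off + i]) << (8 * i);
    }
    return v;
}

// Returns the parsed header, or std::nullopt if the buffer is too short,
// the magic does not match, or the version predates 64-bit counts.
inline std::optional<GGUFHeaderSketch> read_gguf_header(const std::vector<std::uint8_t>& b) {
    constexpr std::size_t kHeaderSize = 4 + 4 + 8 + 8;
    if (b.size() < kHeaderSize) return std::nullopt;
    if (b[0] != 'G' || b[1] != 'G' || b[2] != 'U' || b[3] != 'F') return std::nullopt;

    GGUFHeaderSketch h;
    h.version = static_cast<std::uint32_t>(read_le(b, 4, 4));
    if (h.version < 2) return std::nullopt;  // v1 used 32-bit counts; not handled here
    h.tensor_count = read_le(b, 8, 8);
    h.metadata_kv_count = read_le(b, 16, 8);
    return h;
}
```

Metadata entries and tensor descriptors follow this header in a real file; reading them is the part a full parser adds on top of this check.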

Build from source

Tiny-LLM requires a working CUDA toolchain (nvcc on PATH or an equivalent configured installation).

Component	Minimum
NVIDIA GPU	Compute Capability 7.0+
CUDA Toolkit	11.0+
CMake	3.18+
C++ Compiler	GCC 9+ or Clang 10+

git clone https://github.com/AICL-Lab/tiny-llm.git
cd tiny-llm

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON
cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure --timeout 300

Minimal usage example

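#include <iostream>
#include <utility>  // std::move
#include <tiny_llm/inference_engine.h>

int main() {
    using namespace tiny_llm;

    // Model shape; these values must match the weights stored in model.bin.
    ModelConfig config;
    config.vocab_size = 32000;
    config.hidden_dim = 4096;
    config.num_layers = 32;

    // load() returns a Result, so check for failure before touching the value.
    auto engine_result = InferenceEngine::load("model.bin", config);
    if (engine_result.isErr()) {
        std::cerr << engine_result.error() << '\n';
        return 1;
    }

    // Sampling settings for generation.
    GenerationConfig gen;
    gen.max_new_tokens = 64;
    gen.temperature = 0.7f;
    gen.top_p = 0.9f;
    gen.do_sample = true;

    // Take ownership of the engine, then generate from a prompt of token IDs.
    auto engine = std::move(engine_result.value());
    auto output = engine->generate({1, 15043, 29892}, gen);
    if (output.isErr()) {
        std::cerr << output.error() << '\n';
        return 1;
    }

    // On success, output.value() holds the generation result.
    return 0;
}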
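
The usage example above checks `isErr()`, reads `error()`, and unwraps `value()`. For readers who have not used this pattern, the sketch below shows one way such a type can be built on `std::variant`. `ResultSketch` and `checked_divide` are hypothetical, the `std::string` error type is an assumption, and Tiny-LLM's real `Result<T>` may differ in both interface and implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>

// Hypothetical Result-like type: holds either a value of type T or an error message.
template <typename T>
class ResultSketch {
public:
    static ResultSketch ok(T v) {
        return ResultSketch(std::in_place_index<0>, std::move(v));
    }
    static ResultSketch err(std::string msg) {
        return ResultSketch(std::in_place_index<1>, std::move(msg));
    }

    bool isOk() const { return data_.index() == 0; }
    bool isErr() const { return data_.index() == 1; }

    // Both accessors throw std::bad_variant_access if the other alternative is held,
    // so callers should check isErr() first.
    T& value() { return std::get<0>(data_); }
    const std::string& error() const { return std::get<1>(data_); }

private:
    // Index-based construction keeps the two alternatives distinct even when T is std::string.
    template <std::size_t I, typename U>
    ResultSketch(std::in_place_index_t<I> tag, U&& u) : data_(tag, std::forward<U>(u)) {}

    std::variant<T, std::string> data_;
};

// Hypothetical fallible function used to exercise the type.
inline ResultSketch<int> checked_divide(int a, int b) {
    if (b == 0) return ResultSketch<int>::err("division by zero");
    return ResultSketch<int>::ok(a / b);
}
```

A caller would write `auto r = checked_divide(6, 3);`, test `r.isErr()`, and only then read `r.value()`, which mirrors the control flow in the usage example. Returning errors as values keeps failure handling visible at each call site instead of relying on exceptions.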

Repository map

include/tiny_llm/         Public headers
src/                      Host-side C++ implementation
kernels/                  CUDA kernels
tests/                    Unit and property tests
docs/                     VitePress documentation site
.github/workflows/        CI, Pages, and release automation
CHANGELOG.md              Canonical tracked release history

Contributing

Read CONTRIBUTING.md before sending changes. Keep changes focused, keep docs aligned with the real runtime surface, and keep the repository free of duplicate workflow scaffolding.

License

Tiny-LLM is released under the MIT License.
