CUDA-native C++ inference engine for focused Transformer workloads.
Tiny-LLM keeps the repository surface deliberately small: CUDA/C++17 kernels, W8A16 quantization, explicit KV cache management, and a narrow runtime path that is easier to audit and maintain.
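As a rough illustration of what "explicit KV cache management" can look like, the host-side sketch below preallocates key/value storage for a fixed maximum sequence length and tracks the number of cached tokens by hand. The type `KVCacheSketch`, its fields, and its memory layout are hypothetical and chosen for readability; they are not Tiny-LLM's actual cache API, and a real implementation would keep these buffers in device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only: not the Tiny-LLM KV cache interface.
struct KVCacheSketch {
    std::size_t num_layers;
    std::size_t max_seq_len;
    std::size_t kv_dim;         // e.g. num_kv_heads * head_dim
    std::size_t seq_len = 0;    // number of tokens currently cached
    std::vector<float> keys;    // layout: [num_layers][max_seq_len][kv_dim]
    std::vector<float> values;  // same layout as keys

    KVCacheSketch(std::size_t layers, std::size_t max_len, std::size_t dim)
        : num_layers(layers), max_seq_len(max_len), kv_dim(dim),
          keys(layers * max_len * dim, 0.0f),
          values(layers * max_len * dim, 0.0f) {}

    // Flat index of the first element for (layer, position).
    std::size_t offset(std::size_t layer, std::size_t pos) const {
        return (layer * max_seq_len + pos) * kv_dim;
    }

    // Store the K/V vectors of the token at position `seq_len` for one layer.
    // Returns false instead of writing out of bounds.
    bool write(std::size_t layer, const std::vector<float>& k,
               const std::vector<float>& v) {
        assert(k.size() == kv_dim && v.size() == kv_dim);
        if (layer >= num_layers || seq_len >= max_seq_len) return false;
        const std::size_t base = offset(layer, seq_len);
        for (std::size_t i = 0; i < kv_dim; ++i) {
            keys[base + i] = k[i];
            values[base + i] = v[i];
        }
        return true;
    }

    // Call once per decoded token, after every layer has written its K/V.
    bool advance() {
        if (seq_len >= max_seq_len) return false;
        ++seq_len;
        return true;
    }

    // Start a new sequence without reallocating.
    void reset() { seq_len = 0; }
};
```

The point of the explicit design is that allocation happens once, capacity is checked at every write, and the caller decides when a sequence ends.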
- W8A16 inference path with INT8 weights and FP16 activations
- Explicit KV cache management for autoregressive decoding
- CUDA-native kernels with shared-memory and warp-level optimization patterns
- `Result<T>`-based fallible APIs for host-side error propagation
- GoogleTest + RapidCheck coverage for core runtime paths
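To make the W8A16 idea concrete, here is a minimal host-side sketch of weight-only INT8 quantization: each weight row is stored as `int8_t` values plus one scale, and the weights are dequantized on the fly during a matrix-vector product. It assumes symmetric per-row scaling and uses `float` as a stand-in for FP16 activations, since standard C++ has no portable half type. The names `QuantizedMatrix`, `quantize_rows`, and `matvec_w8` are hypothetical and do not come from the Tiny-LLM codebase, whose quantization scheme may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only: not Tiny-LLM's quantization format.
struct QuantizedMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::int8_t> q;  // row-major INT8 weights
    std::vector<float> scales;   // one scale per row
};

// Symmetric per-row quantization: w ≈ scale * q, with q in [-127, 127].
QuantizedMatrix quantize_rows(const std::vector<float>& w,
                              std::size_t rows, std::size_t cols) {
    assert(w.size() == rows * cols);
    QuantizedMatrix m{rows, cols,
                      std::vector<std::int8_t>(rows * cols),
                      std::vector<float>(rows)};
    for (std::size_t r = 0; r < rows; ++r) {
        float max_abs = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            max_abs = std::max(max_abs, std::fabs(w[r * cols + c]));
        }
        // Avoid dividing by zero for an all-zero row.
        const float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
        m.scales[r] = scale;
        for (std::size_t c = 0; c < cols; ++c) {
            float v = std::round(w[r * cols + c] / scale);
            v = std::min(127.0f, std::max(-127.0f, v));
            m.q[r * cols + c] = static_cast<std::int8_t>(v);
        }
    }
    return m;
}

// y = W x, dequantizing each weight as it is read.
// `x` stands in for the FP16 activation vector.
std::vector<float> matvec_w8(const QuantizedMatrix& m,
                             const std::vector<float>& x) {
    assert(x.size() == m.cols);
    std::vector<float> y(m.rows, 0.0f);
    for (std::size_t r = 0; r < m.rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < m.cols; ++c) {
            acc += static_cast<float>(m.q[r * m.cols + c]) * x[c];
        }
        y[r] = acc * m.scales[r];  // apply the row scale once per output
    }
    return y;
}
```

Applying the scale once per output row, rather than once per weight, is what keeps the inner loop cheap; a CUDA kernel would follow the same arithmetic with the reduction spread across threads.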
- `InferenceEngine::load()` currently accepts the repository's supported binary runtime format.
- `GGUFParser` is available for GGUF parsing, metadata extraction, and tensor inspection.
- Direct GGUF runtime loading is not part of the current inference path.
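For readers unfamiliar with GGUF, the sketch below shows the kind of work metadata inspection starts with: decoding the fixed-size file header. It is a standalone illustration based on the public GGUF layout for versions 2 and 3 (4-byte magic `GGUF`, little-endian `uint32` version, `uint64` tensor count, `uint64` metadata key/value count). It does not use or describe the `GGUFParser` interface; `GgufHeaderSketch`, `read_le`, and `parse_gguf_header` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only: not the GGUFParser API.
struct GgufHeaderSketch {
    std::uint32_t version;
    std::uint64_t tensor_count;
    std::uint64_t metadata_kv_count;
};

// Decode an n-byte little-endian unsigned integer, independent of host endianness.
inline std::uint64_t read_le(const std::uint8_t* p, std::size_t n) {
    assert(n <= 8);
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i) {
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    }
    return v;
}

// Returns true and fills `out` if `buf` starts with a GGUF v2+ header.
bool parse_gguf_header(const std::vector<std::uint8_t>& buf,
                       GgufHeaderSketch& out) {
    const std::size_t kHeaderSize = 4 + 4 + 8 + 8;
    if (buf.size() < kHeaderSize) return false;
    if (buf[0] != 'G' || buf[1] != 'G' || buf[2] != 'U' || buf[3] != 'F') {
        return false;  // wrong magic
    }
    const std::uint32_t version =
        static_cast<std::uint32_t>(read_le(buf.data() + 4, 4));
    if (version < 2) return false;  // v1 used 32-bit counts; not handled here
    out.version = version;
    out.tensor_count = read_le(buf.data() + 8, 8);
    out.metadata_kv_count = read_le(buf.data() + 16, 8);
    return true;
}
```

The metadata key/value pairs and tensor descriptors follow this header in the file; parsing them is outside the scope of this sketch.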
Tiny-LLM requires a working CUDA toolchain (nvcc on PATH or an equivalent configured installation).
| Component | Minimum |
|---|---|
| NVIDIA GPU | Compute Capability 7.0+ |
| CUDA Toolkit | 11.0+ |
| CMake | 3.18+ |
| C++ Compiler | GCC 9+ or Clang 10+ |
```bash
git clone https://github.com/AICL-Lab/tiny-llm.git
cd tiny-llm
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON
cmake --build build -j$(nproc)
# --test-dir requires CMake 3.20+; with older versions, run ctest from inside build/.
ctest --test-dir build --output-on-failure --timeout 300
```

```cpp
#include <iostream>
#include <utility>  // std::move

#include <tiny_llm/inference_engine.h>

int main() {
    using namespace tiny_llm;

    // Describe the model to load.
    ModelConfig config;
    config.vocab_size = 32000;
    config.hidden_dim = 4096;
    config.num_layers = 32;

    // load() returns a Result; check it before touching the value.
    auto engine_result = InferenceEngine::load("model.bin", config);
    if (engine_result.isErr()) {
        std::cerr << engine_result.error() << '\n';
        return 1;
    }

    // Sampling settings for generation.
    GenerationConfig gen;
    gen.max_new_tokens = 64;
    gen.temperature = 0.7f;
    gen.top_p = 0.9f;
    gen.do_sample = true;

    // Take ownership of the engine and generate from a prompt of token IDs.
    auto engine = std::move(engine_result.value());
    auto output = engine->generate({1, 15043, 29892}, gen);
    if (output.isErr()) {
        std::cerr << output.error() << '\n';
        return 1;
    }

    return 0;
}
```

```
include/tiny_llm/     Public headers
src/                  Host-side C++ implementation
kernels/              CUDA kernels
tests/                Unit and property tests
docs/                 VitePress documentation site
.github/workflows/    CI, Pages, and release automation
CHANGELOG.md          Canonical tracked release history
```
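The `temperature` and `top_p` fields set on `GenerationConfig` in the usage example correspond to temperature scaling and nucleus (top-p) sampling. The sketch below shows the commonly used definition: apply a temperature-scaled softmax, keep the smallest set of highest-probability tokens whose cumulative mass reaches `top_p`, and renormalize. It illustrates the technique in general; `top_p_distribution` is a hypothetical helper, and Tiny-LLM's sampler may differ in details such as tie-breaking or where the computation runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical illustration only: not Tiny-LLM's sampler.
// Returns the renormalized distribution restricted to the top-p nucleus.
std::vector<float> top_p_distribution(const std::vector<float>& logits,
                                      float temperature, float top_p) {
    assert(!logits.empty());
    assert(temperature > 0.0f);
    assert(top_p > 0.0f && top_p <= 1.0f);

    // Temperature-scaled softmax, shifted by the max logit for stability.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;

    // Visit token indices from most to least probable.
    std::vector<std::size_t> order(probs.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&probs](std::size_t a, std::size_t b) {
                  return probs[a] > probs[b];
              });

    // Keep the smallest prefix whose cumulative mass reaches top_p.
    std::vector<float> out(probs.size(), 0.0f);
    float kept = 0.0f;
    for (std::size_t idx : order) {
        out[idx] = probs[idx];
        kept += probs[idx];
        if (kept >= top_p) break;
    }

    // Renormalize so the kept tokens sum to 1.
    for (float& p : out) p /= kept;
    return out;
}
```

A token would then be drawn from the returned distribution, for example with `std::discrete_distribution`. Lower temperatures sharpen the softmax, and lower `top_p` values shrink the nucleus, so both make output more deterministic.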
Read CONTRIBUTING.md before sending changes. Keep changes focused, keep docs aligned with the real runtime surface, and keep the repository free of duplicate workflow scaffolding.
Tiny-LLM is released under the MIT License.