A lightweight model server that serves small language models (default: `Qwen3-0.6B-GGUF`) as a **thin wrapper** around `llama-cpp`, exposing the OpenAI-compatible `/chat/completions` API. Core logic is under 100 lines in `./slm_server/app.py`.

> This is still a WIP project; issues and pull requests are welcome. I mainly use this repo to deploy an SLM as part of the backend for my own site [x3huang.dev](https://x3huang.dev/), while trying my best to keep the repo model-agnostic.
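
For example, any OpenAI client can point at the running server. The sketch below assumes a local instance on port 8000 with a `/v1` route prefix (both are assumptions; if the server exposes `/chat/completions` at the root, drop the `/v1`):

```python
# Minimal sketch against a local slm_server instance.
# Assumptions: address http://localhost:8000, routes under /v1, no real API key.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local address and prefix
    api_key="not-needed",                 # placeholder; a local server may ignore it
)

resp = client.chat.completions.create(
    model="Qwen3-0.6B-GGUF",  # the repo's default model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```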

## Features

- **OpenAI-compatible API** - Drop-in replacement with `/chat/completions` endpoint and streaming support (see the streaming sketch after this list)
- **Llama.cpp integration** - High-performance inference optimized for limited CPU and memory resources
- **Production observability** - Built-in logging, Prometheus metrics, and OpenTelemetry tracing
- **Enterprise deployment** - Complete CI/CD pipeline with unit tests, e2e tests, Helm charts, and Docker support
- **Simple configuration** - Environment-based config with sensible defaults (illustrative sketch below)
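
The streaming support mentioned above follows the standard OpenAI chunked response shape; here is a minimal sketch under the same address and routing assumptions as the earlier example:

```python
# Streaming sketch: prints tokens as they arrive.
# Same assumptions as above: local server on port 8000, /v1 prefix, no real API key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen3-0.6B-GGUF",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,  # request incremental chunks instead of a single response
)
for chunk in stream:
    # Each chunk carries a delta; content can be None on role/finish chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```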
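
For the environment-based configuration, here is a purely illustrative sketch of the pattern (the variable names are hypothetical, not taken from the repo; check the settings code for the real ones):

```python
# Illustrative only: SLM_MODEL_PATH and SLM_PORT are hypothetical names,
# not slm_server's actual settings; they just show the env-with-defaults pattern.
import os

model_path = os.environ.get("SLM_MODEL_PATH", "./models/Qwen3-0.6B.gguf")
port = int(os.environ.get("SLM_PORT", "8000"))
print(f"Would serve {model_path} on port {port}")
```
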
## Use Cases