
Commit bc8d868

Merge pull request #7 from XyLearningProgramming/feat/comp
✨ feat/comp
2 parents 44cc9ac + 2370d81 commit bc8d868

21 files changed

Lines changed: 2713 additions & 895 deletions

README.md

Lines changed: 33 additions & 76 deletions
@@ -1,55 +1,48 @@
-# Small-Language-Model Server
+# Small Language Model Server
 
 [![CI Pipeline](https://github.com/XyLearningProgramming/slm_server/actions/workflows/ci.yml/badge.svg)](https://github.com/XyLearningProgramming/slm_server/actions/workflows/ci.yml)
 [![codecov](https://codecov.io/gh/XyLearningProgramming/slm_server/branch/main/graph/badge.svg)](https://codecov.io/gh/XyLearningProgramming/slm_server)
 [![Docker](https://img.shields.io/badge/docker-ready-blue.svg)](https://hub.docker.com/r/x3huang/slm_server)
 [![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
 
-🚀 A light model server that serves small language models (default: `Qwen3-0.6B-GGUF`) as a **thin wrapper** around `llama-cpp` exposing the OpenAI-compatible `/chat/completions` API. Core logic is just <100 lines under `./slm_server/app.py`!
+A lightweight model server that serves small language models (default: Qwen3-0.6B-GGUF) as a thin wrapper around llama-cpp with OpenAI-compatible `/chat/completions` API. Core logic is <100 lines in `./slm_server/app.py`.
 
-> This is still a WIP project. Issues, pull-requests are welcome. I mainly use this repo to deploy a SLM model as part of the backend on my own site [x3huang.dev](https://x3huang.dev/) while trying my best to keep this repo model-agonistic.
+## Features
 
-## ✨ Features
+- **OpenAI-compatible API** - Drop-in replacement with `/chat/completions` endpoint and streaming support
+- **Llama.cpp integration** - High-performance inference optimized for limited CPU and memory resources
+- **Production observability** - Built-in logging, Prometheus metrics, and OpenTelemetry tracing
+- **Enterprise deployment** - Complete CI/CD pipeline with unit tests, e2e tests, Helm charts, and Docker support
+- **Simple configuration** - Environment-based config with sensible defaults
 
-![Thin wrapper around llama cpp](./docs/20250712_slm_img1.jpg)
+## Use Cases
 
-- 🔌 **OpenAI-compatible API** - Drop-in replacement with `/chat/completions` endpoint and streaming support
-- **Llama.cpp integration** - High-performance inference optimized for limited CPU and memory resources
-- 📊 **Production observability** - Built-in logging, Prometheus metrics, and OpenTelemetry tracing (all configurable)
-- 🚀 **Enterprise deployment** - Complete CI/CD pipeline with unit tests, e2e tests, Helm charts, and Docker support
-- 🔧 **Simple configuration** - Environment-based config with sensible defaults
+- **Self-hosting** - Deploy small models under resource constraints
+- **Privacy-first inference** - No user content logging, complete data control
+- **Development environments** - Local LLM testing and prototyping
+- **Edge deployments** - Lightweight inference in constrained environments
+- **API standardization** - Unified OpenAI-compatible interface for small models
 
-## 🚀 Quick Start
+## Quick Start
 
 ### Local Development
 
 ```bash
-# 1. Get your model
+# Download model
 ./scripts/download.sh  # Downloads default Qwen3-0.6B-GGUF
-# OR place your own GGUF model in models/ directory
 
-# 2. Install dependencies
+# Install and start
 uv sync
-
-# 3. Configure (optional)
-cp .env.example .env  # Edit as needed
-
-# 4. Start the server
 ./scripts/start.sh
 ```
 
 ### Docker
 
 ```bash
-# Pull and run
 docker run -p 8000:8000 -v $(pwd)/models:/app/models x3huang/slm_server/general
-
-# Or build locally
-docker build -t slm-server .
-docker run -p 8000:8000 -v $(pwd)/models:/app/models slm_server
 ```
 
-### Test the API
+### Test API
 
 ```bash
 curl -X POST http://localhost:8000/api/v1/chat/completions \
@@ -61,57 +54,26 @@ curl -X POST http://localhost:8000/api/v1/chat/completions \
 }'
 ```
 
-## 🎯 Why SLM Server?
-
-- **🎯 Unified access** - Single point of entry for SLM inference with concurrency control
-- **💰 Cost-effective** - Perfect for self-hosting small models under resource constraints
-- **🔒 Privacy-matters** - No user content logging, complete data control
-- **⚡ Performance** - As thin wrapper around `llama-cpp`
-
-## 📊 Observability Stack
-
-All observability components are **configurable** and **enabled by default** for production readiness.
-
-### 📝 Structured Logging
-Request lifecycle logging with trace correlation:
-
-```log
-2025-07-21 09:52:32,475 INFO [slm_server.utils] 2025-07-21 09:52:32,475 INFO [slm_server.utils] [utils.py:341] [trace_id=e4a2ed019bd6fe95d611d7b29b90db4f span_id=c8fcaa72b8732e29 resource.service.name= trace_sampled=True] - [SLM] starting streaming: {'max_tokens': 2048, 'temperature': 0.7, 'input_messages': 1, 'input_content_length': 15}
-
-2025-07-21 09:52:36,496 INFO [slm_server.utils] [utils.py:404] [trace_id=e4a2ed019bd6fe95d611d7b29b90db4f span_id=c8fcaa72b8732e29 resource.service.name= trace_sampled=True] - [SLM] completed streaming: {'duration_ms': 4021.32, 'output_content_length': 468, 'total_tokens': 111, 'completion_tokens': 108, 'completion_tokens_per_second': 26.86, 'total_tokens_per_second': 27.6, 'chunk_count': 108, 'avg_chunk_delay_ms': 37.23, 'first_token_delay_ms': 38.19, 'avg_chunk_size': 259.45, 'avg_chunk_content_size': 4.25, 'chunks_with_content': 108, 'empty_chunks': 2}
-```
-
-### 📈 Prometheus Metrics
-Available at `/metrics` endpoint:
-- Request latency and throughput
-- Token generation rates
-- Model memory usage
-- Error rates and types
+## Observability
 
-### 🔍 OpenTelemetry Tracing
-Distributed tracing with:
-- Request flow visualization, each stream response as extra event if any
-- Performance bottleneck identification
+All observability components are configurable and enabled by default:
 
-## ⚙️ Configuration
+- **Structured Logging** - Request lifecycle logging with trace correlation
+- **Prometheus Metrics** - Available at `/metrics` (latency, throughput, token rates, memory usage)
+- **OpenTelemetry Tracing** - Distributed tracing with request flow visualization
 
-Configure via environment variables (prefix: `SLM_`) or `.env` file.
+## Configuration
 
-See [`./slm_server/config.py`](./slm_server/config.py) for complete configuration options.
+Configure via environment variables (prefix: `SLM_`) or `.env` file. See [`./slm_server/config.py`](./slm_server/config.py) for all options.
 
-## 🚢 Deployment
+## Deployment
 
 ### Kubernetes with Helm
 
 ```bash
-# Deploy to production
 helm upgrade --install slm-server ./deploy/helm \
   --namespace backend \
   --values ./deploy/helm/values.yaml
-
-# Monitor deployment
-kubectl get pods -n backend
-kubectl logs -f deployment/slm-server -n backend
 ```
 
 ### Docker Compose
@@ -125,43 +87,38 @@ services:
       - "8000:8000"
     volumes:
       - ./models:/app/models
-    # Optional
     environment:
       - slm_server_PATH=/app/models/your-model.gguf
 ```
 
-## 🧪 Development
+## Development
 
-### Running Tests
+### Testing
 
 ```bash
 # Unit tests
 uv run pytest tests/ --ignore=tests/e2e/
 
-# End-to-end tests (with server pulled up)
+# End-to-end tests
 uv run python ./tests/e2e/main.py
 
 # With coverage
-uv run pytest tests/ --ignore=tests/e2e/ --cov=slm_server --cov-report=html --cov-report=term-missing
+uv run pytest tests/ --ignore=tests/e2e/ --cov=slm_server --cov-report=html
 ```
 
 ### Code Quality
 
 ```bash
-# Linting and formatting
 uv run ruff check .
 uv run ruff format .
 ```
 
-## 📚 API Documentation
+## API Documentation
 
-Once running, visit:
 - **Interactive docs**: http://localhost:8000/docs
 - **OpenAPI spec**: http://localhost:8000/openapi.json
 - **Health check**: http://localhost:8000/health
 
-## 📄 License
-
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-
+## License
 
+MIT License - see [LICENSE](LICENSE) file for details.
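
For reference on the reworked "Test API" section above: a streaming request against the same endpoint might look like the sketch below. The endpoint path and port come from this README; the message payload and `stream` flag follow the standard OpenAI chat-completions schema the server claims compatibility with, and are illustrative rather than copied from the repository.

```bash
# Illustrative only: streaming variant of the README's curl example.
# -N disables curl's output buffering so SSE chunks print as they arrive.
curl -N -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": true
  }'
# The server answers with "data: {...}" chunks and a final "data: [DONE]" line.
```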

pyproject.toml

Lines changed: 3 additions & 0 deletions
@@ -26,6 +26,9 @@ select = ["C", "E", "F", "W"]
 [dependency-groups]
 dev = [
     "httpx>=0.28.1",
+    "langchain>=0.3.26",
+    "langchain-core>=0.3.71",
+    "langchain-openai>=0.3.28",
     "pytest>=8.4.1",
     "pytest-cov>=4.0.0",
     "ruff>=0.12.3",

pytest.ini

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+[pytest]
+markers =
+    api: marks tests as api tests
+    api_non_streaming: marks tests as api and non_streaming tests
+    langchain: marks tests as langchain compatibility tests
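
With these markers registered, subsets of the suite can be selected by marker; for example (paths follow the README's test commands):

```bash
# Run only the langchain compatibility tests
uv run pytest tests/ -m langchain --ignore=tests/e2e/

# Run the api tests but skip the non-streaming ones
uv run pytest tests/ -m "api and not api_non_streaming" --ignore=tests/e2e/
```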

slm_server/app.py

Lines changed: 62 additions & 37 deletions
@@ -1,23 +1,27 @@
 import asyncio
+import json
 import traceback
+from http import HTTPStatus
 from typing import Annotated, AsyncGenerator
 
 from fastapi import Depends, FastAPI, HTTPException
 from fastapi.responses import StreamingResponse
-from llama_cpp import Llama
+from llama_cpp import CreateChatCompletionStreamResponse, Llama
 
 from slm_server.config import Settings, get_settings
 from slm_server.logging import setup_logging
 from slm_server.metrics import setup_metrics
 from slm_server.model import (
     ChatCompletionRequest,
-    ChatCompletionResponse,
-    ChatCompletionStreamResponse,
+    EmbeddingRequest,
 )
 from slm_server.trace import setup_tracing
 from slm_server.utils import (
     set_atrribute_response,
     set_atrribute_response_stream,
+    set_attribute_cancelled,
+    set_attribute_response_embedding,
+    slm_embedding_span,
     slm_span,
 )
 
@@ -28,6 +32,11 @@
 MAX_CONCURRENCY = 1
 # Default timeout message in detail field.
 DETAIL_SEM_TIMEOUT = "Server is busy, please try again later."
+# Status code for semaphore timeout.
+STATUS_CODE_SEM_TIMEOUT = HTTPStatus.REQUEST_TIMEOUT
+# Status code for unexpected errors.
+# This is used when the server encounters an error that is not handled
+STATUS_CODE_EXCEPTION = HTTPStatus.INTERNAL_SERVER_ERROR
 
 
 def get_llm_semaphor() -> asyncio.Semaphore:
@@ -46,9 +55,10 @@ def get_llm(settings: Annotated[Settings, Depends(get_settings)]) -> Llama:
         verbose=settings.logging.verbose,
         seed=settings.seed,
         logits_all=False,
-        embedding=False,
+        embedding=True,
         use_mlock=True,  # Use mlock to prevent memory swapping
         use_mmap=True,  # Use memory-mapped files for faster access
+        chat_format="chatml-function-calling",
     )
     return get_llm._instance
 
@@ -77,18 +87,17 @@ def get_app() -> FastAPI:
 
 
 async def lock_llm_semaphor(
-    req: ChatCompletionRequest,
     sem: Annotated[asyncio.Semaphore, Depends(get_llm_semaphor)],
     settings: Annotated[Settings, Depends(get_settings)],
 ) -> AsyncGenerator[None, None]:
     """Context manager to acquire and release the LLM semaphore with a timeout."""
     try:
-        await asyncio.wait_for(
-            sem.acquire(), timeout=req.wait_timeout or settings.s_timeout
-        )
+        await asyncio.wait_for(sem.acquire(), settings.s_timeout)
         yield None
     except asyncio.TimeoutError:
-        raise HTTPException(status_code=503, detail=DETAIL_SEM_TIMEOUT)
+        raise HTTPException(
+            status_code=STATUS_CODE_SEM_TIMEOUT, detail=DETAIL_SEM_TIMEOUT
+        )
     finally:
         if sem.locked():
             sem.release()
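
One observable effect of the hunk above: when the single-slot semaphore cannot be acquired within `settings.s_timeout`, the endpoint now fails with 408 (Request Timeout) instead of 503. A hypothetical busy-server exchange, using FastAPI's standard error body:

```bash
# Hypothetical: issued while another request holds the LLM semaphore
# for longer than settings.s_timeout.
curl -i -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hi"}]}'
# HTTP/1.1 408 Request Timeout
# {"detail":"Server is busy, please try again later."}
```
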
@@ -98,42 +107,36 @@ async def run_llm_streaming(
     llm: Llama, req: ChatCompletionRequest
 ) -> AsyncGenerator[str, None]:
     """Generator that runs the LLM and yields SSE chunks under lock."""
-    with slm_span(req, is_streaming=True) as (span, messages_for_llm):
-        completion_stream = await asyncio.to_thread(
-            llm.create_chat_completion,
-            messages=messages_for_llm,
-            max_tokens=req.max_tokens,
-            temperature=req.temperature,
-            stream=True,
-        )
+    with slm_span(req, is_streaming=True) as span:
+        try:
+            completion_stream = await asyncio.to_thread(
+                llm.create_chat_completion,
+                **req.model_dump(),
+            )
 
-        # Use traced iterator that automatically handles chunk spans
-        # and parent span updates
-        for chunk in completion_stream:
-            response_model = ChatCompletionStreamResponse.model_validate(chunk)
-            set_atrribute_response_stream(span, response_model)
-            yield f"data: {response_model.model_dump_json()}\n\n"
+            # Use traced iterator that automatically handles chunk spans
+            # and parent span updates
+            chunk: CreateChatCompletionStreamResponse
+            for chunk in completion_stream:
+                set_atrribute_response_stream(span, chunk)
+                yield f"data: {json.dumps(chunk)}\n\n"
 
-        yield "data: [DONE]\n\n"
+            yield "data: [DONE]\n\n"
+        except asyncio.CancelledError:
+            # Handle cancellation gracefully during sse.
+            set_attribute_cancelled(span)
 
 
-async def run_llm_non_streaming(
-    llm: Llama, req: ChatCompletionRequest
-) -> ChatCompletionResponse:
+async def run_llm_non_streaming(llm: Llama, req: ChatCompletionRequest):
     """Runs the LLM for a non-streaming request under lock."""
-    with slm_span(req, is_streaming=False) as (span, messages_for_llm):
+    with slm_span(req, is_streaming=False) as span:
         completion_result = await asyncio.to_thread(
             llm.create_chat_completion,
-            messages=messages_for_llm,
-            max_tokens=req.max_tokens,
-            temperature=req.temperature,
-            stream=False,
+            **req.model_dump(),
         )
+        set_atrribute_response(span, completion_result)
 
-        response_model = ChatCompletionResponse.model_validate(completion_result)
-        set_atrribute_response(span, response_model)
-
-        return response_model
+        return completion_result
 
 
 @app.post("/api/v1/chat/completions")
@@ -156,7 +159,29 @@ async def create_chat_completion(
     except Exception:
         # Catch any other unexpected errors
         error_str = traceback.format_exc()
-        raise HTTPException(status_code=500, detail=error_str)
+        raise HTTPException(status_code=STATUS_CODE_EXCEPTION, detail=error_str)
+
+
+@app.post("/api/v1/embeddings")
+async def create_embeddings(
+    req: EmbeddingRequest,
+    llm: Annotated[Llama, Depends(get_llm)],
+    _: Annotated[None, Depends(lock_llm_semaphor)],
+):
+    """Create embeddings for the given input text(s)."""
+    try:
+        with slm_embedding_span(req) as span:
+            # Use llama-cpp-python's create_embedding method directly
+            embedding_result = await asyncio.to_thread(
+                llm.create_embedding,
+                **req.model_dump(),
+            )
+            # Convert llama-cpp response using model_validate like chat completion
+            set_attribute_response_embedding(span, embedding_result)
+            return embedding_result
+    except Exception:
+        error_str = traceback.format_exc()
+        raise HTTPException(status_code=STATUS_CODE_EXCEPTION, detail=error_str)
 
 
 @app.get("/health")
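
The new `/api/v1/embeddings` route forwards the request body straight into llama-cpp's `create_embedding` via `req.model_dump()`. `EmbeddingRequest` itself is defined in `slm_server/model.py` and is not shown in this diff, so the sketch below assumes it mirrors the OpenAI-style embeddings schema with an `input` field:

```bash
# Sketch only: the exact EmbeddingRequest fields are not part of this diff;
# "input" is assumed here from the OpenAI embeddings schema.
curl -X POST http://localhost:8000/api/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "A small language model server."}'
```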
