
Commit 770e3bf

Implement safe MLX KV reuse with scoped GPU cache eviction
1 parent 7b311b1

File tree

4 files changed: +730 −28 lines

README.md

Lines changed: 26 additions & 0 deletions
@@ -488,6 +488,32 @@ let response = try await session.respond {
 }
 ```
 
+You can tune MLX KV-cache behavior per request with model-specific options:
+
+```swift
+var options = GenerationOptions(temperature: 0.7)
+options[custom: MLXLanguageModel.self] = .init(
+    maxKVSize: 4096,
+    kvBits: 4,
+    kvGroupSize: 64,
+    quantizedKVStart: 128
+)
+
+let response = try await session.respond(
+    to: "Summarize this transcript",
+    options: options
+)
+```
+
+GPU cache behavior can be configured when creating the model:
+
+```swift
+let model = MLXLanguageModel(
+    modelId: "mlx-community/Qwen3-0.6B-4bit",
+    gpuMemory: .automatic
+)
+```
+
 Vision support depends on the specific MLX model you load.
 Use a vision‑capable model for multimodal prompts
 (for example, a VLM variant).
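As a rough illustration of what the `kvBits`/`kvGroupSize` options in the diff above trade off: group-wise quantization stores `kvBits` bits per cache element plus a small per-group overhead for scale and bias. The sketch below is back-of-envelope arithmetic only — it assumes an FP16 baseline and one FP16 scale plus bias per group, and the function name and formula are illustrative, not MLX's exact accounting.

```swift
// Rough KV-cache footprint per token per layer, in bytes.
// Assumptions (not from this commit): FP16 baseline (2 bytes/element),
// one FP16 scale + one FP16 bias per quantization group (4 bytes/group).
func kvBytesPerToken(headDim: Int, kvHeads: Int, kvBits: Int?, kvGroupSize: Int) -> Double {
    let elements = Double(headDim * kvHeads * 2)  // keys + values
    guard let bits = kvBits else {
        return elements * 2.0  // unquantized FP16 cache
    }
    let dataBytes = elements * Double(bits) / 8.0
    let groupOverhead = (elements / Double(kvGroupSize)) * 4.0
    return dataBytes + groupOverhead
}

// Example: 128-dim heads, 8 KV heads, comparing FP16 vs. 4-bit, group size 64.
let fp16 = kvBytesPerToken(headDim: 128, kvHeads: 8, kvBits: nil, kvGroupSize: 64)
let q4 = kvBytesPerToken(headDim: 128, kvHeads: 8, kvBits: 4, kvGroupSize: 64)
print(fp16, q4)  // 4096.0 1152.0 — roughly a 3.5x reduction
```

With `quantizedKVStart: 128`, the first 128 tokens would stay in full precision, so the real savings only approach this ratio for long contexts.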
