and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.3.32] Hybrid/Multimodal Model Single-Turn Optimizations & Fix Sampling Seed

- perf(hybrid): optimize multimodal single-turn and fix KV clear bug
  - Added a 100% match "FAST PATH" in `Llama.generate` to bypass the N-1 truncation for hybrid models when caching is disabled.
  - Fixed a bug where failed rollbacks on disabled caches would wipe the KV cache, causing multimodal pseudo-token crashes.
  - Updated `MTMDChatHandler` to suppress cache-related logs and anchoring logic when `max_checkpoints <= 0`.

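The "100% match FAST PATH" decision above can be sketched as follows. This is a hedged illustration, not the library's actual code: the helper name `tokens_to_evaluate` and the `cache_disabled` flag are assumptions made for the example.

```python
def tokens_to_evaluate(cached_tokens, prompt_tokens, cache_disabled):
    """Return the suffix of `prompt_tokens` that still needs evaluation."""
    # Length of the longest shared prefix between cached and new tokens.
    n_match = 0
    for a, b in zip(cached_tokens, prompt_tokens):
        if a != b:
            break
        n_match += 1
    full_match = n_match == len(prompt_tokens) == len(cached_tokens)
    if full_match and cache_disabled:
        # FAST PATH: the state already matches exactly; skip the
        # N-1 truncation and reuse it as-is.
        return []
    if full_match:
        # Classic path: re-evaluate the last token so fresh logits exist.
        return prompt_tokens[-1:]
    return prompt_tokens[n_match:]
```

With a full match and caching disabled the function evaluates nothing; with caching enabled it falls back to re-evaluating only the final token.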
- perf(hybrid): prevent expensive array slicing when cache is disabled
  - Added a `max_checkpoints > 0` check to the `finally` block of the generation loop.
  - Previously, even though the underlying C++ state extraction was bypassed, the Python layer still executed `self._input_ids[:self.n_tokens].tolist()`. For long contexts, slicing this large array and converting it to a Python list caused unnecessary CPU overhead and garbage-collection (GC) pressure. The new check acts as a second layer of isolation, so hybrid models running in single-turn mode skip the allocation entirely.

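A minimal sketch of the guarded `finally` block described above. The class and the surrounding loop are assumptions made for illustration; the stdlib `array` stands in for the real `_input_ids` buffer (a NumPy array in the actual codebase), since both expose `.tolist()`.

```python
from array import array


class LoopSketch:
    """Illustrates skipping the O(n) slice + tolist() when checkpointing is off."""

    def __init__(self, max_checkpoints, n_tokens=8):
        self.max_checkpoints = max_checkpoints
        self._input_ids = array("i", range(n_tokens))
        self.n_tokens = n_tokens
        self.snapshots_taken = 0

    def save_checkpoint(self, token_list):
        self.snapshots_taken += 1

    def generate_step(self):
        try:
            pass  # one decode step would run here in the real loop
        finally:
            # The guard: with max_checkpoints <= 0, the slice and the
            # Python-list conversion never execute at all.
            if self.max_checkpoints > 0:
                self.save_checkpoint(self._input_ids[: self.n_tokens].tolist())
```

With `max_checkpoints=0` the `finally` block becomes a no-op, which is exactly the double-layer isolation the entry describes.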
- perf(hybrid): bypass N-1 evaluation split if max_checkpoints is 0
  - Prevents fragmenting the prompt evaluation into `len(tokens)-1` and `1` when hybrid caching is disabled.
  - Allows the underlying C++ engine to process the entire prompt in a single, efficient batch for single-turn workflows.

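The batching decision can be sketched as below. The helper name is an assumption, and the example only models the split sizes the entry describes, not the actual evaluation call.

```python
def plan_prompt_batches(tokens, max_checkpoints):
    """Return the list of batches the prompt will be evaluated in."""
    if max_checkpoints <= 0 or len(tokens) < 2:
        # Caching disabled (or trivial prompt): one single, efficient
        # batch covering the whole prompt.
        return [tokens]
    # Hybrid caching enabled: the evaluation is fragmented into
    # len(tokens)-1 and 1, as required by the caching path.
    return [tokens[:-1], tokens[-1:]]
```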
- perf(hybrid): eliminate PCIe I/O latency for single-turn workflows
  - Introduces performance optimizations and log-tracing improvements for `HybridCheckpointCache` in single-turn workflows (e.g., ComfyUI or single-turn conversation mode):
  - `HybridCheckpointCache` can now be disabled entirely for single-turn conversations (set `ctx_checkpoints=0` at `Llama` init).
  - Added early-exit intercepts for `max_checkpoints <= 0` in `save_checkpoint` and `find_best_checkpoint`. This prevents massive (e.g., 150 MB+) synchronous VRAM-to-RAM state extractions over the PCIe bus when rollback capabilities are disabled, eliminating a ~3-second blocking delay at the end of generation.
  - Added a non-empty check in `clear()` to prevent log spam when the cache is already empty or disabled.
  - Standardized logging prefixes (e.g., `HybridCheckpointCache(save_checkpoint)`) for better observability.
  - Fixed a potential `UnicodeEncodeError` in warning logs by replacing a non-standard arrow character with standard ASCII (`->`).

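The early-exit intercepts can be sketched with a hedged mock of the cache. This is not the real `HybridCheckpointCache`; `_extract_state` merely stands in for the expensive VRAM-to-RAM copy over PCIe, and method signatures are assumptions.

```python
class CheckpointCacheSketch:
    """Mock of the max_checkpoints <= 0 early-exit intercepts."""

    def __init__(self, max_checkpoints):
        self.max_checkpoints = max_checkpoints
        self._checkpoints = []  # (n_tokens, state) pairs
        self.extractions = 0

    def _extract_state(self):
        # Real code: a large (150 MB+) synchronous VRAM -> RAM copy.
        self.extractions += 1
        return b"state"

    def save_checkpoint(self, n_tokens):
        if self.max_checkpoints <= 0:
            return  # early exit: no state extraction, no blocking delay
        self._checkpoints.append((n_tokens, self._extract_state()))
        # Evict checkpoints beyond the configured limit.
        del self._checkpoints[: -self.max_checkpoints]

    def find_best_checkpoint(self, target_n_tokens):
        if self.max_checkpoints <= 0:
            return None  # early exit: nothing to search
        candidates = [c for c in self._checkpoints if c[0] <= target_n_tokens]
        return max(candidates, default=None)

    def clear(self):
        if not self._checkpoints:
            return  # non-empty check: skip work (and logs) when empty
        self._checkpoints.clear()
```

With the cache disabled, no extraction ever runs, which is where the ~3-second end-of-generation stall disappears.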
- fix(sampling): pass seed to sampling context and remove global mutation
  - Added a `seed` parameter to the `generate` and `sample` method signatures.
  - Pass the resolved seed directly to `LlamaSamplingParams` to ensure the underlying C++ sampler uses it.
  - Removed thread-unsafe `self.set_seed()` calls in `_create_completion` to prevent global-state pollution during concurrent requests.

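The per-request seed plumbing can be sketched as follows. The dataclass is a stand-in for `LlamaSamplingParams`, and `resolve_seed`/`build_sampling_params` are hypothetical helper names introduced only for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class SamplingParamsSketch:
    """Stand-in for LlamaSamplingParams; only the seed field is shown."""
    seed: int


def resolve_seed(seed=None):
    # A per-request seed replaces the old global set_seed() mutation,
    # so concurrent requests cannot clobber each other's RNG state.
    if seed is None or seed < 0:
        return random.getrandbits(31)  # fresh seed for this request only
    return seed


def build_sampling_params(seed=None):
    # The resolved seed travels with the request's sampling params
    # straight to the C++ sampler, never through shared global state.
    return SamplingParamsSketch(seed=resolve_seed(seed))
```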
- docs(issue-template): modernize bug report for efficiency
  - Completely revamped the legacy bug report template to streamline troubleshooting: added an anti-AI-spam policy, a detailed OS/hardware matrix, mandatory `verbose=True` logging with code examples, and new sections for model parameters and AI-assisted brainstorming.
- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/b283f6d5b3d2d079019ae5ed3cbbdb4b3be03b25](https://github.com/ggml-org/llama.cpp/commit/b283f6d5b3d2d079019ae5ed3cbbdb4b3be03b25)

## [0.3.31] Omni-Modal Media Pipeline, Hybrid 1-Token Rollback and Enhanced Logging

- refactor(mtmd): introduce omni-modal media pipeline with experimental audio support