
Commit e7e1d48

Bump version to 0.3.32
1 parent 3e41921 commit e7e1d48

2 files changed: 34 additions & 1 deletion

CHANGELOG.md: 33 additions & 0 deletions

@@ -7,6 +7,39 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

## [0.3.32] Hybrid/Multimodal Model Single-Turn Optimizations & Sampling Seed Fix

- perf(hybrid): optimize multimodal single-turn and fix KV clear bug
  - Added a 100%-match "FAST PATH" in `Llama.generate` that bypasses the N-1 truncation for hybrid models when caching is disabled (a sketch follows this list).
  - Fixed a bug where failed rollbacks on disabled caches would wipe the KV cache, causing multimodal pseudo-token crashes.
  - Updated `MTMDChatHandler` to suppress cache-related logs and anchoring logic when `max_checkpoints <= 0`.
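
A minimal, self-contained sketch of the fast-path decision described above. The helper names here are illustrative stand-ins, not the actual private code inside `Llama.generate`:

```python
# Hedged illustration of the "FAST PATH": when hybrid caching is disabled
# and the incoming prompt matches the already-evaluated tokens exactly,
# the N-1 truncation (used to re-anchor checkpoints) can be skipped.

def longest_token_prefix(cached: list[int], incoming: list[int]) -> int:
    """Length of the shared prefix between cached and incoming tokens."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

def should_truncate(cached: list[int], incoming: list[int],
                    caching_enabled: bool) -> bool:
    # 100% match + caching disabled => take the fast path (no truncation).
    if not caching_enabled and longest_token_prefix(cached, incoming) == len(incoming):
        return False
    return True
```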
- perf(hybrid): prevent expensive array slicing when cache is disabled
  - Added a `max_checkpoints > 0` check to the `finally` block of the generation loop.
  - Previously, even though the underlying C++ state extraction was bypassed, the Python layer still executed `self._input_ids[:self.n_tokens].tolist()`. For long contexts, slicing this large array and converting it to a Python list caused unnecessary CPU overhead and garbage-collection (GC) pressure. The new check acts as a second, Python-level layer of isolation, ensuring zero memory allocation and zero overhead for hybrid models running in single-turn mode (a sketch of the guard follows this list).
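
A sketch of the described guard under a simplified loop structure. Only `max_checkpoints`, `_input_ids`, and `n_tokens` are names taken from this changelog; the surrounding class is illustrative:

```python
import numpy as np

class GenerationLoopSketch:
    """Illustrative stand-in for the real generation loop."""

    def __init__(self, max_checkpoints: int, n_ctx: int = 32_768):
        self.max_checkpoints = max_checkpoints
        self._input_ids = np.zeros(n_ctx, dtype=np.intc)
        self.n_tokens = 0

    def decode_step(self) -> None:
        try:
            self.n_tokens += 1  # pretend one token was decoded
        finally:
            # The added guard: skip the O(n) slice + tolist() entirely when
            # checkpointing is disabled, so single-turn hybrid runs pay no
            # allocation or GC cost here.
            if self.max_checkpoints > 0:
                tokens = self._input_ids[: self.n_tokens].tolist()
                # ...hand `tokens` to HybridCheckpointCache here...
```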
- perf(hybrid): bypass the N-1 evaluation split when `max_checkpoints` is 0
  - Prevents fragmenting prompt evaluation into `len(tokens)-1` and `1` token batches when hybrid caching is disabled.
  - Lets the underlying C++ engine process the entire prompt in a single, efficient batch for single-turn workflows (see the sketch below).
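
A hedged sketch of the batching decision; `eval_batch` is a hypothetical stand-in for the engine's batch-evaluation call, not an actual API:

```python
from typing import Callable

def eval_prompt(tokens: list[int],
                caching_enabled: bool,
                eval_batch: Callable[[list[int]], None]) -> None:
    if not caching_enabled:
        # Single-turn: feed the whole prompt as one batch so the C++
        # engine can process it in a single efficient pass.
        eval_batch(tokens)
    else:
        # Caching: hold back the final token so a rollback checkpoint can
        # be anchored just before it (the len(tokens)-1 / 1 split).
        eval_batch(tokens[:-1])
        eval_batch(tokens[-1:])
```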
- perf(hybrid): eliminate PCIe I/O latency for single-turn workflows
  - Introduces critical performance optimizations and log-tracing improvements for `HybridCheckpointCache` in single-turn workflows (e.g., ComfyUI or single-turn conversation mode):
  - `HybridCheckpointCache` can now be disabled entirely for single-turn conversations (set `ctx_checkpoints=0` at `Llama` init; a usage sketch follows this list).
  - Added early-exit intercepts for `max_checkpoints <= 0` in `save_checkpoint` and `find_best_checkpoint`. This prevents massive (e.g., 150 MB+) synchronous VRAM-to-RAM state extractions over the PCIe bus when rollback capabilities are disabled, eliminating a ~3-second blocking delay at the end of generation.
  - Added a non-empty check in `clear()` to prevent log spam when the cache is already empty or disabled.
  - Standardized logging prefixes (e.g., `HybridCheckpointCache(save_checkpoint)`) for better observability.
  - Fixed a potential `UnicodeEncodeError` in warning logs by replacing a non-standard arrow character with standard ASCII (`->`).
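
A usage sketch, assuming `ctx_checkpoints` is accepted as a `Llama` constructor argument in this fork as the entry above states; the model path and other arguments are placeholders:

```python
from llama_cpp import Llama

# Single-turn workflow (e.g., a ComfyUI node): disable hybrid
# checkpointing so no VRAM-to-RAM state extraction happens over PCIe.
llm = Llama(
    model_path="./models/model.gguf",  # placeholder path
    n_ctx=8192,
    ctx_checkpoints=0,  # fork-specific: 0 disables HybridCheckpointCache
)

out = llm.create_completion("Describe the image in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```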
- fix(sampling): pass seed to sampling context and remove global mutation
  - Added a `seed` parameter to the `generate` and `sample` method signatures.
  - The resolved seed is passed directly to `LlamaSamplingParams` so the underlying C++ sampler actually uses it.
  - Removed thread-unsafe `self.set_seed()` calls in `_create_completion` to prevent global state pollution during concurrent requests (usage sketch below).
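
A hedged sketch of the new per-call seeding. The diff does not show the exact `generate`/`sample` signatures, so treat the `seed` keyword placement as an assumption:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/model.gguf")  # placeholder path

# Per-request seed, forwarded to LlamaSamplingParams instead of mutating
# global RNG state via set_seed(); intended to be safe under concurrency.
tokens = llm.tokenize(b"Once upon a time")
for i, tok in enumerate(llm.generate(tokens, temp=0.8, seed=42)):
    if tok == llm.token_eos() or i >= 32:
        break
    print(llm.detokenize([tok]).decode("utf-8", errors="ignore"), end="")
```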
- docs(issue-template): modernize bug report for efficiency
  - Completely revamped the legacy bug-report template to streamline troubleshooting: added an anti-AI-spam policy, a detailed OS/hardware matrix, mandatory `verbose=True` logging with code examples, and new sections for model parameters and AI-assisted brainstorming.
- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/b283f6d5b3d2d079019ae5ed3cbbdb4b3be03b25](https://github.com/ggml-org/llama.cpp/commit/b283f6d5b3d2d079019ae5ed3cbbdb4b3be03b25)

## [0.3.31] Omni-Modal Media Pipeline, Hybrid 1-Token Rollback and Enhanced Logging

- refactor(mtmd): introduce omni-modal media pipeline with experimental audio support

llama_cpp/__init__.py: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
  from .llama_cpp import *
  from .llama import *

- __version__ = "0.3.31"
+ __version__ = "0.3.32"
