Commit d4fef6a

fix: expand stop sequences for Gemma4ChatHandler
- Add `GEMMA4_EOS_TOKEN` and `GEMMA4_STR_TOKEN` to the generation stop criteria.
- Align the stopping logic with the model's `generation_config.json` definitions.
- Prevent potential over-generation by ensuring the model halts correctly at standard EOS or when initiating a tool response.

Signed-off-by: JamePeng <jame_peng@sina.com>
1 parent d7478de commit d4fef6a

1 file changed

Lines changed: 2 additions & 1 deletion

File tree

llama_cpp/llama_chat_format.py

@@ -4656,7 +4656,8 @@ def __call__(self, **kwargs):
         self.extra_template_arguments["enable_thinking"] = self.enable_thinking

         # Set the stop token based on Gemma 4's format (<turn|>)
-        kwargs['stop'] = [self.GEMMA4_EOT_TOKEN]
+        # generation_config.json: "eos_token_id": [ 1, 106, 50]
+        kwargs['stop'] = [self.GEMMA4_EOS_TOKEN, self.GEMMA4_EOT_TOKEN, self.GEMMA4_STR_TOKEN]

         if self.verbose:
             print(f"{self.log_prefix}(enable_thinking={self.enable_thinking}) - Start processing")
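The stop-sequence semantics this patch relies on can be illustrated with a small standalone sketch: generation halts at the earliest occurrence of any string in the stop list, so adding the EOS and tool-response tokens alongside the end-of-turn token guarantees the earliest applicable cutoff wins. The token names below mirror the patch, but the string values and the `truncate_at_stop` helper are illustrative assumptions, not part of llama-cpp-python.

```python
# Illustrative token values (assumptions; the real constants live on the handler class).
GEMMA4_EOS_TOKEN = "<eos>"
GEMMA4_EOT_TOKEN = "<end_of_turn>"
GEMMA4_STR_TOKEN = "<start_of_tool_response>"


def truncate_at_stop(text: str, stop: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence, if present."""
    # Collect the index of each stop string that actually appears,
    # then cut at the smallest (earliest) one.
    cut = min((i for s in stop if (i := text.find(s)) != -1), default=-1)
    return text if cut == -1 else text[:cut]


stops = [GEMMA4_EOS_TOKEN, GEMMA4_EOT_TOKEN, GEMMA4_STR_TOKEN]
raw = "The answer is 42.<end_of_turn>stale continuation<eos>"
print(truncate_at_stop(raw, stops))  # -> The answer is 42.
```

With only `GEMMA4_EOT_TOKEN` in the list (the pre-patch behavior), output ending in a bare EOS or a tool-response marker would not be cut, which is the over-generation the commit message describes.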
