Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
I am running LiquidAI's LFM2.5-1.2B-Instruct model. Calling create_chat_completion() multiple times should not throw an error.
Current Behavior
When create_chat_completion is called twice in a row, the second call fails with the error "llama_decode: failed to decode, ret = -1".
Environment and Context
I am using llama-cpp-python v0.3.16. Python version is 3.13.9.
Failure Information (for bugs)
On tracing back the issue, it looks like the context and KV cache need to be reset between create_chat_completion calls, which isn't happening yet, so the second call tries to continue decoding from the previous call's cached positions.
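Until this is addressed in the library, resetting before each completion works for me. Below is a minimal wrapper sketch (the helper name is mine; it only assumes the object exposes reset() and create_chat_completion(), which llama_cpp.Llama does):

```python
def chat_with_reset(llm, messages, **kwargs):
    """Clear the context/KV cache before each completion so decoding
    restarts from position 0 instead of continuing from the previous call."""
    llm.reset()  # llama_cpp.Llama.reset() rewinds the stored token positions
    return llm.create_chat_completion(messages=messages, **kwargs)
```

With this wrapper, repeated calls no longer hit the stale-cache state that triggers the decode failure.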
Steps to Reproduce
from pathlib import Path
from llama_cpp import Llama

llm = Llama(
    model_path=str(Path.home() / "AppData/Local/llama.cpp/LiquidAI_LFM2.5-1.2B-Instruct-GGUF_LFM2.5-1.2B-Instruct-Q4_K_M.gguf"),
    n_ctx=1000,
)

system_prompt = """
\nYou are a helpful assistant
"""
prompt = """
suggest me places to visit during winter season
"""

response = llm.create_chat_completion(
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": prompt}],
)
print(response)

# llm.reset()                 # Uncommenting this works
# llm._ctx.kv_cache_clear()   # Uncommenting this also works

response = llm.create_chat_completion(
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": prompt}],
)
print(response)
Failure Logs
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
- the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 519
- the tokens for sequence 0 in the input batch have a starting position of Y = 29
it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
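For reference, the log above boils down to the following consistency check (a simplified sketch of what the log describes, not llama.cpp's actual code): the first position in the new batch must directly follow the last position already stored in the KV cache.

```python
def positions_consecutive(last_cached_pos: int, batch_start_pos: int) -> bool:
    # "it is required that the sequence positions remain consecutive: Y = X + 1"
    return batch_start_pos == last_cached_pos + 1

print(positions_consecutive(519, 29))   # False: the failing case from the log (X=519, Y=29)
print(positions_consecutive(519, 520))  # True: what a valid continuation batch would need
```

After a reset, the cache is empty and the batch starts from position 0 again, so the check never fails.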