Conversation

@danbev danbev (Member) commented Jan 15, 2026

This commit adds write/read support for backend sampling state similar
to how the logits and embedding buffers are handled.

The motivation is to allow the backend sampling state to be saved and
restored along with the rest of the llama_context state.
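
For illustration only, here is a minimal sketch of the size-prefixed pattern that the logits and embedding buffers follow; `state_writer` and `write_sampling_probs` are hypothetical stand-ins, not the actual llama.cpp internals:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical writer standing in for llama.cpp's internal state I/O
// abstraction; the real code uses its own reader/writer interface.
struct state_writer {
    std::vector<uint8_t> buf;

    void write(const void * src, size_t n) {
        const uint8_t * p = static_cast<const uint8_t *>(src);
        buf.insert(buf.end(), p, p + n);
    }
};

// Size-prefixed dump of a sampling buffer, mirroring the pattern used
// for the logits and embedding buffers: element count first, raw data after.
static void write_sampling_probs(state_writer & io, const std::vector<float> & probs) {
    const uint64_t n = probs.size();
    io.write(&n, sizeof(n));
    if (n > 0) {
        io.write(probs.data(), n * sizeof(float));
    }
}
```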


This commit builds upon #18811, which is included as the first commit in this PR. I'll rebase and remove it once it has been reviewed and merged.

This commit updates output_reserve in llama-context.cpp to always
allocate sampling buffers regardless of whether sampling is needed for
the current batch.

The motivation for this is to avoid reallocations and branching based on
the sampling requirements of the batch.
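
As a rough sketch of the allocate-unconditionally approach (the names here are hypothetical, not the actual output_reserve code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: reserve the sampling output buffers once for the
// maximum number of outputs, so later batches never branch or reallocate
// based on their individual sampling requirements.
struct sampling_buffers {
    std::vector<float>   probs;   // per-output sampled probabilities
    std::vector<int32_t> sampled; // per-output sampled token ids

    void reserve(size_t n_outputs_max) {
        // always size for the worst case, even if the current batch
        // does not request sampling
        probs.resize(n_outputs_max);
        sampled.resize(n_outputs_max);
    }
};
```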
@github-actions github-actions bot added the testing Everything test related label Jan 15, 2026
@ggerganov ggerganov (Member) commented:

Initially, I was thinking that since the samplers can now be part of the context state, we should store this information as well. But it would also have to include the sampler states, and that gets very complicated.

But now I am wondering if we should instead remove the output ids, the logits and the embeddings from the state and only store the model info and the memory. I can't think of a meaningful use case for storing this information. And even if it's needed, one can simply run the last token through llama_decode to obtain the necessary logits/embeddings. So maybe this is the better option, as it will simplify the read/write logic.
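
A rough sketch of that workaround against the public llama.cpp API (assuming the current two-argument llama_batch_get_one; KV-cache position handling is elided):

```cpp
#include "llama.h"

// Sketch of the suggested workaround: instead of persisting
// logits/embeddings in the state blob, re-run the last token through
// llama_decode after restoring and read the regenerated outputs.
// Note: in practice the KV-cache entry at this position may need to be
// cleared first so the re-decoded token is not duplicated.
static const float * regen_logits(llama_context * ctx, llama_token last_token) {
    llama_batch batch = llama_batch_get_one(&last_token, 1);
    if (llama_decode(ctx, batch) != 0) {
        return nullptr; // decode failed
    }
    return llama_get_logits(ctx);
}
```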
