nano-paged-attention

a minimal paged attention implementation for understanding the main concepts of the vllm engine.

why paged attention?

a normal kv cache results in -

- memory fragmentation - sequence lengths are rarely equal. frequent allocations and deallocations leave gaps between buffers, and those gaps are often too small to hold a new sequence, so they become unusable.
- over allocation - when you pre-allocate a fixed amount of VRAM per sequence (typically sized for the longest sequence you expect), most of it is underutilized and wasted.
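
a rough sketch of the over allocation point, using made-up numbers. the sequence lengths, `max_len` and `page_size` below are assumptions for illustration, not values taken from this repo. it counts token slots reserved when every sequence gets a `max_len` buffer, versus when each sequence only gets as many fixed-size pages as it needs.

```python
import math

def preallocated_slots(seq_lens, max_len):
    # one contiguous buffer of max_len token slots per sequence
    return len(seq_lens) * max_len

def paged_slots(seq_lens, page_size):
    # each sequence rounds up to a whole number of pages
    return sum(math.ceil(n / page_size) * page_size for n in seq_lens)

seq_lens = [37, 512, 120, 9]   # hypothetical request lengths, in tokens
max_len = 2048                 # hypothetical per-sequence reservation
page_size = 16                 # hypothetical tokens per page

used = sum(seq_lens)
pre = preallocated_slots(seq_lens, max_len)
paged = paged_slots(seq_lens, page_size)

print(f"slots actually used: {used}")
print(f"pre-allocated: {pre} slots, {pre - used} wasted")
print(f"paged: {paged} slots, {paged - used} wasted")
```

with paging, the waste per sequence is bounded by one partially filled page.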

paged attention borrows the idea of paging from operating systems: each sequence's kv cache is broken into small fixed-size pages, which are stored in blocks that do not have to be contiguous in memory. during attention you iterate over all of a sequence's pages and combine the results.
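
a minimal sketch of "iterate over the pages and get the result", in plain python for a single query and a single head. it assumes a running (online) softmax to combine pages, which is one common way to do this and not necessarily what `paged_attention.py` does. the names `attend_paged`, `attend_contiguous` and `to_pages` are made up for this example.

```python
import math

def attend_contiguous(q, keys, values):
    """reference: plain softmax attention for one query over all tokens."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def attend_paged(q, pages):
    """same math, but visiting one page at a time with a running softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf   # largest score seen so far
    denom = 0.0               # running softmax denominator
    acc = None                # running weighted sum of values
    for page_keys, page_values in pages:
        for k, v in zip(page_keys, page_values):
            score = scale * sum(a * b for a, b in zip(q, k))
            new_max = max(running_max, score)
            rescale = math.exp(running_max - new_max)  # 0.0 on the first token
            weight = math.exp(score - new_max)
            if acc is None:
                acc = [0.0] * len(v)
            acc = [rescale * a + weight * x for a, x in zip(acc, v)]
            denom = rescale * denom + weight
            running_max = new_max
    return [a / denom for a in acc]

def to_pages(keys, values, page_size):
    # split a contiguous kv cache into (keys, values) chunks of page_size tokens
    return [(keys[i:i + page_size], values[i:i + page_size])
            for i in range(0, len(keys), page_size)]

q = [0.3, -1.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.1]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

pages = to_pages(keys, values, page_size=2)   # 3 pages, the last one half full
out_paged = attend_paged(q, pages)
out_ref = attend_contiguous(q, keys, values)
```

the two outputs agree up to floating point error, so the attention result does not depend on how the kv cache is split into pages.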

components

page - the page (often called a block) is the smallest allocation unit for the kv cache.
- it stores the kv for a fixed number of tokens, defined by page_size (a token count, not bytes).
- it also has a ref_count that tells you how many sequences are using this particular page. this enables two things -
  1. prefix sharing - a lot of requests start with the same system prompt, so it makes sense not to store a duplicate kv cache for it each time. sequences that share the same prefix tokens can point to the same pages.
  2. decoding - some decoding strategies, like beam search, start from the same tokens but diverge as generation moves forward. in this case multiple sequences share the same initial kv pages and diverge later.
- it lives in physical GPU memory.
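
a small sketch of a page with a ref_count, assuming a plain dataclass. the field names and the `acquire` / `release` helpers are hypothetical and only model the bookkeeping, not the kv tensors on the GPU.

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    page_size: int        # capacity in tokens, not bytes
    num_tokens: int = 0   # token slots currently filled
    ref_count: int = 0    # how many sequences point at this page

    def is_full(self):
        return self.num_tokens == self.page_size

def acquire(page):
    # a sequence starts using this page
    page.ref_count += 1
    return page

def release(page):
    # a sequence stops using this page; True means it can be reused
    page.ref_count -= 1
    return page.ref_count == 0

# two requests with the same system prompt share one full page
system_prompt_page = Page(page_id=0, page_size=16, num_tokens=16)
request_a = [acquire(system_prompt_page)]
request_b = [acquire(system_prompt_page)]
shared_count = system_prompt_page.ref_count

freed_after_a = release(request_a[0])   # request_b still needs the page
freed_after_b = release(request_b[0])   # last user gone, page is free
```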
page_table - the page table maps each logical page to its actual physical page on the GPU.
- each request maintains its own page table.
- the page table gives the illusion that a sequence's pages are contiguous in memory, because the logical pages are ordered and consecutive even when the physical pages are scattered.
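
a tiny sketch of the lookup, assuming the page table is just a list where the index is the logical page and the value is the physical page id. the `locate` helper and the numbers are made up for illustration.

```python
def locate(page_table, position, page_size):
    """translate a token position into (physical_page, offset_in_page)."""
    logical_page, offset = divmod(position, page_size)
    return page_table[logical_page], offset

# logical pages 0, 1, 2 live in physical pages 7, 2, 9
page_table = [7, 2, 9]
page_size = 4

first = locate(page_table, 0, page_size)    # (7, 0)
middle = locate(page_table, 5, page_size)   # (2, 1)
last = locate(page_table, 11, page_size)    # (9, 3)
```

token positions 0 to 11 look contiguous to the sequence, while the kv actually sits in three unrelated physical pages.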
sequence - the sequence represents the user's decoding request. it has -
- token ids
- status (WAITING/RUNNING/FINISHED)
- page table
- current position
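
a sketch of the sequence as a dataclass with the four fields listed above. it assumes "current position" means the number of tokens whose kv is already cached; the `append` method and `eos_id` argument are hypothetical additions to show a status change.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Status(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()

@dataclass
class Sequence:
    token_ids: list
    status: Status = Status.WAITING
    page_table: list = field(default_factory=list)  # physical page ids
    position: int = 0   # assumed: tokens whose kv is already in the cache

    def append(self, token_id, eos_id):
        # record a newly generated token and finish on end-of-sequence
        self.token_ids.append(token_id)
        if token_id == eos_id:
            self.status = Status.FINISHED

seq = Sequence(token_ids=[1, 2, 3])
initial_status = seq.status
seq.status = Status.RUNNING      # picked up for prefill/decoding
seq.append(7, eos_id=0)
seq.append(0, eos_id=0)          # eos ends the request
```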
block_manager - the block manager is the heart of the system. it handles the allocation and deallocation of pages for every sequence.
- handles allocation for the prefill stage.
- does incremental allocation during the decoding phase.
- does the reference counting for each page.
- frees the pages when the sequence is finished.
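
a minimal sketch of the four responsibilities above, tracking only page ids and counts (no tensors). the class layout, method names and the copy-on-write step in `append_token` are assumptions for illustration and may differ from `paged_attention.py`.

```python
import math

class BlockManager:
    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.ref_counts = [0] * num_pages
        self.page_tables = {}   # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> number of cached tokens

    def _take_page(self):
        if not self.free_pages:
            raise MemoryError("out of kv cache pages")
        page = self.free_pages.pop()
        self.ref_counts[page] = 1
        return page

    def allocate_prefill(self, seq_id, num_tokens):
        # prefill: grab every page the prompt needs up front
        needed = math.ceil(num_tokens / self.page_size)
        if needed > len(self.free_pages):
            raise MemoryError("not enough free pages for prefill")
        self.page_tables[seq_id] = [self._take_page() for _ in range(needed)]
        self.lengths[seq_id] = num_tokens

    def fork(self, parent_id, child_id):
        # share all of the parent's pages (prefix sharing / beam search)
        table = list(self.page_tables[parent_id])
        for page in table:
            self.ref_counts[page] += 1
        self.page_tables[child_id] = table
        self.lengths[child_id] = self.lengths[parent_id]

    def append_token(self, seq_id):
        # decode: a new page is needed only when the last one is full
        table = self.page_tables[seq_id]
        if self.lengths[seq_id] % self.page_size == 0:
            table.append(self._take_page())
        elif self.ref_counts[table[-1]] > 1:
            # assumed copy-on-write: do not write into a shared page
            self.ref_counts[table[-1]] -= 1
            table[-1] = self._take_page()
        self.lengths[seq_id] += 1

    def free(self, seq_id):
        # drop references; a page is reusable once nobody points at it
        for page in self.page_tables.pop(seq_id):
            self.ref_counts[page] -= 1
            if self.ref_counts[page] == 0:
                self.free_pages.append(page)
        del self.lengths[seq_id]

bm = BlockManager(num_pages=8, page_size=4)
bm.allocate_prefill("a", 6)   # 6 tokens -> 2 pages
bm.fork("a", "b")             # b shares both pages with a
bm.append_token("b")          # shared half-full page -> b gets its own copy
```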

