Optimize _byte_pair_merge to O(m log n) using heap-based candidate selection #442
Open
ArchishmanSengupta wants to merge 3 commits into openai:main from
Author: @hauntsaninja, can you take a look at this?
antimora added a commit to antimora/wordchipper that referenced this pull request Feb 22, 2026

Replace MergeHeapSpanEncoder internals with a true min-heap + linked-list algorithm.

The old implementation did O(n) linear scans per merge round (same as IncrementalSweepSpanEncoder). The new approach uses a BinaryHeap for O(log n) min-finding and a doubly-linked list for O(1) token removal, giving O(m log n) total vs O(m*n). Uses lazy deletion via per-position generation counters to avoid expensive heap removal.

This is the same approach as tiktoken PR #442 (openai/tiktoken#442), though that targets an O(n log n) algorithm which is more complex.

Wire MergeHeapSpanEncoder as the default encoder.

Benchmark results (median MB/s, single-threaded):
  English cl100k: 90 -> 123 MB/s (+37%)
  English o200k: 87 -> 111 MB/s (+28%)
  Diverse cl100k: 57 -> 92 MB/s (+61%)
  Diverse o200k: 29 -> 84 MB/s (+190%)

Multilingual slowdown ratio (diverse vs english):
  cl100k: 37% -> 25%
  o200k: 67% -> 25%

Fixes zspacelabs#173
antimora added a commit to antimora/wordchipper that referenced this pull request Feb 22, 2026

Add a new HybridSpanEncoder that picks the best merge strategy per span:
- Short spans (<= 16 bytes): inline linear sweep, same as IncrementalSweepSpanEncoder
- Long spans (> 16 bytes): min-heap + doubly-linked list for O(m log n) merging

The heap path uses a BinaryHeap for O(log n) min-finding, index arrays for O(1) linked-list removal, and generation counters for lazy staleness detection. Similar approach to openai/tiktoken#442, though that targets an O(n log n) algorithm which is more complex.

Wire HybridSpanEncoder as the default. Existing IncrementalSweepSpanEncoder and MergeHeapSpanEncoder are preserved for side-by-side comparison.

Benchmark results (median MB/s, single-threaded divan):
  English cl100k: 126 -> 124 MB/s (within noise)
  English o200k: 120 -> 118 MB/s (within noise)
  Diverse cl100k: 93 -> 91 MB/s (within noise)
  Diverse o200k: 83 -> 80 MB/s (within noise)
  sample-timer: 961 -> 990 MB/s (+3%)

The big win is vs the issue zspacelabs#173 baseline where diverse o200k was 29 MB/s.

Addresses zspacelabs#173
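For readers unfamiliar with the baseline these commit messages compare against, here is a minimal Python sketch of a linear-sweep merge and a length-based dispatcher. It is an illustration only, not the wordchipper code (which is Rust): the names `sweep_merge`, `hybrid_merge`, and `long_path` are hypothetical, the rank table is assumed to be a plain dict from byte strings to ranks, and only the 16-byte threshold comes from the commit message.

```python
from typing import Callable, Dict, List

Ranks = Dict[bytes, int]

# Threshold quoted in the commit message; everything else here is an assumption.
SHORT_SPAN_BYTES = 16


def sweep_merge(piece: bytes, ranks: Ranks) -> List[bytes]:
    """Baseline: rescan every adjacent pair on each merge round (O(m*n) scans)."""
    parts = [piece[i:i + 1] for i in range(len(piece))]
    while len(parts) > 1:
        best_rank, best_i = None, -1
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            # Strict "<" keeps the leftmost pair when ranks tie.
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_rank is None:
            break  # no mergeable pair is left
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return parts


def hybrid_merge(
    piece: bytes,
    ranks: Ranks,
    long_path: Callable[[bytes, Ranks], List[bytes]],
) -> List[bytes]:
    """Use the sweep for short spans and a caller-supplied strategy otherwise."""
    if len(piece) <= SHORT_SPAN_BYTES:
        return sweep_merge(piece, ranks)
    return long_path(piece, ranks)
```

The dispatcher takes the long-span strategy as a parameter so that the sketch stays self-contained; in practice that slot would hold a heap-based merge.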
Summary
-> Replaced the `O(m·n)` sequential merge scan with a heap-driven algorithm that maintains candidate merges in a max-heap keyed by rank, updating only local neighbors on each merge.
-> This yields `O(m·log n)` behavior, where `m` is the number of merges and `n` is the number of initial symbols.

Key changes:
- … `_byte_pair_merge`, maintaining a linked list of live nodes and per-position versions to avoid stale heap entries.
- … `compute_rank_at` and updates only affected neighbors after each merge.
- … `_byte_pair_merge` boundaries.

Complexity:
- Before: repeated linear scans → approximately `O(m·n)` in the worst case.
- After: heap operations per merge → `O(m·log n)`, with `O(n)` initialization.
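The technique described in this summary (a heap of candidate pairs, a linked list of live nodes, and per-position versions for lazy deletion) can be sketched in Python as follows. This is not the PR's Rust code: `heap_merge` and its helper names are hypothetical, the rank table is assumed to be a dict from byte strings to ranks, and Python's `heapq` is a min-heap, so the lowest-rank pair is popped first. Whatever heap orientation the PR uses, the pair merged first is assumed to be the one with the lowest rank.

```python
import heapq
from typing import Dict, List


def heap_merge(piece: bytes, ranks: Dict[bytes, int]) -> List[bytes]:
    n = len(piece)
    if n == 0:
        return []
    # Node i covers piece[i:end[i]]; prv/nxt form a doubly linked list (-1 = none).
    end = list(range(1, n + 1))
    prv = list(range(-1, n - 1))
    nxt = list(range(1, n)) + [-1]
    alive = [True] * n
    # version[i] changes whenever the pair that starts at node i changes,
    # which makes older heap entries for i detectably stale.
    version = [0] * n

    # O(n) initialization: collect all ranked adjacent pairs, then heapify.
    heap = []
    for i in range(n - 1):
        rank = ranks.get(piece[i:i + 2])
        if rank is not None:
            heap.append((rank, i, 0))
    heapq.heapify(heap)

    def refresh(i: int) -> None:
        """Invalidate old entries for node i and push its current pair, if ranked."""
        version[i] += 1
        j = nxt[i]
        if j != -1:
            rank = ranks.get(piece[i:end[j]])
            if rank is not None:
                heapq.heappush(heap, (rank, i, version[i]))

    while heap:
        _, i, ver = heapq.heappop(heap)
        if not alive[i] or ver != version[i]:
            continue  # stale entry: skip it instead of removing it eagerly
        j = nxt[i]
        # Absorb the right neighbor j into i and unlink j from the list.
        end[i] = end[j]
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] != -1:
            prv[nxt[j]] = i
        # Only the two pairs that touch the merged node need new ranks.
        refresh(i)
        if prv[i] != -1:
            refresh(prv[i])

    out, i = [], 0  # node 0 is never absorbed, so it is always the list head
    while i != -1:
        out.append(piece[i:end[i]])
        i = nxt[i]
    return out
```

Heap entries are `(rank, position, version)` tuples, so ties on rank resolve to the leftmost pair, matching a left-to-right sweep. Each merge pushes at most two entries, which bounds the heap work at `O(log n)` per merge; the byte-slice lookups in this sketch add a cost proportional to token length that a production implementation may handle differently.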