Optimize _byte_pair_merge to O(m log n) using heap-based candidate selection #442

Open

ArchishmanSengupta wants to merge 3 commits into openai:main from ArchishmanSengupta:enchancement/mlogn

Conversation

@ArchishmanSengupta

Summary

- Replaced the O(m·n) sequential merge scan with a heap-driven algorithm that keeps candidate merges in a heap ordered by rank (lowest rank popped first), updating only the local neighbors after each merge.

- This yields O(m·log n) behavior, where:
  m: the number of merges performed
  n: the number of initial symbols

Key changes:

  1. Added a BinaryHeap and a candidate struct to _byte_pair_merge, maintaining a linked list of live nodes and per-position version counters to avoid acting on stale heap entries.
  2. Computes local ranks via compute_rank_at and updates only the affected neighbors after each merge.
  3. Added targeted unit tests for _byte_pair_merge boundary cases.

Complexity:

Before: repeated linear scans → O(m·n) in the worst case.
After: heap operations per merge → O(m·log n), with O(n) initialization.
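The approach described above (a linked list over positions, a heap ordered by rank, and per-position version counters for lazy deletion of stale entries) can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the function name `byte_pair_merge_sketch` and the `get_rank(lo, hi)` byte-range signature are assumptions made to keep the example self-contained.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge adjacent pieces in rank order until no mergeable pair remains.
/// `get_rank(lo, hi)` returns the rank of the byte range [lo, hi) if it is
/// in the vocabulary. Returns the surviving (start, end) spans.
fn byte_pair_merge_sketch<F>(n: usize, get_rank: F) -> Vec<(usize, usize)>
where
    F: Fn(usize, usize) -> Option<usize>,
{
    // Doubly-linked list over positions 0..n; usize::MAX marks "no predecessor".
    let mut prev: Vec<usize> = (0..n).map(|i| i.wrapping_sub(1)).collect();
    let mut next: Vec<usize> = (1..=n).collect();
    let mut alive = vec![true; n];
    // Version counter per position: heap entries carrying an old version are stale.
    let mut version = vec![0usize; n];

    // Min-heap (via Reverse) keyed by (rank, position, version).
    let mut heap: BinaryHeap<Reverse<(usize, usize, usize)>> = BinaryHeap::new();
    for i in 0..n.saturating_sub(1) {
        if let Some(r) = get_rank(i, i + 2) {
            heap.push(Reverse((r, i, 0)));
        }
    }

    while let Some(Reverse((_rank, i, v))) = heap.pop() {
        if !alive[i] || version[i] != v {
            continue; // lazy deletion: entry is stale or its position is dead
        }
        let j = next[i];
        if j >= n {
            continue;
        }
        // Merge: absorb the piece at j into the piece at i.
        alive[j] = false;
        next[i] = next[j];
        if next[i] < n {
            prev[next[i]] = i;
        }
        // Refresh the pair (i, successor of i) under a new version.
        version[i] += 1;
        let k = next[i];
        if k < n {
            if let Some(r) = get_rank(i, next[k]) {
                heap.push(Reverse((r, i, version[i])));
            }
        }
        // Refresh the pair (predecessor of i, i): its byte range also changed.
        let p = prev[i];
        if p < n {
            version[p] += 1;
            if let Some(r) = get_rank(p, next[i]) {
                heap.push(Reverse((r, p, version[p])));
            }
        }
    }

    (0..n).filter(|&i| alive[i]).map(|i| (i, next[i])).collect()
}
```

Each merge costs O(log n) heap work and pushes at most two new candidates, giving the O(m·log n) total after O(n) initialization.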

@ArchishmanSengupta

ArchishmanSengupta commented Oct 22, 2025

@hauntsaninja can you take a look at this?

antimora added a commit to antimora/wordchipper that referenced this pull request Feb 22, 2026
Replace MergeHeapSpanEncoder internals with a true min-heap + linked-list
algorithm. The old implementation did O(n) linear scans per merge round
(same as IncrementalSweepSpanEncoder). The new approach uses a BinaryHeap
for O(log n) min-finding and a doubly-linked list for O(1) token removal,
giving O(m log n) total vs O(m*n).

Uses lazy deletion via per-position generation counters to avoid expensive
heap removal. This is the same approach as tiktoken PR #442
(openai/tiktoken#442), though that targets an
O(n log n) algorithm which is more complex.
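The lazy-deletion pattern mentioned above can be isolated in a few lines: instead of removing a heap entry when its key changes, bump a per-slot generation counter and push a fresh entry; stale entries are discarded at pop time. This is an illustrative sketch, not wordchipper's actual code — the `LazyMinHeap` name and its API are assumptions.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Min-heap with lazy deletion via per-slot generation counters.
struct LazyMinHeap {
    heap: BinaryHeap<Reverse<(u32, usize, u32)>>, // (key, slot, generation)
    gen: Vec<u32>,
}

impl LazyMinHeap {
    fn new(slots: usize) -> Self {
        Self { heap: BinaryHeap::new(), gen: vec![0; slots] }
    }

    /// Re-key a slot: invalidate any prior entry and push a fresh one.
    fn update(&mut self, slot: usize, key: u32) {
        self.gen[slot] += 1; // any entry pushed earlier for this slot is now stale
        self.heap.push(Reverse((key, slot, self.gen[slot])));
    }

    /// Pop the minimum live entry, skipping stale ones.
    fn pop_min(&mut self) -> Option<(usize, u32)> {
        while let Some(Reverse((key, slot, g))) = self.heap.pop() {
            if g == self.gen[slot] {
                self.gen[slot] += 1; // consume: the slot has no current entry now
                return Some((slot, key));
            }
        }
        None
    }
}
```

The trade-off is memory for speed: stale entries linger in the heap until popped, but each `update` is O(log n) instead of the O(n) a true decrease-key-less removal would cost on `BinaryHeap`.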

Wire MergeHeapSpanEncoder as the default encoder.

Benchmark results (median MB/s, single-threaded):

  English cl100k:  90 -> 123 MB/s (+37%)
  English o200k:   87 -> 111 MB/s (+28%)
  Diverse cl100k:  57 ->  92 MB/s (+61%)
  Diverse o200k:   29 ->  84 MB/s (+190%)

Multilingual slowdown ratio (diverse vs english):
  cl100k: 37% -> 25%
  o200k:  67% -> 25%

Fixes zspacelabs#173
antimora added a commit to antimora/wordchipper that referenced this pull request Feb 22, 2026
Add a new HybridSpanEncoder that picks the best merge strategy per span:
- Short spans (<= 16 bytes): inline linear sweep, same as IncrementalSweepSpanEncoder
- Long spans (> 16 bytes): min-heap + doubly-linked list for O(m log n) merging

The heap path uses a BinaryHeap for O(log n) min-finding, index arrays for
O(1) linked-list removal, and generation counters for lazy staleness detection.
Similar approach to openai/tiktoken#442, though that
targets an O(n log n) algorithm which is more complex.

Wire HybridSpanEncoder as the default. Existing IncrementalSweepSpanEncoder
and MergeHeapSpanEncoder are preserved for side-by-side comparison.
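The per-span dispatch described above can be sketched as follows. The 16-byte threshold comes from the commit message; everything else is an assumption — the function names are illustrative, and the naive rank-ordered sweep stands in for both paths so the sketch stays self-contained (the long-span branch would really take the heap + linked-list path).

```rust
const INLINE_THRESHOLD: usize = 16; // threshold from the commit message

// Naive linear sweep: scan for the lowest-ranked adjacent pair, merge it,
// repeat. O(m·n), but with tiny constants that win on short spans.
fn sweep_merge(symbols: &mut Vec<Vec<u8>>, rank: impl Fn(&[u8]) -> Option<usize>) {
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..symbols.len().saturating_sub(1) {
            let merged = [symbols[i].as_slice(), symbols[i + 1].as_slice()].concat();
            if let Some(r) = rank(&merged) {
                if best.map_or(true, |(br, _)| r < br) {
                    best = Some((r, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                let right = symbols.remove(i + 1);
                symbols[i].extend_from_slice(&right);
            }
            None => return,
        }
    }
}

// Length-based dispatch: pick the merge strategy per span.
fn encode_span(span: &[u8], rank: impl Fn(&[u8]) -> Option<usize>) -> Vec<Vec<u8>> {
    let mut symbols: Vec<Vec<u8>> = span.iter().map(|&b| vec![b]).collect();
    if span.len() <= INLINE_THRESHOLD {
        sweep_merge(&mut symbols, rank); // short spans: inline linear sweep
    } else {
        // Long spans would use the heap + linked-list O(m log n) path;
        // the sweep stands in here to keep the sketch runnable.
        sweep_merge(&mut symbols, rank);
    }
    symbols
}
```

The design point is that asymptotics only matter once spans are long enough to amortize the heap's constant factors, which is why the benchmarks above show the hybrid holding the sweep's short-span throughput while fixing the long-span worst case.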

Benchmark results (median MB/s, single-threaded divan):

  English cl100k:  126 -> 124 MB/s (within noise)
  English o200k:   120 -> 118 MB/s (within noise)
  Diverse cl100k:   93 ->  91 MB/s (within noise)
  Diverse o200k:    83 ->  80 MB/s (within noise)
  sample-timer:    961 -> 990 MB/s (+3%)

The big win is vs the issue zspacelabs#173 baseline where diverse o200k was 29 MB/s.

Addresses zspacelabs#173