Optimize _byte_pair_merge to O(m log n) using heap-based candidate selection #442

Open

ArchishmanSengupta wants to merge 3 commits into openai:main from ArchishmanSengupta:enchancement/mlogn

Conversation

@ArchishmanSengupta

Summary

- Replaced the O(m·n) sequential merge scan with a heap-driven algorithm that keeps candidate merges in a heap ordered by rank (lowest rank popped first), updating only the local neighbors after each merge.

- This yields O(m·log n) behavior, where:
  m: the number of merges performed
  n: the number of initial symbols

Key changes:

  1. Added a BinaryHeap and a candidate struct to _byte_pair_merge, maintaining a linked list of live nodes and per-position version counters to avoid acting on stale heap entries.
  2. Computes local ranks via compute_rank_at and updates only the affected neighbors after each merge.
  3. Added targeted unit tests for _byte_pair_merge boundary cases.

Complexity:

Before: repeated linear scans → O(m·n) in the worst case.
After: heap operations per merge → O(m·log n), with O(n) initialization.
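The approach described above (a linked list over positions, a heap ordered by rank, and per-position version counters for lazy deletion of stale entries) can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the function name `byte_pair_merge_sketch` and the `get_rank(lo, hi)` byte-range signature are assumptions made to keep the example self-contained.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge adjacent pieces in rank order until no mergeable pair remains.
/// `get_rank(lo, hi)` returns the rank of the byte range [lo, hi) if it is
/// in the vocabulary. Returns the surviving (start, end) spans.
fn byte_pair_merge_sketch<F>(n: usize, get_rank: F) -> Vec<(usize, usize)>
where
    F: Fn(usize, usize) -> Option<usize>,
{
    // Doubly-linked list over positions 0..n; usize::MAX marks "no predecessor".
    let mut prev: Vec<usize> = (0..n).map(|i| i.wrapping_sub(1)).collect();
    let mut next: Vec<usize> = (1..=n).collect();
    let mut alive = vec![true; n];
    // Version counter per position: heap entries carrying an old version are stale.
    let mut version = vec![0usize; n];

    // Min-heap (via Reverse) keyed by (rank, position, version).
    let mut heap: BinaryHeap<Reverse<(usize, usize, usize)>> = BinaryHeap::new();
    for i in 0..n.saturating_sub(1) {
        if let Some(r) = get_rank(i, i + 2) {
            heap.push(Reverse((r, i, 0)));
        }
    }

    while let Some(Reverse((_rank, i, v))) = heap.pop() {
        if !alive[i] || version[i] != v {
            continue; // lazy deletion: entry is stale or its position is dead
        }
        let j = next[i];
        if j >= n {
            continue;
        }
        // Merge: absorb the piece at j into the piece at i.
        alive[j] = false;
        next[i] = next[j];
        if next[i] < n {
            prev[next[i]] = i;
        }
        // Refresh the pair (i, successor of i) under a new version.
        version[i] += 1;
        let k = next[i];
        if k < n {
            if let Some(r) = get_rank(i, next[k]) {
                heap.push(Reverse((r, i, version[i])));
            }
        }
        // Refresh the pair (predecessor of i, i): its byte range also changed.
        let p = prev[i];
        if p < n {
            version[p] += 1;
            if let Some(r) = get_rank(p, next[i]) {
                heap.push(Reverse((r, p, version[p])));
            }
        }
    }

    (0..n).filter(|&i| alive[i]).map(|i| (i, next[i])).collect()
}
```

Each merge costs O(log n) heap work and pushes at most two new candidates, giving the O(m·log n) total after O(n) initialization.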

@ArchishmanSengupta

ArchishmanSengupta commented Oct 22, 2025

@hauntsaninja can you take a look at this?

antimora added a commit to antimora/wordchipper that referenced this pull request Feb 22, 2026
Replace MergeHeapSpanEncoder internals with a true min-heap + linked-list
algorithm. The old implementation did O(n) linear scans per merge round
(same as IncrementalSweepSpanEncoder). The new approach uses a BinaryHeap
for O(log n) min-finding and a doubly-linked list for O(1) token removal,
giving O(m log n) total vs O(m*n).

Uses lazy deletion via per-position generation counters to avoid expensive
heap removal. This is the same approach as tiktoken PR #442
(openai/tiktoken#442), though that targets an
O(n log n) algorithm which is more complex.
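The lazy-deletion pattern mentioned above can be isolated in a few lines: instead of removing a heap entry when its key changes, bump a per-slot generation counter and push a fresh entry; stale entries are discarded at pop time. This is an illustrative sketch, not wordchipper's actual code — the `LazyMinHeap` name and its API are assumptions.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Min-heap with lazy deletion via per-slot generation counters.
struct LazyMinHeap {
    heap: BinaryHeap<Reverse<(u32, usize, u32)>>, // (key, slot, generation)
    gen: Vec<u32>,
}

impl LazyMinHeap {
    fn new(slots: usize) -> Self {
        Self { heap: BinaryHeap::new(), gen: vec![0; slots] }
    }

    /// Re-key a slot: invalidate any prior entry and push a fresh one.
    fn update(&mut self, slot: usize, key: u32) {
        self.gen[slot] += 1; // any entry pushed earlier for this slot is now stale
        self.heap.push(Reverse((key, slot, self.gen[slot])));
    }

    /// Pop the minimum live entry, skipping stale ones.
    fn pop_min(&mut self) -> Option<(usize, u32)> {
        while let Some(Reverse((key, slot, g))) = self.heap.pop() {
            if g == self.gen[slot] {
                self.gen[slot] += 1; // consume: the slot has no current entry now
                return Some((slot, key));
            }
        }
        None
    }
}
```

The trade-off is memory for speed: stale entries linger in the heap until popped, but each `update` is O(log n) instead of the O(n) a true decrease-key-less removal would cost on `BinaryHeap`.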

Wire MergeHeapSpanEncoder as the default encoder.

Benchmark results (median MB/s, single-threaded):

  English cl100k:  90 -> 123 MB/s (+37%)
  English o200k:   87 -> 111 MB/s (+28%)
  Diverse cl100k:  57 ->  92 MB/s (+61%)
  Diverse o200k:   29 ->  84 MB/s (+190%)

Multilingual slowdown ratio (diverse vs english):
  cl100k: 37% -> 25%
  o200k:  67% -> 25%

Fixes zspacelabs#173
antimora added a commit to antimora/wordchipper that referenced this pull request Feb 22, 2026
Add a new HybridSpanEncoder that picks the best merge strategy per span:
- Short spans (<= 16 bytes): inline linear sweep, same as IncrementalSweepSpanEncoder
- Long spans (> 16 bytes): min-heap + doubly-linked list for O(m log n) merging

The heap path uses a BinaryHeap for O(log n) min-finding, index arrays for
O(1) linked-list removal, and generation counters for lazy staleness detection.
Similar approach to openai/tiktoken#442, though that
targets an O(n log n) algorithm which is more complex.

Wire HybridSpanEncoder as the default. Existing IncrementalSweepSpanEncoder
and MergeHeapSpanEncoder are preserved for side-by-side comparison.
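The per-span dispatch described above can be sketched as follows. The 16-byte threshold comes from the commit message; everything else is an assumption — the function names are illustrative, and the naive rank-ordered sweep stands in for both paths so the sketch stays self-contained (the long-span branch would really take the heap + linked-list path).

```rust
const INLINE_THRESHOLD: usize = 16; // threshold from the commit message

// Naive linear sweep: scan for the lowest-ranked adjacent pair, merge it,
// repeat. O(m·n), but with tiny constants that win on short spans.
fn sweep_merge(symbols: &mut Vec<Vec<u8>>, rank: impl Fn(&[u8]) -> Option<usize>) {
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..symbols.len().saturating_sub(1) {
            let merged = [symbols[i].as_slice(), symbols[i + 1].as_slice()].concat();
            if let Some(r) = rank(&merged) {
                if best.map_or(true, |(br, _)| r < br) {
                    best = Some((r, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                let right = symbols.remove(i + 1);
                symbols[i].extend_from_slice(&right);
            }
            None => return,
        }
    }
}

// Length-based dispatch: pick the merge strategy per span.
fn encode_span(span: &[u8], rank: impl Fn(&[u8]) -> Option<usize>) -> Vec<Vec<u8>> {
    let mut symbols: Vec<Vec<u8>> = span.iter().map(|&b| vec![b]).collect();
    if span.len() <= INLINE_THRESHOLD {
        sweep_merge(&mut symbols, rank); // short spans: inline linear sweep
    } else {
        // Long spans would use the heap + linked-list O(m log n) path;
        // the sweep stands in here to keep the sketch runnable.
        sweep_merge(&mut symbols, rank);
    }
    symbols
}
```

The design point is that asymptotics only matter once spans are long enough to amortize the heap's constant factors, which is why the benchmarks above show the hybrid holding the sweep's short-span throughput while fixing the long-span worst case.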

Benchmark results (median MB/s, single-threaded divan):

  English cl100k:  126 -> 124 MB/s (within noise)
  English o200k:   120 -> 118 MB/s (within noise)
  Diverse cl100k:   93 ->  91 MB/s (within noise)
  Diverse o200k:    83 ->  80 MB/s (within noise)
  sample-timer:    961 -> 990 MB/s (+3%)

The big win is vs the issue zspacelabs#173 baseline where diverse o200k was 29 MB/s.

Addresses zspacelabs#173