Some edits and formatting cleanup. #125
Pull request overview
This PR relocates the casefold performance/release blog content from crates/casefold/BLOG.md into a docs-targeted Markdown file intended for rendering from crates/casefold/docs/.
Changes:
- Added `crates/casefold/docs/release_blog.md` containing the blog post content.
- Removed the previous `crates/casefold/BLOG.md` version of the post.
Show a summary per file
| File | Description |
|---|---|
| `crates/casefold/docs/release_blog.md` | New docs-hosted blog Markdown; currently contains several Markdown/emphasis and Rust snippet formatting issues that affect rendering/copy-paste correctness. |
| `crates/casefold/BLOG.md` | Deleted the prior blog post file from the crate root (content moved to docs). |
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 11
| These diverge on real characters (`ß`, `İ`, final sigma), and lowercasing as a stand-in silently produces incorrect matches. | ||
| This crate implements the **simple** (1-to-1) folds, statuses `C` and `S` in | ||
| [`CaseFolding.txt`](https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt), and deliberately | ||
| *not* the multi-character "full" folds (`ß` → `ss`) or Turkic locale folds. |
| let mut high_bit_acc: u8 = 0; | ||
| for b in &mut bytes { | ||
| high_bit_acc |= *b; // detect any non-ASCII byte | ||
| let is_upper = b.wrapping_sub(b'A') < 26; // branchless A..=Z test | ||
| *b |= u8::from(is_upper) << 5; // set bit 5 → lowercase, else no-op | ||
| } | ||
| if high_bit_acc & 0x80 == 0 { | ||
| return bytes; // pure ASCII: already folded in place, no second buffer | ||
| } |
| 40 GiB/s also means doing zero unnecessary allocation. `simple_fold` takes the input `String` *by value*, so it owns the heap buffer and can mutate and return it. | ||
| If the OR-accumulator's high bit is clear, the input was pure ASCII and is already folded in place, so we hand the **same allocation** straight back: no second buffer and no copy. | ||
| Otherwise we `memchr` to the first non-ASCII byte and scan the tail from there, leaving the output buffer *unallocated* (a null write cursor) until we hit a character that folds to **different bytes**. | ||
| Text whose multibyte content never folds (CJK, Hangul, Kana, Arabic, Hebrew, symbols) also returns the original allocation untouched, never copying a byte. | ||
|
|
||
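The same-allocation claim in the paragraph above can be checked with a small, self-contained sketch of the ASCII pass alone. `fold_ascii_in_place` is a hypothetical stand-in, not the crate's `simple_fold`: it only reports whether the input was pure ASCII, leaves any multibyte tail unfolded, and re-validates UTF-8 through the safe `String::from_utf8`, which a tuned implementation would presumably avoid.

```rust
/// Hypothetical sketch of the ASCII fast path only (not the crate's API).
/// Returns the lowercased string plus a flag saying whether the input was
/// pure ASCII; non-ASCII characters are passed through unchanged.
fn fold_ascii_in_place(s: String) -> (String, bool) {
    // Take ownership of the heap buffer; no allocation happens here.
    let mut bytes = s.into_bytes();
    let mut high_bit_acc: u8 = 0;
    for b in &mut bytes {
        high_bit_acc |= *b; // bit 7 ends up set if any byte is non-ASCII
        let is_upper = b.wrapping_sub(b'A') < 26; // true only for b'A'..=b'Z'
        *b |= u8::from(is_upper) << 5; // set bit 5 for uppercase, no-op otherwise
    }
    let pure_ascii = high_bit_acc & 0x80 == 0;
    // Only bytes in b'A'..=b'Z' were modified, so the buffer is still valid
    // UTF-8. The safe constructor re-checks that and reuses the same Vec.
    let out = String::from_utf8(bytes).expect("ASCII-only edits keep UTF-8 valid");
    (out, pure_ascii)
}
```

For a pure-ASCII input, comparing `as_ptr()` before and after the call shows that the returned `String` still points at the original heap buffer.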
| Why a *second* buffer rather than rewriting in place like the ASCII pass? Because folding can make the string **longer**: | ||
| almost every fold preserves the UTF-8 length or shrinks it, but two outliers grow: U+023A (`Ⱥ`) and U+023E (`Ⱦ`) are 2 bytes each yet fold to 3-byte characters (`ⱥ`, `ⱦ`). | ||
| Once one appears, the output no longer fits in the input's bytes, and we need somewhere new to write. |
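The growth case is easy to confirm with the standard library alone. The sketch below uses `char::to_lowercase` as a stand-in for the simple fold, on the assumption (true for U+023A and U+023E, whose folds are U+2C65 and U+2C66) that the two mappings coincide for the characters tested; `fold_lengths` is a hypothetical helper, not part of the crate.

```rust
/// Hypothetical helper: UTF-8 length of a character before and after folding.
/// Assumption: std's lowercase mapping stands in for the simple case fold,
/// which holds for the code points exercised here but not for every character.
fn fold_lengths(c: char) -> (usize, usize) {
    // `to_lowercase` yields an iterator; a 1-to-1 mapping produces one char.
    let folded = c.to_lowercase().next().unwrap_or(c);
    (c.len_utf8(), folded.len_utf8())
}
```

Calling `fold_lengths('\u{023A}')` or `fold_lengths('\u{023E}')` reports a 2-byte input and a 3-byte result, while a character such as U+212A (KELVIN SIGN) shrinks from 3 bytes to 1.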
| The pure-ASCII row is the fairest fight of all: there `str::to_lowercase` | ||
| produces the **exact same bytes** we do (a correct std-library baseline rather than a different operation), and even then the branch-free sweep is ~1.5× faster (40.8 vs 27.7 GiB/s), because | ||
| `to_lowercase` still scans for the first non-ASCII byte and allocates a fresh |
| let (word_idx, bit_idx, c_len) = if lead < 0xE0 { | ||
| (0usize, lead & 0x1F, 2usize) // 2-byte: word 0 | ||
| } else if lead < 0xF0 { | ||
| ((lead & 0x0F) as usize, bytes[read + 1] & 0x3F, 3) // 3-byte: word = nibble | ||
| } else { | ||
| ((((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize, bytes[read + 2] & 0x3F, 4usize) // 4-byte: merge 2 bytes | ||
| }; | ||
| // reject without decoding: clear bit ⇒ no fold | ||
| if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> bit_idx) & 1 == 0 { | ||
| read += c_len; | ||
| continue; | ||
| } |
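One way to read the byte-space arithmetic in this hunk: if the bitmap holds 64-bit words with one bit per 64-code-point page (an assumption the masks suggest but the excerpt does not state), then the word index is `cp >> 12` and the bit index is `(cp >> 6) & 0x3F`, and both can be read straight off the UTF-8 bytes. The sketch below compares the two routes; `page_from_bytes`, `page_from_char`, and `pages_agree` are hypothetical names, and the input is assumed to be a well-formed, non-ASCII character.

```rust
/// Page coordinates (word index, bit index, encoded length) taken directly
/// from the UTF-8 bytes, mirroring the arithmetic in the excerpt.
/// Assumes `bytes` starts with a well-formed multibyte sequence.
fn page_from_bytes(bytes: &[u8]) -> (usize, u8, usize) {
    let lead = bytes[0];
    if lead < 0xE0 {
        // 2-byte: code points below U+0800 all live in word 0.
        (0, lead & 0x1F, 2)
    } else if lead < 0xF0 {
        // 3-byte: the lead's low nibble is cp >> 12.
        ((lead & 0x0F) as usize, bytes[1] & 0x3F, 3)
    } else {
        // 4-byte: cp >> 12 spans the lead's low 3 bits and the second byte.
        let word = (((lead & 0x07) as usize) << 6) | (bytes[1] & 0x3F) as usize;
        (word, bytes[2] & 0x3F, 4)
    }
}

/// Reference computation from the decoded scalar value.
fn page_from_char(c: char) -> (usize, u8, usize) {
    let cp = c as u32;
    ((cp >> 12) as usize, ((cp >> 6) & 0x3F) as u8, c.len_utf8())
}

/// True when the byte-space and decoded routes give the same coordinates.
fn pages_agree(c: char) -> bool {
    let mut buf = [0u8; 4];
    let encoded = c.encode_utf8(&mut buf);
    page_from_bytes(encoded.as_bytes()) == page_from_char(c)
}
```

Under that layout the lookup never needs the character's low six bits, which is why the check can reject a character without decoding it.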
|
|
||
| The crate is [`casefold`](../README.md); the generated table and full design notes live alongside the source. | ||
|
|
||
| [^overlong]: The byte-space arithmetic assumes the input is **well-formed, shortest-form UTF-8 |
We probably need to decide if we want to use footnotes, info boxes or both.
Something which expands inline would be nice IMO.
Addresses PR #125 review: move the 'Treat the absolute figures as illustrative' note out of the table intro into a [^bench] footnote defined at the bottom of the file alongside [^overlong]. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
krukow left a comment
Looks good. My main feedback is to explain the concept/problem a bit before showing the performance table, and to explain each of the columns and rows in that table (right now it's hard for a reader unfamiliar with the subject to understand what the terms in the table refer to).
|
|
||
| Let's walk through the evolution in detail. | ||
|
|
||
| ## Why is case-folding even important? |
I would pull this up before the data table
|
|
||
| Criterion medians on an Apple M4 (single core, `target-cpu=native`).[^bench] | ||
|
|
||
| | Workload (input size) | `simple_fold` | `simd_normalizer` | `HashMap` (byte path) | `str::to_lowercase` | `simdutf` round-trip | |
Can we explain how to read the table for a reader who is unfamiliar with its elements?
| > touches the data half as many times. It's the same lesson one more time: in the | ||
| > hot loop, the branch is the enemy. | ||
|
|
||
| It is genuinely faster to |
Can we elevate these lessons to the top of the article as a teaser, with an invitation to read on to learn more?
| | A runtime `HashMap<u32, u32>` | ~17 KB | | ||
| | **This crate (paged bitmap + packed runs)** | **1776 B** | | ||
|
|
||
| ## Takeaways |
Same for this: consider elevating these takeaways to the top of the article as a teaser, with an invitation to read on to learn more.
I honestly don't like the "takeaways".
There isn't a real guideline you can follow, and telling people to question everything is not a very efficient one. So I think it's more like: you can often find optimizations in unexpected places if you search hard enough.
And this solution won't be the final answer either.
Address Copilot review comments on PR #125: repair bold/italic markup that was split across hard line breaks (so it rendered literally), fix the 'branchs' -> 'branches' typo, and add the missing opening '**' on the Note and Tip admonition lead-ins. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>