Some edits and formatting cleanup. #125

Open: gorzell wants to merge 8 commits into aneubeck/foldblog from gorzell/blog-edits

Conversation

@gorzell gorzell commented Jun 10, 2026

Contributor

Copilot AI review requested due to automatic review settings June 10, 2026 12:51
@gorzell gorzell requested a review from a team as a code owner June 10, 2026 12:51

Copilot AI left a comment

Contributor

Pull request overview

This PR relocates the casefold performance/release blog content from crates/casefold/BLOG.md into a docs-targeted Markdown file intended for rendering from crates/casefold/docs/.

Changes:

  • Added crates/casefold/docs/release_blog.md containing the blog post content.
  • Removed the previous crates/casefold/BLOG.md version of the post.

Summary per file:

| File | Description |
| --- | --- |
| crates/casefold/docs/release_blog.md | New docs-hosted blog Markdown; currently contains several Markdown/emphasis and Rust snippet formatting issues that affect rendering/copy-paste correctness. |
| crates/casefold/BLOG.md | Deleted the prior blog post file from the crate root (content moved to docs). |

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 11

Comment thread crates/casefold/docs/release_blog.md Outdated
Comment on lines +35 to +39
These diverge on real characters (`ß`, `İ`, final sigma), and lowercasing as a stand-in silently produces incorrect matches. This crate implements the **simple** (1-to-1) folds, statuses `C` and `S` in [`CaseFolding.txt`](https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt), and deliberately *not* the multi-character "full" folds (`ß` → `ss`) or Turkic locale folds.
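
To make the divergence concrete, here is a minimal sketch comparing `str::to_lowercase` with a toy 1-to-1 fold. The names `toy_simple_fold_char`, `toy_simple_fold`, and `compare_both_ways` are hypothetical, and the toy table holds only a few rows hand-copied from `CaseFolding.txt`; it is not the crate's generated table or its API.

```rust
/// Toy stand-in for a simple (1-to-1) fold. It covers only a few rows
/// hand-copied from CaseFolding.txt, NOT the crate's generated table.
fn toy_simple_fold_char(c: char) -> char {
    match c {
        'A'..='Z' => c.to_ascii_lowercase(),
        // Greek capitals fold to the letter 0x20 above them (status C).
        // U+03A2 is unassigned, hence the gap between the two ranges.
        '\u{0391}'..='\u{03A1}' | '\u{03A3}'..='\u{03AB}' => {
            char::from_u32(c as u32 + 0x20).unwrap_or(c)
        }
        '\u{03C2}' => '\u{03C3}', // final sigma folds to regular sigma (status C)
        '\u{1E9E}' => '\u{00DF}', // capital sharp s folds to sharp s (status S)
        _ => c,
    }
}

/// Applies the toy fold to every character of `s`.
fn toy_simple_fold(s: &str) -> String {
    s.chars().map(toy_simple_fold_char).collect()
}

/// Returns (equal after lowercasing, equal after toy folding).
fn compare_both_ways(a: &str, b: &str) -> (bool, bool) {
    (
        a.to_lowercase() == b.to_lowercase(),
        toy_simple_fold(a) == toy_simple_fold(b),
    )
}
```

With `a = "\u{039F}\u{0394}\u{039F}\u{03A3}"` (ΟΔΟΣ) and `b = "\u{03BF}\u{03B4}\u{03BF}\u{03C3}"` (οδοσ), lowercasing turns the word-final capital sigma into `ς`, so the lowercased strings differ, while both fold to the same text.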
Comment thread crates/casefold/docs/release_blog.md
Comment thread crates/casefold/docs/release_blog.md Outdated
Comment on lines +80 to +88
let mut high_bit_acc: u8 = 0;
for b in &mut bytes {
    high_bit_acc |= *b; // detect any non-ASCII byte
    let is_upper = b.wrapping_sub(b'A') < 26; // branchless A..=Z test
    *b |= u8::from(is_upper) << 5; // set bit 5 → lowercase, else no-op
}
if high_bit_acc & 0x80 == 0 {
    return bytes; // pure ASCII: already folded in place, no second buffer
}
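
As a self-contained sketch of the same sweep, the fragment above can be wrapped in a function that reports whether the input was pure ASCII. The name `ascii_sweep` and the `bool` return value are assumptions made for illustration; the crate's real function returns the buffer itself.

```rust
/// Lowercases ASCII `A..=Z` in place without branching on each byte.
/// Returns true when every byte was ASCII (the high bit was never set).
fn ascii_sweep(bytes: &mut [u8]) -> bool {
    let mut high_bit_acc: u8 = 0;
    for b in bytes.iter_mut() {
        high_bit_acc |= *b; // accumulate every bit seen, including bit 7
        // For bytes outside A..=Z the subtraction wraps to a value >= 26.
        let is_upper = (*b).wrapping_sub(b'A') < 26;
        *b |= u8::from(is_upper) << 5; // OR in 0x20 for uppercase, 0 otherwise
    }
    high_bit_acc & 0x80 == 0
}
```

Because the range test only matches `0x41..=0x5A`, bytes of multibyte UTF-8 sequences pass through unchanged, so the sweep is safe to run before knowing whether the input is pure ASCII.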
Comment thread crates/casefold/docs/release_blog.md Outdated
Comment thread crates/casefold/docs/release_blog.md Outdated
Comment on lines +153 to +163
40 GiB/s also means doing zero unnecessary allocation. `simple_fold` takes the input `String` *by value*, so it owns the heap buffer and can mutate and return it. If the OR-accumulator's high bit is clear, the input was pure ASCII and is already folded in place, so we hand the **same allocation** straight back: no second buffer and no copy. Otherwise we `memchr` to the first non-ASCII byte and scan the tail from there, leaving the output buffer *unallocated* (a null write cursor) until we hit a character that folds to **different bytes**. Text whose multibyte content never folds (CJK, Hangul, Kana, Arabic, Hebrew, symbols) also returns the original allocation untouched, never copying a byte.
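
The "allocate only when something changes" idea can be sketched as follows. This is a simplified illustration that walks decoded `char`s rather than raw bytes; `fold_lazily` and its `fold_char` parameter are hypothetical names, not the crate's API.

```rust
/// Folds `input` with the caller-supplied `fold_char`, allocating an output
/// buffer only once some character actually changes.
fn fold_lazily(input: String, fold_char: impl Fn(char) -> char) -> String {
    let mut out: Option<String> = None; // stays None until the first change
    for (i, c) in input.char_indices() {
        let folded = fold_char(c);
        if let Some(buf) = out.as_mut() {
            buf.push(folded);
        } else if folded != c {
            // First changed character: copy the untouched prefix once.
            let mut buf = String::with_capacity(input.len() + 2);
            buf.push_str(&input[..i]);
            buf.push(folded);
            out = Some(buf);
        }
    }
    // Nothing changed: hand back the original allocation.
    out.unwrap_or(input)
}
```

When no character changes, the returned `String` is the one passed in, so its heap pointer is identical to the input's.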

Why a *second* buffer rather than rewriting in place like the ASCII pass? Because folding can make the string **longer**: almost every fold preserves the UTF-8 length or shrinks it, but two outliers grow. U+023A (`Ⱥ`) and U+023E (`Ⱦ`) are 2 bytes each yet fold to 3-byte characters (`ⱥ`, `ⱦ`). Once one appears, the output no longer fits in the input's bytes, and we need somewhere new to write.
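
The length claim is easy to check with `char::len_utf8`. The helper below is a hypothetical illustration; the pairs passed to it in the note that follows are hand-copied from `CaseFolding.txt`.

```rust
/// Returns the (from, to) pairs whose target needs more UTF-8 bytes
/// than the source character.
fn growing_folds(pairs: &[(char, char)]) -> Vec<(char, char)> {
    pairs
        .iter()
        .copied()
        .filter(|(from, to)| to.len_utf8() > from.len_utf8())
        .collect()
}
```

Given `('\u{023A}', '\u{2C65}')`, `('\u{023E}', '\u{2C66}')`, `('\u{212A}', 'k')`, `('\u{1E9E}', '\u{00DF}')`, and `('A', 'a')`, only the first two pairs are returned: the Kelvin sign and capital sharp s shrink, and ASCII keeps its length.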
Comment thread crates/casefold/docs/release_blog.md Outdated
Comment on lines +406 to +409
The pure-ASCII row is the fairest fight of all: there `str::to_lowercase` produces the **exact same bytes** we do (a correct std-library baseline rather than a different operation), and even then the branch-free sweep is ~1.5× faster (40.8 vs 27.7 GiB/s), because `to_lowercase` still scans for the first non-ASCII byte and allocates a fresh
Comment on lines +220 to +231
let (word_idx, bit_idx, c_len) = if lead < 0xE0 {
    (0usize, lead & 0x1F, 2usize) // 2-byte: word 0
} else if lead < 0xF0 {
    ((lead & 0x0F) as usize, bytes[read + 1] & 0x3F, 3) // 3-byte: word = nibble
} else {
    (
        (((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize,
        bytes[read + 2] & 0x3F,
        4usize,
    ) // 4-byte: merge 2 bytes
};
// reject without decoding: clear bit ⇒ no fold
if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> bit_idx) & 1 == 0 {
    read += c_len;
    continue;
}
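
One way to read the index arithmetic above: for well-formed UTF-8 the word index equals `cp >> 12` and the bit index equals `(cp >> 6) & 0x3F`, so each bit appears to cover a page of 64 consecutive code points. The sketch below checks that reading against the decoded code point. All function names and the toy bitmap are hypothetical; the real `PAGE_BITMAP` is generated and is not reproduced here.

```rust
/// Page coordinates read straight from the UTF-8 bytes of one multibyte
/// character: (bitmap word, bit within the word, encoded length).
fn page_from_bytes(bytes: &[u8]) -> (usize, u32, usize) {
    let lead = bytes[0];
    if lead < 0xE0 {
        (0, u32::from(lead & 0x1F), 2)
    } else if lead < 0xF0 {
        (usize::from(lead & 0x0F), u32::from(bytes[1] & 0x3F), 3)
    } else {
        let word = (usize::from(lead & 0x07) << 6) | usize::from(bytes[1] & 0x3F);
        (word, u32::from(bytes[2] & 0x3F), 4)
    }
}

/// The same coordinates computed from the decoded code point.
fn page_from_char(c: char) -> (usize, u32, usize) {
    let cp = c as u32;
    ((cp >> 12) as usize, (cp >> 6) & 0x3F, c.len_utf8())
}

/// Builds a toy bitmap with a bit set for every page holding a listed char.
fn toy_bitmap(folding: &[char]) -> Vec<u64> {
    let mut words = Vec::new();
    for &c in folding {
        let (word, bit, _) = page_from_char(c);
        if words.len() <= word {
            words.resize(word + 1, 0u64);
        }
        words[word] |= 1u64 << bit;
    }
    words
}

/// True when the page bit is set, i.e. the character might fold.
fn may_fold(bitmap: &[u64], bytes: &[u8]) -> bool {
    let (word, bit, _) = page_from_bytes(bytes);
    word < bitmap.len() && (bitmap[word] >> bit) & 1 == 1
}
```

A set bit only says "some character on this page folds", so under this reading a hit still needs a table lookup, while a clear bit lets the loop skip the character without decoding it.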
Comment thread crates/casefold/docs/release_blog.md Outdated

The crate is [`casefold`](../README.md); the generated table and full design notes live alongside the source.

[^overlong]: The byte-space arithmetic assumes the input is **well-formed, shortest-form UTF-8

Contributor Author

We probably need to decide if we want to use footnotes, info boxes or both.

Collaborator

Something which expands inline would be nice IMO.

Addresses PR #125 review: move the 'Treat the absolute figures as
illustrative' note out of the table intro into a [^bench] footnote
defined at the bottom of the file alongside [^overlong].

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@krukow krukow left a comment

Looks good. My main feedback is to explain the concept/problem a bit before showing the performance table, and also to explain each of the columns and rows in that table (right now it's hard for a reader unfamiliar with the subject to understand what the terms in the table refer to).


Let's walk through the evolution in detail.

## Why is case folding even important?

I would pull this up before the data table


Criterion medians on an Apple M4 (single core, `target-cpu=native`).[^bench]

| Workload (input size) | `simple_fold` | `simd_normalizer` | `HashMap` (byte path) | `str::to_lowercase` | `simdutf` round-trip |

Can we explain how to read the table for a reader who is unfamiliar with its elements?

> touches the data half as many times. It's the same lesson one more time: in the
> hot loop, the branch is the enemy.

It is genuinely faster to

Can we elevate these lessons to the top of the article as a teaser, with a "read on to learn more"?

| A runtime `HashMap<u32, u32>` | ~17 KB |
| **This crate (paged bitmap + packed runs)** | **1776 B** |

## Takeaways

Same for this: consider elevating these takeaways to the top of the article as a teaser, with a "read on to learn more".

Collaborator

I honestly don't like the "takeaways".
There isn't a real guideline you can follow and telling ppl to question everything is not really a very efficient one. So, I think it's more like: you can often find optimizations in unexpected places if you are searching hard enough.
And this solution won't be the final answer either.

Address Copilot review comments on PR #125: repair bold/italic markup
that was split across hard line breaks (so it rendered literally), fix
the 'branchs' -> 'branches' typo, and add the missing opening '**' on
the Note and Tip admonition lead-ins.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

4 participants