perf: Optimize split_part, support Utf8View #21119

Open
neilconway wants to merge 6 commits into apache:main from neilconway:neilc/optimize-split-part

@neilconway
Contributor

@neilconway neilconway commented Mar 23, 2026

Which issue does this PR close?

Rationale for this change

split_part currently accepts Utf8View but always returns Utf8. When given Utf8View input, it should instead return Utf8View output.

While we're at it, optimize split_part for single-character delimiters (the common case). str::split(&str) is significantly slower than str::split(char) for single-character ASCII delimiters: the former uses a general string-matching algorithm, whereas the latter uses memchr::memchr.
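To illustrate the dispatch described above, here is a minimal sketch in plain Rust. It is not the code from this PR: `nth_part` is a hypothetical helper, it works on a single `&str` rather than Arrow arrays, it takes the fast path for any single `char` (not only ASCII), and it gives empty delimiters no special treatment.

```rust
/// Hypothetical helper (not from the PR): returns the `n`-th field of `s`
/// split on `delim`, using 1-based positions. A negative `n` counts from the
/// end. Returns `None` when `n` is zero or out of range.
fn nth_part<'a>(s: &'a str, delim: &str, n: i64) -> Option<&'a str> {
    if n == 0 {
        return None;
    }
    // Convert the 1-based position into a 0-based iterator index.
    let idx = usize::try_from(n.unsigned_abs() - 1).ok()?;

    let mut chars = delim.chars();
    match (chars.next(), chars.next()) {
        // Exactly one character: split on a `char` pattern, which avoids the
        // general substring searcher.
        (Some(c), None) => {
            if n > 0 {
                s.split(c).nth(idx)
            } else {
                s.rsplit(c).nth(idx)
            }
        }
        // Anything else: fall back to the general `&str` pattern.
        _ => {
            if n > 0 {
                s.split(delim).nth(idx)
            } else {
                s.rsplit(delim).nth(idx)
            }
        }
    }
}
```

Both branches return the same fields for a one-character delimiter; only the searcher behind `split` differs, which is where the speedup would come from.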

Benchmark results (M4 Max):

  • utf8_single_char/pos_first: 142 µs → 104 µs (-26%)
  • utf8_single_char/pos_middle: 389 µs → 365 µs (-6%)
  • utf8_single_char/pos_negative: 154 µs → 109 µs (-29%)
  • utf8_multi_char/pos_middle: 356 µs → 361 µs (~0%, noise)
  • utf8view_single_char/pos_first: 143 µs → 111 µs (-22%)
  • utf8_long_strings/pos_first: 192 µs → 120 µs (-37%)
  • utf8view_long_parts/pos_middle: 998 µs → 470 µs (-53%)
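For readers who want to get a rough feel for the `char` versus `&str` pattern difference without the DataFusion benchmark suite, the following standalone sketch times both with `std::time::Instant`. It is not the benchmark used for the numbers above; the function names, the sample data, and the iteration scheme are all assumptions, and a proper harness would give far more reliable results.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Sums the length of the first field of every row, splitting on a `char`.
fn first_field_char(rows: &[String]) -> usize {
    rows.iter().map(|s| s.split(',').next().unwrap_or("").len()).sum()
}

/// Same computation, but splitting on a one-character `&str` pattern.
fn first_field_str(rows: &[String]) -> usize {
    rows.iter().map(|s| s.split(",").next().unwrap_or("").len()).sum()
}

/// Average wall-clock time per call over `iters` runs (`iters` must be > 0).
fn time_it(f: impl Fn(&[String]) -> usize, rows: &[String], iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // `black_box` discourages the optimizer from eliding the work.
        black_box(f(black_box(rows)));
    }
    start.elapsed() / iters
}

/// Made-up sample data: a 70-byte first field followed by two short fields.
fn sample_rows(n: usize) -> Vec<String> {
    (0..n)
        .map(|i| format!("{}-field,{i},tail", "x".repeat(64)))
        .collect()
}
```

Calling `time_it(first_field_char, &rows, 100)` and `time_it(first_field_str, &rows, 100)` on the same `rows` gives two durations to compare. Build in release mode, and expect the absolute values to differ from the figures listed above.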

What changes are included in this PR?

  • Revise split_part benchmarks to reduce redundancy and improve Utf8View coverage
  • Support Utf8View -> Utf8View in split_part
  • Refactor split_part to clean up some redundant code
  • Optimize split_part for single-character delimiters
  • Add SLT test coverage for split_part with Utf8View input

Are these changes tested?

Yes. New tests and benchmarks added.

Are there any user-facing changes?

No.

@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels on Mar 23, 2026
@neilconway
Contributor Author

split_part can be optimized further; probably scalar specialization would be a nice win. But I'd like to get this PR in first to make it easier to review.



Development

Successfully merging this pull request may close these issues:

  • Optimize split_part for single-character delimiters
  • split_part should preserve Utf8View input
