perf: use unsynchronized StringBuilderWriter in std.deepJoin #889
Merged
stephenamar-db merged 1 commit on Jun 3, 2026
Motivation: `std.deepJoin` writes each `Val.Str` chunk into a `java.io.StringWriter` inside a tight loop. `StringWriter`'s backing `StringBuffer` pays a monitor enter/exit on every `write`/`append` call, which on a typical deepJoin walk over a deeply nested array can be hundreds of thousands of synchronized writes. `TomlRenderer` and `FastMaterializeJsonRenderer` already use the unsynchronized package-private `StringBuilderWriter` for the same reason (see databricks#874, databricks#875). deepJoin was explicitly left as a follow-up in databricks#875's description ("std.deepJoin keeps StringWriter (separate concern)"); this is that follow-up.

Modification: Swap the single `new StringWriter()` in `DeepJoin.evalRhs` for `new StringBuilderWriter()`. No other changes; output is byte-identical.

Result: Scala Native hyperfine, A/B against master (`fc292fa6`). Workload: a 50000-row array of 10-string rows → 2 MB of deepJoin output, render-dominated. Four interleaved-order passes (`--warmup 10 --min-runs 100 --shell=none`):

| pass | order | before mean | after mean | before min | after min | min ratio |
|---|---|---:|---:|---:|---:|---:|
| 1 | before → after | 35.1 ± 16.5 ms | 32.2 ± 19.1 ms | 23.1 ms | 18.7 ms | 1.24x |
| 2 | after → before | 43.7 ± 30.6 ms | 29.9 ± 25.3 ms | 25.7 ms | 20.3 ms | 1.27x |
| 3 | before → after | 30.3 ± 8.5 ms | 29.5 ± 7.1 ms | 24.6 ms | 20.8 ms | 1.18x |
| 4 | after → before | 32.6 ± 7.6 ms | 28.0 ± 6.8 ms | 24.0 ms | 20.7 ms | 1.16x |

After is faster in every one of the 4 passes; min values are tight at 1.16-1.27x faster. Output byte-identical (2,000,000 bytes both sides).
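The swap described above can be sketched as follows. This is a minimal illustration, not the project's code: `SketchStringBuilderWriter` and `DeepJoinSketch` are hypothetical names, the walk is a simplified model of a deepJoin-style traversal over plain Scala values rather than `Val.Str`, and the real `StringBuilderWriter` in sjsonnet may be implemented differently. The point it shows is that a `Writer` backed by `java.lang.StringBuilder` avoids the per-call locking of `StringBuffer` while producing the same string.

```scala
import java.io.Writer

// Hypothetical stand-in for sjsonnet's package-private StringBuilderWriter;
// the real class may differ in name, fields, and overrides.
final class SketchStringBuilderWriter extends Writer {
  // java.lang.StringBuilder has no synchronized methods, unlike StringBuffer.
  private val sb = new java.lang.StringBuilder

  override def write(c: Int): Unit = { sb.append(c.toChar); () }
  override def write(cbuf: Array[Char], off: Int, len: Int): Unit = { sb.append(cbuf, off, len); () }
  override def write(str: String): Unit = { sb.append(str); () }
  override def write(str: String, off: Int, len: Int): Unit = { sb.append(str, off, off + len); () }
  override def flush(): Unit = ()
  override def close(): Unit = ()
  override def toString: String = sb.toString
}

object DeepJoinSketch {
  // Simplified model of a deepJoin-style walk: strings are written,
  // sequences are recursed into, anything else is rejected.
  private def walk(value: Any, out: Writer): Unit = value match {
    case s: String  => out.write(s)
    case xs: Seq[_] => xs.foreach(walk(_, out))
    case other      => throw new IllegalArgumentException(s"unsupported element: $other")
  }

  def join(value: Any): String = {
    val out = new SketchStringBuilderWriter // previously: new java.io.StringWriter()
    walk(value, out)
    out.toString
  }

  def main(args: Array[String]): Unit =
    println(join(Seq("a", Seq("b", Seq("c", "d")), "e"))) // prints abcde
}
```

Because only the writer's construction changes and every chunk is appended in the same order, the output stays byte-identical; the sketch is safe only when a single thread writes to the writer.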
Motivation

`std.deepJoin` writes each `Val.Str` chunk into a `java.io.StringWriter` inside a tight loop. `StringWriter`'s backing `StringBuffer` pays a monitor enter/exit on every `write`/`append` call, which on a typical deepJoin walk over a deeply nested array can be hundreds of thousands of synchronized writes, all of it wasted overhead in single-threaded jsonnet evaluation. `TomlRenderer` and `FastMaterializeJsonRenderer` already use the unsynchronized package-private `StringBuilderWriter` for the same reason (#874, #875). `std.deepJoin` was explicitly left as a follow-up in #875's description ("std.deepJoin keeps StringWriter (separate concern)"); this PR is that follow-up.

Modification

Single change in `ManifestModule.scala`: swap the `new StringWriter()` in `DeepJoin.evalRhs` for `new StringBuilderWriter()`. No other code changes; output is byte-identical.

Result

Scala Native, hyperfine A/B against `master` (`fc292fa6`). Workload: a 50,000-row array of 10 pre-allocated strings → 2 MB of `deepJoin` output, render-dominated. Four interleaved-order passes, `--warmup 10 --min-runs 100 --shell=none`:

| pass | order | before mean | after mean | before min | after min | min ratio |
|---|---|---:|---:|---:|---:|---:|
| 1 | before → after | 35.1 ± 16.5 ms | 32.2 ± 19.1 ms | 23.1 ms | 18.7 ms | 1.24× |
| 2 | after → before | 43.7 ± 30.6 ms | 29.9 ± 25.3 ms | 25.7 ms | 20.3 ms | 1.27× |
| 3 | before → after | 30.3 ± 8.5 ms | 29.5 ± 7.1 ms | 24.6 ms | 20.8 ms | 1.18× |
| 4 | after → before | 32.6 ± 7.6 ms | 28.0 ± 6.8 ms | 24.0 ms | 20.7 ms | 1.16× |

After is faster in every one of the 4 passes; mean is noisy on the host but min values are tight at 1.16–1.27× faster (best observed 18.7 vs 23.1 ms, ~19% reduction). Output byte-identical (2,000,000 bytes both sides).
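For readers who want to check the arithmetic, the min ratios and the ~19% figure follow directly from the reported per-pass minimums. The snippet below is only a worked calculation; `MinRatios` and its helper names are hypothetical and not part of the PR.

```scala
object MinRatios {
  // (before min ms, after min ms) for passes 1 to 4, copied from the reported results.
  val mins: Seq[(Double, Double)] =
    Seq((23.1, 18.7), (25.7, 20.3), (24.6, 20.8), (24.0, 20.7))

  // Speedup factor: how many times faster "after" is than "before".
  def ratio(before: Double, after: Double): Double = before / after

  // Relative reduction in wall time, as a percentage of "before".
  def reductionPct(before: Double, after: Double): Double =
    (1.0 - after / before) * 100.0

  def main(args: Array[String]): Unit =
    mins.zipWithIndex.foreach { case ((b, a), i) =>
      println(f"pass ${i + 1}: ${ratio(b, a)}%.2fx faster, ${reductionPct(b, a)}%.1f%% lower min")
    }
}
```

Pass 1 gives 23.1 / 18.7 ≈ 1.24 and a reduction of about 19%, matching the figures quoted above.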
Test plan
- `./mill __.reformat`
- `./mill 'sjsonnet.jvm[3.3.7]'.test`: 519/519 pass