perf: cache renderDouble for small integers#763
Draft
He-Pin wants to merge 1 commit intodatabricks:masterfrom
Draft
perf: cache renderDouble for small integers#763He-Pin wants to merge 1 commit intodatabricks:masterfrom
He-Pin wants to merge 1 commit intodatabricks:masterfrom
Conversation
12b870c to
4c8c5ab
Compare
3ae5a56 to
5e55449
Compare
Collaborator
|
I'm questioning the usefulness of this tweak (for numbers) - this seems quite neutral |
Contributor
Author
|
It just a bit optimization around realistic2 where many small nunbers. |
stephenamar-db
pushed a commit
that referenced
this pull request
Apr 30, 2026
Split out from #763. Motivation: Reduce allocation and dispatch overhead when one- and three-argument builtins are called through the dynamic `Val.Func.apply1` / `apply3` path. Key Design Decision: Keep the optimization local and semantics-preserving. `Builtin2` already has an exact-arity `apply2` override; this adds matching `Builtin1.apply1` and `Builtin3.apply3` overrides. Exact positional calls directly invoke the structured `evalRhs` overload and skip constructing an intermediate `Array`. Non-exact paths still fall back to the generic parent application path. Correctness: - The direct path matches the existing `Builtin1.apply` / `Builtin3.apply` exact positional behavior: force the supplied `Eval` values, then call the typed `evalRhs`. - Named arguments, missing defaults, too many arguments, and other non-exact calls still use the generic function application logic. - Static `Expr.ApplyBuiltin1` / `Expr.ApplyBuiltin3` paths are unchanged; this only helps dynamic builtin calls such as a builtin stored in a local or returned from another function. Modification: - Add `Builtin1.apply1`. - Add `Builtin3.apply3`. Validation: - `./mill --no-server 'sjsonnet.jvm[3.3.7].compile'` - `./mill --no-server 'sjsonnet.jvm[3.3.7].test'` (`141/141, SUCCESS`) - `./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'` - `./mill --no-server '_.jvm[_].__.test'` (`1104/1104, SUCCESS`) - Dynamic builtin smoke checks: - `local f = std.length; f([1, 2, 3])` -> `3` - `local f = std.substr; f("abcdef", 1, 3)` -> `"bcd"` - `local f = std.substr; f(str="abcdef", from=1, len=3)` -> `"bcd"` Hyperfine: Toolchain: - `hyperfine 1.20.0` - `--warmup 3 --min-runs 25` for targeted dynamic builtin benchmarks - `--warmup 3 --min-runs 20` for `realistic2` - JVM assemblies built with `./mill --no-server show 'sjsonnet.jvm[3.3.7].assembly'` - Base: `upstream/master` at `c04fc804` - Branch: `2067d8b5` Targeted `Builtin1` dynamic call benchmark: ```jsonnet local identity(x) = x; local len = identity(std.length); std.foldl( function(acc, i) acc + len("abcdef"), std.range(1, 5000000), 0 ) ``` | Command | Mean [ms] | Min [ms] | Max [ms] | Relative | |:---|---:|---:|---:|---:| | `master builtin1_dynamic` | 649.8 +/- 48.3 | 557.4 | 726.8 | 1.00 | | `branch builtin1_dynamic` | 661.7 +/- 41.0 | 606.1 | 747.3 | 1.02 +/- 0.10 | Result: statistically neutral in this hyperfine run. Targeted `Builtin3` dynamic call benchmark: ```jsonnet local identity(x) = x; local substr = identity(std.substr); std.foldl( function(acc, i) acc + std.length(substr("abcdef", 1, 3)), std.range(1, 3000000), 0 ) ``` | Command | Mean [ms] | Min [ms] | Max [ms] | Relative | |:---|---:|---:|---:|---:| | `master builtin3_dynamic` | 742.5 +/- 156.1 | 594.1 | 1254.7 | 1.12 +/- 0.30 | | `branch builtin3_dynamic` | 660.4 +/- 110.9 | 534.0 | 962.6 | 1.00 | Result: branch was faster in this run, but variance is high. End-to-end `realistic2`: | Command | Mean [ms] | Min [ms] | Max [ms] | Relative | |:---|---:|---:|---:|---:| | `master realistic2` | 544.9 +/- 95.3 | 414.1 | 706.5 | 1.27 +/- 0.27 | | `branch realistic2` | 428.4 +/- 54.1 | 378.3 | 565.8 | 1.00 | Result: branch was faster in this run; due JVM-startup and system noise, treat this as a non-regression signal rather than a guaranteed 1.27x speedup.
Motivation: Reduce allocation overhead in common numeric rendering paths. Modification: 1. RenderUtils.renderDouble reuses pre-cached string representations for exact integer doubles in the range 0-255. 2. Materializer.stringify delegates number stringification to RenderUtils.renderDouble, removing its duplicate integer fast path. Result: Numeric materialization uses the shared renderDouble fast path. The Builtin1.apply1 / Builtin3.apply3 specialization from the original PR is already present in current master via databricks#807, so it is no longer part of this PR diff.
5e55449 to
eb50f7f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation:
Reduce allocation overhead in common numeric rendering paths.
Key Design Decision:
Keep this PR focused on the
renderDoubleoptimization. The originalBuiltin1.apply1/Builtin3.apply3specialization has already landed in currentmastervia #807, so it is intentionally no longer part of this PR diff after the rebase.Modification:
Materializerdelegates number stringification toRenderUtils.renderDouble.RenderUtils.renderDoublereuses a small integer string cache for exact integer doubles in the range0to255.Benchmark Results:
JMH throughput, normalized from
ms/optoops/ms(ops/ms = 1 / ms_per_op; higher is better):large_string_joinlarge_string_templaterealistic2parseIntScala Native hyperfine against source-built jrsonnet (
git@github.com:CertainLach/jrsonnet.git, origin/master80cd36a; lower is better):realistic2parseIntAnalysis:
The JMH signal is strongly positive on numeric-heavy rendering workloads because exact small integer doubles avoid repeated string allocation through the shared
renderDoublepath. Scala Native is effectively neutral on these two hyperfine cases; theparseIntrun reported outliers and is dominated by process startup at this input size.Verification:
./mill --no-server bench.runRegressions bench/resources/cpp_suite/large_string_join.jsonnet bench/resources/cpp_suite/large_string_template.jsonnet bench/resources/cpp_suite/realistic2.jsonnet bench/resources/go_suite/parseInt.jsonnethyperfine --warmup 10 --min-runs 50 -N./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'./mill --no-server 'sjsonnet.jvm[3.3.7].test'References:
perf/cached-render-doublemasterat192bb3374b8008f8dd14e1c8f0724237c31da8e1eb50f7f1d831693fd1c14e5919bf467d66df5155jrsonnet 0.5.0-pre98Result:
Ready. The PR now contains only the cached
renderDoublework; the Builtin1/3 apply override work is already covered by #807 onmaster.