perf: pack format specs and avoid literal substrings by He-Pin · Pull Request #767 · databricks/sjsonnet

He-Pin · 2026-04-12T13:50:19Z

Motivation:

Speed up format-heavy Jsonnet programs while keeping the formatter JIT- and GC-friendly. The original version of this PR specialized only the all-simple named-string shape (%(x)s); the generic format path still allocated per-spec objects and substrings, and repeated object lookups for templates that reuse the same key.

Design:

Represent FormatSpec as a packed AnyVal over a Long; store hot-loop specs in Array[Long].
Keep labels outside the value class and avoid the labels array entirely for positional-only formats.
Store literal start/end offsets and append from the original format string, avoiding per-literal substring materialization.
Add a repeated single-label fast path for all-simple %(key)s templates: one object lookup, one stringification, then offset-based appends.
Cache valuesArr, valuesObj, labels, and specBits before the generic hot loop to reduce repeated type tests / field loads.
Keep the implementation portable: no JVM-only APIs in shared code; JS and Native run the same semantics and are covered by tests.
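
The packed-spec idea above can be sketched as follows. This is a minimal illustration, not the PR's code: the bit layout, the `pack` and `store` helpers, and the `Star` sentinel are all hypothetical names and choices made for this example. It only shows how a value class over a `Long` can carry conversion, flags, and signed width/precision, and how the raw bits can live in an `Array[Long]`.

```scala
object PackedSpecSketch {
  // Hypothetical bit layout (an assumption, not taken from the PR):
  //   bits  0..7   conversion character
  //   bits  8..15  flag bits
  //   bits 16..39  width     (24-bit, signed)
  //   bits 40..63  precision (24-bit, signed)
  final class FormatSpec(val bits: Long) extends AnyVal {
    def conversion: Char = (bits & 0xFFL).toChar
    def flags: Int = ((bits >>> 8) & 0xFFL).toInt
    // Shift the field to the top, then arithmetic-shift back to sign-extend it.
    def width: Int = ((bits << 24) >> 40).toInt
    def precision: Int = (bits >> 40).toInt
  }

  object FormatSpec {
    // Hypothetical negative sentinel standing in for a "*" width/precision.
    val Star: Int = -1

    def pack(conversion: Char, flags: Int, width: Int, precision: Int): FormatSpec =
      new FormatSpec(
        (conversion.toLong & 0xFFL) |
          ((flags.toLong & 0xFFL) << 8) |
          ((width.toLong & 0xFFFFFFL) << 16) |
          ((precision.toLong & 0xFFFFFFL) << 40)
      )
  }

  // Copy the raw bits into a primitive array for the hot loop.
  def store(specs: Seq[FormatSpec]): Array[Long] = {
    val out = new Array[Long](specs.length)
    var i = 0
    while (i < specs.length) { out(i) = specs(i).bits; i += 1 }
    out
  }
}
```

In a hot loop, a spec would be read back with `new PackedSpecSketch.FormatSpec(arr(i))`, which is intended to avoid a wrapper allocation because the class extends `AnyVal`.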

Modification:

Rebased onto current databricks:master (3a9a492899420456070fb84eaa5b89f8b7dfe1bf).
PR head after CI fix: 902c33ff (perf: pack format specs and avoid literal substrings).
Fixed the previous CI failure by removing the unused private specAt helper; Scala 2.12/2.13 CI treats that warning as fatal.
Packed conversion, flags, width, and precision into FormatSpec.bits; width/precision keep signed encoding for * behavior compatibility.
Added lazy label-array construction and adjacent repeated-label de-duplication during format scanning.
Added singleNamedLabel metadata for the all-simple same-key case.
Extracted shared simple %s stringification into simpleStringValue for the fast path cache.
Rejected two extra micro-optimizations after A/B testing: a scalar single-value shortcut polluted the generic loop, and an exact StringBuilder capacity for the single-label path regressed the key JMH cases (large_string_template 1.666 ms/op, realistic2 45.131 ms/op).
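
For readers unfamiliar with the single-label case, here is a rough sketch of how an all-simple, same-key template could be scanned once into literal offsets and then rendered with one lookup and one stringification. Everything here is an assumption made for illustration: `Compiled`, `compile`, and `render` are hypothetical names, values come from a plain `Map` rather than a Jsonnet object, and `%%` escapes are simply rejected instead of handled.

```scala
object SingleLabelSketch {
  // Hypothetical compiled form: literal i spans
  // [literalOffsets(2 * i), literalOffsets(2 * i + 1)) in `fmt`.
  // There is always one more literal than placeholders (literals may be empty).
  final class Compiled(val fmt: String, val literalOffsets: Array[Int], val singleLabel: String)

  // Accepts only templates made of literals and %(key)s with one distinct key.
  def compile(fmt: String): Option[Compiled] = {
    val offsets = Array.newBuilder[Int]
    var label: String = null
    var litStart = 0
    var i = 0
    while (i < fmt.length) {
      if (fmt.charAt(i) == '%') {
        if (i + 1 >= fmt.length || fmt.charAt(i + 1) != '(') return None
        val close = fmt.indexOf(')', i + 2)
        if (close < 0 || close + 1 >= fmt.length || fmt.charAt(close + 1) != 's') return None
        val key = fmt.substring(i + 2, close)
        if (label == null) label = key
        else if (label != key) return None
        offsets += litStart
        offsets += i
        i = close + 2
        litStart = i
      } else i += 1
    }
    if (label == null) None
    else {
      offsets += litStart
      offsets += fmt.length
      Some(new Compiled(fmt, offsets.result(), label))
    }
  }

  def render(c: Compiled, values: Map[String, Any]): String = {
    val value = values(c.singleLabel).toString // one lookup, one stringification
    val sb = new java.lang.StringBuilder()
    val offs = c.literalOffsets
    var i = 0
    while (i < offs.length) {
      // append(CharSequence, start, end) copies straight from the original
      // format string, so no per-literal substring is created.
      sb.append(c.fmt, offs(i), offs(i + 1))
      i += 2
      if (i < offs.length) sb.append(value)
    }
    sb.toString
  }
}
```

Templates that mix keys or use other conversions make `compile` return `None`, which in this sketch stands for falling back to the generic path.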

Correctness:

Root cause of the previous CI failure: the optimized formatter code left an unused private specAt helper, and cross-version CI treats that warning as fatal. The helper has been removed.
./mill "_.jvm[_].__.test": pass (2.12.21, 2.13.18, 3.3.7).
./mill "_.js[_].__.test": pass (2.13.18, 3.3.7).
./mill "_.wasm[_].__.test": pass (2.13.18, 3.3.7).
NO_COLOR=1 TERM=dumb ./mill "_.native[_].__.test": pass (2.13.18, 3.3.7).
./mill "_.jvm[_].__.checkFormat": pass (2.12.21, 2.13.18, 3.3.7).
git diff --check: pass.

Benchmark Setup:

Lower is better for all tables.
Master baseline: 3a9a492899420456070fb84eaa5b89f8b7dfe1bf.
Benchmark PR head: f70349f8. The current PR head is 902c33ff, which only removes an unused private helper (a removal required by fatal-warning CI) and does not change the benchmarked runtime path.
JMH fork VM: JDK 21.0.10 from Mill/Coursier (build.mill pins Mill JVM to Zulu 21). Project compile target is Java 17.
JVM args: --enable-native-access=ALL-UNNAMED -Xmx4G -XX:+UseG1GC -Xss100m.
Regression speed command: bench.runJmh sjsonnet.bench.RegressionBenchmark.main -p path=<all 36 paths> -wi 3 -i 5 -w 1s -r 1s -f 2 -jvmArgsAppend -Xss100m -rf json.
Regression allocation command: same 36 paths with -wi 2 -i 3 -w 1s -r 1s -f 1 -prof gc.
Other JMH command: bench.runJmh 'sjsonnet.bench.(MainBenchmark|ParserBenchmark|OptimizerBenchmark|MaterializerBenchmark|MultiThreadedBenchmark).*' -wi 3 -i 5 -w 1s -r 1s -f 2 ....

Speed Summary (RegressionBenchmark.main, 36 cases):

Geomean ratio PR/master: 0.9674 (-3.3%).
Cases faster by at least 3%: 14/36; cases slower by at least 3%: 0/36.
Largest relevant wins are the format-heavy cases and realistic configs; no broad speed regression signal.

Case	master ms/op	PR ms/op	delta
`large_string_template`	1.741 +/- 0.142	1.541 +/- 0.073	-11.5%
`realistic1`	1.456 +/- 0.076	1.181 +/- 0.035	-18.9%
`realistic2`	45.388 +/- 6.470	42.570 +/- 3.590	-6.2%
`bench.02`	27.486 +/- 0.771	26.149 +/- 0.232	-4.9%
`bench.03`	6.948 +/- 0.423	6.670 +/- 0.070	-4.0%
`bench.07`	2.911 +/- 0.075	2.505 +/- 0.343	-14.0%
`large_string_join`	0.575 +/- 0.036	0.551 +/- 0.017	-4.1%
`gen_big_object`	0.842 +/- 0.017	0.808 +/- 0.017	-4.1%
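
As a small aid for reading the delta column and the geomean line, the arithmetic can be sketched like this. The helper names are hypothetical, and the rows are the rounded values from the summary table above, so the geomean over these eight rows is not the 36-case figure quoted in the summary and individual deltas can differ from the table in the last digit.

```scala
object GeomeanSketch {
  // (case, master ms/op, PR ms/op), copied from the summary table above.
  val summaryRows: Seq[(String, Double, Double)] = Seq(
    ("large_string_template", 1.741, 1.541),
    ("realistic1", 1.456, 1.181),
    ("realistic2", 45.388, 42.570),
    ("bench.02", 27.486, 26.149),
    ("bench.03", 6.948, 6.670),
    ("bench.07", 2.911, 2.505),
    ("large_string_join", 0.575, 0.551),
    ("gen_big_object", 0.842, 0.808)
  )

  // Relative change of PR versus master, in percent (negative means faster).
  def deltaPercent(master: Double, pr: Double): Double =
    (pr / master - 1.0) * 100.0

  // Geometric mean of the PR/master ratios: exp of the mean of log ratios.
  def geomeanRatio(rows: Seq[(String, Double, Double)]): Double =
    math.exp(rows.map { case (_, master, pr) => math.log(pr / master) }.sum / rows.size)
}
```

A ratio below 1.0 means the PR is faster on average over the rows supplied.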

Full RegressionBenchmark speed table

Case	master ms/op	PR ms/op	delta
`assertions`	0.197 +/- 0.001	0.195 +/- 0.002	-1.0%
`base64`	0.149 +/- 0.008	0.141 +/- 0.001	-5.0%
`base64Decode`	0.117 +/- 0.005	0.114 +/- 0.000	-2.2%
`base64DecodeBytes`	5.351 +/- 0.379	5.108 +/- 0.049	-4.6%
`base64_byte_array`	0.764 +/- 0.016	0.754 +/- 0.012	-1.2%
`base64_stress`	0.177 +/- 0.004	0.175 +/- 0.003	-0.9%
`bench.01`	0.045 +/- 0.000	0.046 +/- 0.000	+0.8%
`bench.02`	27.486 +/- 0.771	26.149 +/- 0.232	-4.9%
`bench.03`	6.948 +/- 0.423	6.670 +/- 0.070	-4.0%
`bench.04`	0.111 +/- 0.007	0.108 +/- 0.000	-3.1%
`bench.06`	0.219 +/- 0.007	0.212 +/- 0.005	-3.2%
`bench.07`	2.911 +/- 0.075	2.505 +/- 0.343	-14.0%
`bench.08`	0.039 +/- 0.002	0.038 +/- 0.000	-2.8%
`bench.09`	0.043 +/- 0.002	0.042 +/- 0.001	-2.2%
`comparison`	0.029 +/- 0.001	0.029 +/- 0.000	+0.7%
`comparison2`	17.237 +/- 1.845	16.726 +/- 0.906	-3.0%
`escapeStringJson`	0.033 +/- 0.003	0.032 +/- 0.000	-2.8%
`foldl`	0.071 +/- 0.003	0.070 +/- 0.000	-1.1%
`gen_big_object`	0.842 +/- 0.017	0.808 +/- 0.017	-4.1%
`large_string_join`	0.575 +/- 0.036	0.551 +/- 0.017	-4.1%
`large_string_template`	1.741 +/- 0.142	1.541 +/- 0.073	-11.5%
`lstripChars`	0.114 +/- 0.001	0.113 +/- 0.002	-0.6%
`manifestJsonEx`	0.052 +/- 0.001	0.052 +/- 0.000	-0.0%
`manifestTomlEx`	0.068 +/- 0.001	0.068 +/- 0.000	+0.2%
`manifestYamlDoc`	0.055 +/- 0.001	0.055 +/- 0.000	+0.3%
`member`	0.649 +/- 0.018	0.631 +/- 0.008	-2.8%
`parseInt`	0.032 +/- 0.000	0.032 +/- 0.001	+2.1%
`realistic1`	1.456 +/- 0.076	1.181 +/- 0.035	-18.9%
`realistic2`	45.388 +/- 6.470	42.570 +/- 3.590	-6.2%
`reverse`	6.767 +/- 0.327	6.504 +/- 0.118	-3.9%
`rstripChars`	0.118 +/- 0.004	0.114 +/- 0.002	-3.1%
`setDiff`	0.400 +/- 0.010	0.394 +/- 0.006	-1.3%
`setInter`	0.351 +/- 0.006	0.349 +/- 0.005	-0.6%
`setUnion`	0.584 +/- 0.009	0.578 +/- 0.011	-1.0%
`stripChars`	0.115 +/- 0.002	0.113 +/- 0.003	-2.1%
`substr`	0.056 +/- 0.001	0.055 +/- 0.000	-1.8%

Allocation Summary (gc.alloc.rate.norm, 36 cases):

Geomean ratio PR/master: 0.9992 (-0.1%).
Cases with at least 1% lower allocation: 1/36; cases with at least 1% higher allocation: 2/36.
The intentional allocation win is large_string_template from avoiding literal substrings / repeated named-format work. Most other cases are effectively neutral; two tiny non-format cases moved by about +2 KB/op in this one GC run and had neutral speed.

Case	master B/op	PR B/op	delta
`large_string_template`	7,778,621 +/- 16,846	7,202,783 +/- 17,821	-7.4%
`realistic1`	5,878,186 +/- 13,612	5,878,173 +/- 14,998	-0.0%
`realistic2`	75,892,965 +/- 130,874	75,887,830 +/- 71,088	-0.0%
`bench.02`	81,687,212 +/- 49,373	81,686,806 +/- 38,399	-0.0%
`bench.03`	5,919,474 +/- 20,439	5,920,036 +/- 19,354	+0.0%
`bench.07`	2,972,992 +/- 19,208	2,972,420 +/- 20,839	-0.0%
`large_string_join`	1,533,821 +/- 26,527	1,533,766 +/- 23,545	-0.0%
`gen_big_object`	4,018,491 +/- 7,328	4,018,171 +/- 8,096	-0.0%

Full RegressionBenchmark allocation table

Case	master B/op	PR B/op	delta
`assertions`	640,830 +/- 26,731	646,134 +/- 15,212	+0.8%
`base64`	2,184,177 +/- 3,954	2,184,116 +/- 2,480	-0.0%
`base64Decode`	1,547,648 +/- 4,518	1,547,575 +/- 448	-0.0%
`base64DecodeBytes`	234,181 +/- 11,232	234,059 +/- 12,375	-0.1%
`base64_byte_array`	4,697,281 +/- 11,440	4,696,921 +/- 10,278	-0.0%
`base64_stress`	843,557 +/- 17,937	842,964 +/- 11,978	-0.1%
`bench.01`	80,209 +/- 1	82,281 +/- 5	+2.6%
`bench.02`	81,687,212 +/- 49,373	81,686,806 +/- 38,399	-0.0%
`bench.03`	5,919,474 +/- 20,439	5,920,036 +/- 19,354	+0.0%
`bench.04`	633,732 +/- 26	633,868 +/- 2,536	+0.0%
`bench.06`	879,780 +/- 19,827	880,095 +/- 27,178	+0.0%
`bench.07`	2,972,992 +/- 19,208	2,972,420 +/- 20,839	-0.0%
`bench.08`	81,969 +/- 0	81,961 +/- 1	-0.0%
`bench.09`	88,921 +/- 1	91,329 +/- 1	+2.7%
`comparison`	60,593 +/- 0	60,633 +/- 0	+0.1%
`comparison2`	37,673,424 +/- 55,346	37,673,112 +/- 49,976	-0.0%
`escapeStringJson`	77,305 +/- 2	77,249 +/- 4	-0.1%
`foldl`	350,659 +/- 2	350,675 +/- 1	+0.0%
`gen_big_object`	4,018,491 +/- 7,328	4,018,171 +/- 8,096	-0.0%
`large_string_join`	1,533,821 +/- 26,527	1,533,766 +/- 23,545	-0.0%
`large_string_template`	7,778,621 +/- 16,846	7,202,783 +/- 17,821	-7.4%
`lstripChars`	1,533,194 +/- 1,047	1,533,094 +/- 158	-0.0%
`manifestJsonEx`	132,281 +/- 8	132,073 +/- 3	-0.2%
`manifestTomlEx`	145,730 +/- 0	145,473 +/- 1	-0.2%
`manifestYamlDoc`	136,538 +/- 2	136,018 +/- 0	-0.4%
`member`	2,597,675 +/- 8,075	2,597,578 +/- 15,070	-0.0%
`parseInt`	68,153 +/- 1	68,241 +/- 0	+0.1%
`realistic1`	5,878,186 +/- 13,612	5,878,173 +/- 14,998	-0.0%
`realistic2`	75,892,965 +/- 130,874	75,887,830 +/- 71,088	-0.0%
`reverse`	11,839,431 +/- 18,292	11,838,844 +/- 17,997	-0.0%
`rstripChars`	1,532,847 +/- 14,967	1,532,429 +/- 118	-0.0%
`setDiff`	1,331,017 +/- 22,935	1,330,495 +/- 28,668	-0.0%
`setInter`	1,271,016 +/- 28,870	1,268,096 +/- 30,052	-0.2%
`setUnion`	1,512,384 +/- 15,885	1,511,431 +/- 19,625	-0.1%
`stripChars`	1,523,254 +/- 3,126	1,523,147 +/- 82	-0.0%
`substr`	507,044 +/- 1	507,140 +/- 0	+0.0%

Other JMH Results:

Benchmark	master ms/op	PR ms/op	delta
`MainBenchmark.main`	2.709 +/- 0.507	2.461 +/- 0.189	-9.2%
`OptimizerBenchmark.main`	0.506 +/- 0.018	0.509 +/- 0.006	+0.5%
`ParserBenchmark.main`	1.447 +/- 0.049	1.404 +/- 0.008	-2.9%

Known benchmark failures:

MaterializerBenchmark.* fails on both master and this PR with NoSuchElementException: None.get at MaterializerBenchmark.scala:43.
MultiThreadedBenchmark.main fails on both master and this PR with std.assertEqual / ExecutionException.
These failures are not introduced by this PR; successful non-Regression JMH entries are listed above.

Result:

The updated PR keeps the original format-heavy win, improves broader formatter hot-loop shape, and is JIT/GC friendly: primitive spec storage, indexed arrays, offset appends, no tuple/Option allocation in the hot loop, and a single lookup/stringification path for repeated same-key templates. Full JMH vs current master shows speed-positive or neutral behavior, with allocation improvement concentrated in the intended format-heavy template case.

He-Pin · 2026-04-12T14:10:58Z

Maybe bit operations would be better for tagging.

He-Pin · 2026-04-26T11:36:12Z

Reviewed. Bit-packed/tagged dispatch may be useful as a separate evaluator-wide optimization, but I am not folding it into this PR: this branch is intentionally scoped to the simple named %s format fast path. I also adjusted the fast path to use the same pos as the generic object lookup path, then reran JVM tests and JMH; the target JMH results remain positive.

He-Pin commented Apr 12, 2026

Comment thread sjsonnet/src/sjsonnet/Format.scala Outdated

He-Pin commented Apr 12, 2026

Comment thread sjsonnet/src/sjsonnet/Format.scala Outdated

He-Pin force-pushed the perf/format-fast-path branch 3 times, most recently from f0bb14f to 7fb2010 April 12, 2026 17:35

He-Pin marked this pull request as ready for review April 12, 2026 17:48

He-Pin mentioned this pull request Apr 12, 2026 in: performance optimization #666 (Open)

He-Pin marked this pull request as draft April 12, 2026 18:57

He-Pin force-pushed the perf/format-fast-path branch 3 times, most recently from cb41844 to 5dcc543 April 26, 2026 10:47

He-Pin closed this Apr 26, 2026

He-Pin reopened this Apr 26, 2026

He-Pin marked this pull request as ready for review April 26, 2026 11:19

He-Pin force-pushed the perf/format-fast-path branch from 5dcc543 to 3584eb1 April 26, 2026 11:35

He-Pin marked this pull request as draft April 28, 2026 18:58

He-Pin force-pushed the perf/format-fast-path branch from 3584eb1 to 5e24587 April 30, 2026 09:18

He-Pin changed the title ~~perf: fast path for simple named string format patterns~~ perf: pack format specs and avoid literal substrings Apr 30, 2026

He-Pin marked this pull request as ready for review April 30, 2026 09:44

He-Pin marked this pull request as draft April 30, 2026 10:16

He-Pin force-pushed the perf/format-fast-path branch from 5e24587 to f70349f April 30, 2026 11:18

perf: pack format specs and avoid literal substrings

902c33f

He-Pin force-pushed the perf/format-fast-path branch from f70349f to 902c33f April 30, 2026 11:34

He-Pin marked this pull request as ready for review April 30, 2026 15:06

He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/format-fast-path
