perf: pack format specs and avoid literal substrings #767

Open
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/format-fast-path

He-Pin commented Apr 12, 2026

Motivation:

Speed up format-heavy Jsonnet programs while keeping the formatter JIT- and GC-friendly. The original version of this PR specialized only the all-simple named-string shape (%(x)s); the generic format path still allocated per-spec objects and substrings, and repeated object lookups for templates that reuse the same key.

Design:

  • Represent FormatSpec as a packed AnyVal over a Long; store hot-loop specs in Array[Long].
  • Keep labels outside the value class and avoid the labels array entirely for positional-only formats.
  • Store literal start/end offsets and append from the original format string, avoiding per-literal substring materialization.
  • Add a repeated single-label fast path for all-simple %(key)s templates: one object lookup, one stringification, then offset-based appends.
  • Cache valuesArr, valuesObj, labels, and specBits before the generic hot loop to reduce repeated type tests / field loads.
  • Keep the implementation portable: no JVM-only APIs in shared code; JS and Native run the same semantics and are covered by tests.
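
As a rough illustration of the first design bullet, here is a minimal sketch of a spec packed into a Long behind a value class. The bit layout, field widths, and the `PackedSpecSketch` and `spec` names are assumptions made for this example and are not taken from the PR; only the `FormatSpec` and `bits` names come from the description above. Width and precision are stored as signed 24-bit fields so that a negative sentinel survives a round trip.

```scala
object PackedSpecSketch {
  // Hypothetical layout (an assumption, not the PR's actual encoding):
  //   bits  0..7   conversion character
  //   bits  8..15  flag bits
  //   bits 16..39  width, signed 24-bit
  //   bits 40..63  precision, signed 24-bit
  final val Mask24 = 0xFFFFFFL

  final class FormatSpec(val bits: Long) extends AnyVal {
    def conversion: Char = (bits & 0xFFL).toChar
    def flags: Int = ((bits >>> 8) & 0xFFL).toInt
    // Shift the field to the top of the Long, then shift back arithmetically
    // so the sign bit of the 24-bit field is extended.
    def width: Int = ((bits << 24) >> 40).toInt
    def precision: Int = (bits >> 40).toInt
  }

  object FormatSpec {
    def apply(conversion: Char, flags: Int, width: Int, precision: Int): FormatSpec =
      new FormatSpec(
        (conversion.toLong & 0xFFL) |
          ((flags.toLong & 0xFFL) << 8) |
          ((width.toLong & Mask24) << 16) |
          ((precision.toLong & Mask24) << 40)
      )
  }

  // Hot-loop storage as raw longs: an Array[FormatSpec] would box each element,
  // whereas an Array[Long] keeps the specs primitive.
  val specBits: Array[Long] =
    Array(FormatSpec('s', 0, -1, -1).bits, FormatSpec('d', 1, 8, 2).bits)

  // Re-wrap on read; the wrapper is erased to a Long in this position.
  def spec(i: Int): FormatSpec = new FormatSpec(specBits(i))
}
```

With this layout, `PackedSpecSketch.spec(1).width` reads back 8 and `PackedSpecSketch.spec(0).precision` reads back -1. The real encoding may split the bits differently.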

Modification:

  • Rebased onto current databricks:master (3a9a492899420456070fb84eaa5b89f8b7dfe1bf).
  • PR head after CI fix: 902c33ff (perf: pack format specs and avoid literal substrings).
  • Fixed the previous CI failure by removing the unused private specAt helper; Scala 2.12/2.13 CI treats that warning as fatal.
  • Packed conversion, flags, width, and precision into FormatSpec.bits; width/precision keep signed encoding for * behavior compatibility.
  • Added lazy label-array construction and adjacent repeated-label de-duplication during format scanning.
  • Added singleNamedLabel metadata for the all-simple same-key case.
  • Extracted shared simple %s stringification into simpleStringValue for the fast path cache.
  • Rejected two extra micro-optimizations after A/B: a scalar single-value shortcut polluted the generic loop, and exact StringBuilder capacity for the single-label path regressed the key JMH run (large_string_template 1.666 ms/op, realistic2 45.131 ms/op).
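
The single-label fast path and the offset-based appends can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in Format.scala: `SingleLabelTemplate`, `literalBounds`, and `render` are hypothetical names, and the template is assumed to have been scanned already into start/end offsets for its literal segments.

```scala
object OffsetAppendSketch {
  // Hypothetical parsed form of an all-simple `%(key)s` template that uses one key.
  // Literal segment i spans format[literalBounds(2*i), literalBounds(2*i + 1)),
  // and the value is emitted between consecutive literal segments.
  final case class SingleLabelTemplate(
      format: String,
      label: String,
      literalBounds: Array[Int]
  )

  def render(t: SingleLabelTemplate, lookup: String => String): String = {
    // One lookup and one stringification, reused for every occurrence of the key.
    val value = lookup(t.label)
    val sb = new java.lang.StringBuilder
    val n = t.literalBounds.length / 2
    var i = 0
    while (i < n) {
      if (i > 0) sb.append(value)
      // Append a range of the original string; no substring is materialized.
      sb.append(t.format, t.literalBounds(2 * i), t.literalBounds(2 * i + 1))
      i += 1
    }
    sb.toString
  }
}
```

For the template `a=%(x)s, b=%(x)s!` the literal offsets are (0, 2), (7, 11), and (16, 17), so rendering with `x` bound to `1` yields `a=1, b=1!` after a single lookup.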

Correctness:

  • Fixed the root cause of the previous CI failure: the optimized formatter code left an unused private specAt helper, and cross-version CI treats that warning as fatal.
  • ./mill "_.jvm[_].__.test": pass (2.12.21, 2.13.18, 3.3.7).
  • ./mill "_.js[_].__.test": pass (2.13.18, 3.3.7).
  • ./mill "_.wasm[_].__.test": pass (2.13.18, 3.3.7).
  • NO_COLOR=1 TERM=dumb ./mill "_.native[_].__.test": pass (2.13.18, 3.3.7).
  • ./mill "_.jvm[_].__.checkFormat": pass (2.12.21, 2.13.18, 3.3.7).
  • git diff --check: pass.

Benchmark Setup:

  • Lower is better for all tables.
  • Master baseline: 3a9a492899420456070fb84eaa5b89f8b7dfe1bf.
  • Benchmark PR head: f70349f8; current PR head is 902c33ff, which only removes an unused private helper required by fatal-warning CI and does not change the benchmarked runtime path.
  • JMH fork VM: JDK 21.0.10 from Mill/Coursier (build.mill pins Mill JVM to Zulu 21). Project compile target is Java 17.
  • JVM args: --enable-native-access=ALL-UNNAMED -Xmx4G -XX:+UseG1GC -Xss100m.
  • Regression speed command: bench.runJmh sjsonnet.bench.RegressionBenchmark.main -p path=<all 36 paths> -wi 3 -i 5 -w 1s -r 1s -f 2 -jvmArgsAppend -Xss100m -rf json.
  • Regression allocation command: same 36 paths with -wi 2 -i 3 -w 1s -r 1s -f 1 -prof gc.
  • Other JMH command: bench.runJmh 'sjsonnet.bench.(MainBenchmark|ParserBenchmark|OptimizerBenchmark|MaterializerBenchmark|MultiThreadedBenchmark).*' -wi 3 -i 5 -w 1s -r 1s -f 2 ....

Speed Summary (RegressionBenchmark.main, 36 cases):

  • Geomean ratio PR/master: 0.9674 (-3.3%).
  • Cases faster by at least 3%: 14/36; cases slower by at least 3%: 0/36.
  • Largest relevant wins are the format-heavy cases and realistic configs; no broad speed regression signal.

| Case | master ms/op | PR ms/op | delta |
| --- | --- | --- | --- |
| large_string_template | 1.741 +/- 0.142 | 1.541 +/- 0.073 | -11.5% |
| realistic1 | 1.456 +/- 0.076 | 1.181 +/- 0.035 | -18.9% |
| realistic2 | 45.388 +/- 6.470 | 42.570 +/- 3.590 | -6.2% |
| bench.02 | 27.486 +/- 0.771 | 26.149 +/- 0.232 | -4.9% |
| bench.03 | 6.948 +/- 0.423 | 6.670 +/- 0.070 | -4.0% |
| bench.07 | 2.911 +/- 0.075 | 2.505 +/- 0.343 | -14.0% |
| large_string_join | 0.575 +/- 0.036 | 0.551 +/- 0.017 | -4.1% |
| gen_big_object | 0.842 +/- 0.017 | 0.808 +/- 0.017 | -4.1% |

Full RegressionBenchmark speed table

| Case | master ms/op | PR ms/op | delta |
| --- | --- | --- | --- |
| assertions | 0.197 +/- 0.001 | 0.195 +/- 0.002 | -1.0% |
| base64 | 0.149 +/- 0.008 | 0.141 +/- 0.001 | -5.0% |
| base64Decode | 0.117 +/- 0.005 | 0.114 +/- 0.000 | -2.2% |
| base64DecodeBytes | 5.351 +/- 0.379 | 5.108 +/- 0.049 | -4.6% |
| base64_byte_array | 0.764 +/- 0.016 | 0.754 +/- 0.012 | -1.2% |
| base64_stress | 0.177 +/- 0.004 | 0.175 +/- 0.003 | -0.9% |
| bench.01 | 0.045 +/- 0.000 | 0.046 +/- 0.000 | +0.8% |
| bench.02 | 27.486 +/- 0.771 | 26.149 +/- 0.232 | -4.9% |
| bench.03 | 6.948 +/- 0.423 | 6.670 +/- 0.070 | -4.0% |
| bench.04 | 0.111 +/- 0.007 | 0.108 +/- 0.000 | -3.1% |
| bench.06 | 0.219 +/- 0.007 | 0.212 +/- 0.005 | -3.2% |
| bench.07 | 2.911 +/- 0.075 | 2.505 +/- 0.343 | -14.0% |
| bench.08 | 0.039 +/- 0.002 | 0.038 +/- 0.000 | -2.8% |
| bench.09 | 0.043 +/- 0.002 | 0.042 +/- 0.001 | -2.2% |
| comparison | 0.029 +/- 0.001 | 0.029 +/- 0.000 | +0.7% |
| comparison2 | 17.237 +/- 1.845 | 16.726 +/- 0.906 | -3.0% |
| escapeStringJson | 0.033 +/- 0.003 | 0.032 +/- 0.000 | -2.8% |
| foldl | 0.071 +/- 0.003 | 0.070 +/- 0.000 | -1.1% |
| gen_big_object | 0.842 +/- 0.017 | 0.808 +/- 0.017 | -4.1% |
| large_string_join | 0.575 +/- 0.036 | 0.551 +/- 0.017 | -4.1% |
| large_string_template | 1.741 +/- 0.142 | 1.541 +/- 0.073 | -11.5% |
| lstripChars | 0.114 +/- 0.001 | 0.113 +/- 0.002 | -0.6% |
| manifestJsonEx | 0.052 +/- 0.001 | 0.052 +/- 0.000 | -0.0% |
| manifestTomlEx | 0.068 +/- 0.001 | 0.068 +/- 0.000 | +0.2% |
| manifestYamlDoc | 0.055 +/- 0.001 | 0.055 +/- 0.000 | +0.3% |
| member | 0.649 +/- 0.018 | 0.631 +/- 0.008 | -2.8% |
| parseInt | 0.032 +/- 0.000 | 0.032 +/- 0.001 | +2.1% |
| realistic1 | 1.456 +/- 0.076 | 1.181 +/- 0.035 | -18.9% |
| realistic2 | 45.388 +/- 6.470 | 42.570 +/- 3.590 | -6.2% |
| reverse | 6.767 +/- 0.327 | 6.504 +/- 0.118 | -3.9% |
| rstripChars | 0.118 +/- 0.004 | 0.114 +/- 0.002 | -3.1% |
| setDiff | 0.400 +/- 0.010 | 0.394 +/- 0.006 | -1.3% |
| setInter | 0.351 +/- 0.006 | 0.349 +/- 0.005 | -0.6% |
| setUnion | 0.584 +/- 0.009 | 0.578 +/- 0.011 | -1.0% |
| stripChars | 0.115 +/- 0.002 | 0.113 +/- 0.003 | -2.1% |
| substr | 0.056 +/- 0.001 | 0.055 +/- 0.000 | -1.8% |
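
The delta column and the geomean ratio in the speed summary follow the usual definitions, which can be sketched as below. The PR does not show how its summary numbers were computed (they presumably use unrounded JMH scores), so `RatioSketch` and its helpers are illustrative names only, and results computed from the rounded table values may differ in the last digit.

```scala
object RatioSketch {
  // Percentage change of the PR score relative to master.
  // Negative means the PR is faster, since lower ms/op is better.
  def deltaPercent(master: Double, pr: Double): Double =
    (pr / master - 1.0) * 100.0

  // Geometric mean of the per-case PR/master ratios, computed in log space
  // so that very small and very large scores carry equal weight.
  def geomeanRatio(pairs: Seq[(Double, Double)]): Double =
    math.exp(pairs.map { case (master, pr) => math.log(pr / master) }.sum / pairs.size)
}
```

For example, `RatioSketch.deltaPercent(1.741, 1.541)` is about -11.5, matching the large_string_template row.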

Allocation Summary (gc.alloc.rate.norm, 36 cases):

  • Geomean ratio PR/master: 0.9992 (-0.1%).
  • Cases with at least 1% lower allocation: 1/36; cases with at least 1% higher allocation: 2/36.
  • The intentional allocation win is large_string_template from avoiding literal substrings / repeated named-format work. Most other cases are effectively neutral; two tiny non-format cases moved by about +2 KB/op in this one GC run and had neutral speed.

| Case | master B/op | PR B/op | delta |
| --- | --- | --- | --- |
| large_string_template | 7,778,621 +/- 16,846 | 7,202,783 +/- 17,821 | -7.4% |
| realistic1 | 5,878,186 +/- 13,612 | 5,878,173 +/- 14,998 | -0.0% |
| realistic2 | 75,892,965 +/- 130,874 | 75,887,830 +/- 71,088 | -0.0% |
| bench.02 | 81,687,212 +/- 49,373 | 81,686,806 +/- 38,399 | -0.0% |
| bench.03 | 5,919,474 +/- 20,439 | 5,920,036 +/- 19,354 | +0.0% |
| bench.07 | 2,972,992 +/- 19,208 | 2,972,420 +/- 20,839 | -0.0% |
| large_string_join | 1,533,821 +/- 26,527 | 1,533,766 +/- 23,545 | -0.0% |
| gen_big_object | 4,018,491 +/- 7,328 | 4,018,171 +/- 8,096 | -0.0% |

Full RegressionBenchmark allocation table

| Case | master B/op | PR B/op | delta |
| --- | --- | --- | --- |
| assertions | 640,830 +/- 26,731 | 646,134 +/- 15,212 | +0.8% |
| base64 | 2,184,177 +/- 3,954 | 2,184,116 +/- 2,480 | -0.0% |
| base64Decode | 1,547,648 +/- 4,518 | 1,547,575 +/- 448 | -0.0% |
| base64DecodeBytes | 234,181 +/- 11,232 | 234,059 +/- 12,375 | -0.1% |
| base64_byte_array | 4,697,281 +/- 11,440 | 4,696,921 +/- 10,278 | -0.0% |
| base64_stress | 843,557 +/- 17,937 | 842,964 +/- 11,978 | -0.1% |
| bench.01 | 80,209 +/- 1 | 82,281 +/- 5 | +2.6% |
| bench.02 | 81,687,212 +/- 49,373 | 81,686,806 +/- 38,399 | -0.0% |
| bench.03 | 5,919,474 +/- 20,439 | 5,920,036 +/- 19,354 | +0.0% |
| bench.04 | 633,732 +/- 26 | 633,868 +/- 2,536 | +0.0% |
| bench.06 | 879,780 +/- 19,827 | 880,095 +/- 27,178 | +0.0% |
| bench.07 | 2,972,992 +/- 19,208 | 2,972,420 +/- 20,839 | -0.0% |
| bench.08 | 81,969 +/- 0 | 81,961 +/- 1 | -0.0% |
| bench.09 | 88,921 +/- 1 | 91,329 +/- 1 | +2.7% |
| comparison | 60,593 +/- 0 | 60,633 +/- 0 | +0.1% |
| comparison2 | 37,673,424 +/- 55,346 | 37,673,112 +/- 49,976 | -0.0% |
| escapeStringJson | 77,305 +/- 2 | 77,249 +/- 4 | -0.1% |
| foldl | 350,659 +/- 2 | 350,675 +/- 1 | +0.0% |
| gen_big_object | 4,018,491 +/- 7,328 | 4,018,171 +/- 8,096 | -0.0% |
| large_string_join | 1,533,821 +/- 26,527 | 1,533,766 +/- 23,545 | -0.0% |
| large_string_template | 7,778,621 +/- 16,846 | 7,202,783 +/- 17,821 | -7.4% |
| lstripChars | 1,533,194 +/- 1,047 | 1,533,094 +/- 158 | -0.0% |
| manifestJsonEx | 132,281 +/- 8 | 132,073 +/- 3 | -0.2% |
| manifestTomlEx | 145,730 +/- 0 | 145,473 +/- 1 | -0.2% |
| manifestYamlDoc | 136,538 +/- 2 | 136,018 +/- 0 | -0.4% |
| member | 2,597,675 +/- 8,075 | 2,597,578 +/- 15,070 | -0.0% |
| parseInt | 68,153 +/- 1 | 68,241 +/- 0 | +0.1% |
| realistic1 | 5,878,186 +/- 13,612 | 5,878,173 +/- 14,998 | -0.0% |
| realistic2 | 75,892,965 +/- 130,874 | 75,887,830 +/- 71,088 | -0.0% |
| reverse | 11,839,431 +/- 18,292 | 11,838,844 +/- 17,997 | -0.0% |
| rstripChars | 1,532,847 +/- 14,967 | 1,532,429 +/- 118 | -0.0% |
| setDiff | 1,331,017 +/- 22,935 | 1,330,495 +/- 28,668 | -0.0% |
| setInter | 1,271,016 +/- 28,870 | 1,268,096 +/- 30,052 | -0.2% |
| setUnion | 1,512,384 +/- 15,885 | 1,511,431 +/- 19,625 | -0.1% |
| stripChars | 1,523,254 +/- 3,126 | 1,523,147 +/- 82 | -0.0% |
| substr | 507,044 +/- 1 | 507,140 +/- 0 | +0.0% |

Other JMH Results:

| Benchmark | master ms/op | PR ms/op | delta |
| --- | --- | --- | --- |
| MainBenchmark.main | 2.709 +/- 0.507 | 2.461 +/- 0.189 | -9.2% |
| OptimizerBenchmark.main | 0.506 +/- 0.018 | 0.509 +/- 0.006 | +0.5% |
| ParserBenchmark.main | 1.447 +/- 0.049 | 1.404 +/- 0.008 | -2.9% |

Known benchmark failures:

  • MaterializerBenchmark.* fails on both master and this PR with NoSuchElementException: None.get at MaterializerBenchmark.scala:43.
  • MultiThreadedBenchmark.main fails on both master and this PR with std.assertEqual / ExecutionException.
  • These failures are not introduced by this PR; successful non-Regression JMH entries are listed above.

Result:

The updated PR keeps the original format-heavy win and tightens the generic formatter hot loop in a JIT- and GC-friendly way: primitive spec storage, indexed arrays, offset appends, no tuple/Option allocation in the hot loop, and a single lookup/stringification path for repeated same-key templates. Full JMH runs against current master are speed-positive or neutral, with the allocation improvement concentrated in the intended format-heavy template case.

He-Pin commented Apr 12, 2026

Maybe bit operations would be better for tagging.

Comment thread sjsonnet/src/sjsonnet/Format.scala Outdated
Comment thread sjsonnet/src/sjsonnet/Format.scala Outdated
He-Pin force-pushed the perf/format-fast-path branch 3 times, most recently from f0bb14f to 7fb2010 April 12, 2026 17:35
He-Pin marked this pull request as ready for review April 12, 2026 17:48
He-Pin marked this pull request as draft April 12, 2026 18:57
He-Pin force-pushed the perf/format-fast-path branch 3 times, most recently from cb41844 to 5dcc543 April 26, 2026 10:47
He-Pin closed this Apr 26, 2026
He-Pin reopened this Apr 26, 2026
He-Pin marked this pull request as ready for review April 26, 2026 11:19
He-Pin force-pushed the perf/format-fast-path branch from 5dcc543 to 3584eb1 April 26, 2026 11:35

He-Pin commented Apr 26, 2026

Reviewed. Bit-packed/tagged dispatch may be useful as a separate evaluator-wide optimization, but I am not folding it into this PR: this branch is intentionally scoped to the simple named %s format fast path. I also adjusted the fast path to use the same pos as the generic object lookup path, then reran JVM tests and JMH; the target JMH results remain positive.

He-Pin marked this pull request as draft April 28, 2026 18:58
He-Pin force-pushed the perf/format-fast-path branch from 3584eb1 to 5e24587 April 30, 2026 09:18
He-Pin changed the title from "perf: fast path for simple named string format patterns" to "perf: pack format specs and avoid literal substrings" Apr 30, 2026
He-Pin marked this pull request as ready for review April 30, 2026 09:44
He-Pin marked this pull request as draft April 30, 2026 10:16
He-Pin force-pushed the perf/format-fast-path branch from 5e24587 to f70349f April 30, 2026 11:18
He-Pin force-pushed the perf/format-fast-path branch from f70349f to 902c33f April 30, 2026 11:34
He-Pin marked this pull request as ready for review April 30, 2026 15:06