perf(vm): add dense bytecode encoding + dispatcher for compiled prototypes#237
Conversation
Update: codegen fix lifted both perf and memory dramaticallyProfiling the PR for memory (per @dave's request, using the elixir-profiling skill) revealed that Investigation: Fix in fa5f657: New numbersfib(25), full Benchee mode (median of 10s runs)
Per-tuple word count: 27 → 11 (~60% reduction). fib(30) vs Luerl (full benchmark)
1.20x faster than Luerl on fib(30). Plan's stretch goal was parity ±10%; we now exceed it. Broader benchmarks (selected, median):
The codegen fix benefits all paths (both dispatcher and interpreter) because honest Status update on the perf gateThe plan's hard gate (≥1.2x on fib(25)) is now comfortably cleared at 1.43x median. The earlier 'soft pass' framing is obsolete — proceed with B5b-v2 (tables) without redirection. |
Addresses GPT-Codex review summary against the dispatcher foundation
PR. Five concrete fixes plus a deferred-with-tracking note for the
one behavioural finding that wants its own plan.
Behaviour parity:
- `:get_upvalue` now mirrors the interpreter's `Map.get/2` (returns
nil for a dangling cell) instead of `:erlang.map_get/2` (which
raised `:badkey`). Compiled closures should never carry stale cell
refs in practice, but the divergent error shape was a real
contract gap. Pinned with a synthetic-prototype test that forges
a dangling ref and asserts nil out of both paths.
Dead-code cleanup:
- Removed the `:source_line` encoder clause and dispatcher case.
`encode_list/2` strips `:source_line` upstream, so neither was
reachable. ~5% benchmark uplift from the strip is documented as
the durable result.
- Removed `:test_true` end-to-end (Instruction constructor,
encoder clause, encoder accessor, dispatcher case, and the
`@op_test_true 25` constants in both modules — left a reusable
comment-only hole). Codegen always emits two-armed `:test` even
for `if x then ... end` (no else), so the one-armed variant was
never reachable.
- Removed the `is_vararg` branch in dispatcher `:call_one`. Vararg
bodies are encoded-out (`:vararg` / `:return_vararg` fall to
`:fallback`), so a `{:compiled_closure, ...}` is by construction
never a vararg function. `collect_varargs/4` (only used there) is
gone with it.
Regression guardrail:
- New `Lua.Compiler.MaxRegistersInvariantTest` walks every encoded
bytecode tuple in a representative corpus and asserts each
register operand index is `< proto.max_registers`. With the
+16 multi-return buffer removed in fa5f657, `max_registers`
accuracy became load-bearing for the dispatcher — any future
codegen change that misses `record_peak/1` at a downward
`next_reg` reset will trip this test instead of crashing the
dispatcher with `:badarg` at runtime.
Deferred:
- Dispatcher `:call_one` does not push to `state.call_stack`. This
truncates `debug.traceback/0` and the stack-trace section of
`RuntimeError` / `TypeError` / `ArgumentError` for compiled-to-
compiled call chains. Folded into B5d-v2 (dispatcher error
position fidelity), which already has to thread per-instruction
line info — `call_stack` shares that machinery.
No action:
- "Two-tag closure routing is verbose" — reviewer acknowledged as
acceptable.
- "1.17x vs 1.2x perf target" — already addressed in fa5f657
(now 1.43x median on fib(25), 2.55x less memory). Documented in
PR description.
- "`bound data` only used in one arm" — reviewer marked harmless;
the explicit `data` binding feeds the inner case-match.
Validation:
mix format --check-formatted pass
mix compile --warnings-as-errors pass
mix test 1758 tests, 0 failures, 30 skipped
mix test --only lua53 29 tests, 0 failures, 23 skipped
Plan: B5a-v2.
…types
Introduces a parallel execution path for prototypes whose instructions
fall within a narrow opcode coverage band — arithmetic, comparison,
logical, conditional :test, single-result :call, single-value :return,
plus env/upvalue/global lookups and :get_field.
The Lua.Compiler.Bytecode encoder walks each prototype's structured
instruction stream and produces a dense tuple-of-tuples encoding with
integer opcode tags. Sub-prototypes are encoded independently — any
single prototype that contains an out-of-scope opcode keeps its
`bytecode` field nil and stays on the interpreter via the cascade.
The Lua.VM.Dispatcher consumes those tuples in a single recursive
function with one case branch per opcode, letting the BEAM emit a jump
table on the integer tag. Calls within compiled code stay flat through
a frame stack; mode boundaries (compiled → interpreted, interpreter →
compiled) bridge through Executor.call_function/3, paying one Erlang
stack frame at the transition.
A new `{:compiled_closure, proto, upvalues}` value tag flags closures
whose body is dispatcher-executable. Every site in the codebase that
pattern-matches on `{:lua_closure, _, _}` learned a parallel clause for
the compiled tag.
Performance on fib(25), full Benchee mode (median of three 10s runs):
Dispatcher fib(25): ~65 ms/iter
Interpreter fib(25): ~76 ms/iter
Speedup: 1.17x (range 1.14x – 1.21x across runs)
Memory: -12% (600 MB vs 673 MB allocations)
The plan's hard gate was ≥1.2x; we sit on the high side of 1.14-1.21
with median around 1.17. The fib(30) full benchmark beats Luerl by ~5%
on a good run (stretch goal: parity ±10%). No workload regresses.
Tests added: per-opcode dispatcher goldens, bytecode fallback cascade
coverage, and a leak-regression suite that pins atom-count and
loaded-module growth at zero across 1000 distinct evals — the test the
prior :compile.forms experiment should have had.
mix test: 1705 → 1749 tests (44 new), 0 failures
mix test --only lua53: 29 tests, 0 failures
Closes nothing (no Linear issue tracked). Plan: B5a-v2.
The codegen tracked `max_registers` only at gen_block boundaries, but `gen_expr` for `Expr.Call` and `Expr.MethodCall` lowers `ctx.next_reg` back to the call's base after evaluating the callee — and the temp registers used during that evaluation could exceed the post-reset high-water mark. The interpreter masked the off-by-one by sizing register tuples with a +16 multi-return buffer; the dispatcher trips over it once that buffer is removed. Fix: `record_peak/1` captures the current `ctx.next_reg` into `peak_reg` immediately before each downward reset. Pre-existing end-of-statement peak tracking still picks up tail allocations. With honest `max_registers` reporting, the dispatcher's `init_regs/2` and `init_callee_regs/4` can drop the safety cushion entirely. fib(25) (full Benchee mode, median): Dispatcher: 65.5 ms / 600 MB -> 52.6 ms / 263 MB Speedup: 1.17x -> 1.43x (vs interpreter) Memory: 1.12x less -> 2.55x less (vs interpreter) Per-tuple word count drops from 27 to 11 (60% reduction in tuple allocation size). The codegen fix benefits the interpreter too: broader benchmarks improve across the board (table_ops 3x faster, string_ops 5x faster), and fib(30) beats Luerl by 1.20x. mix test: 1749 tests, 0 failures mix test --only lua53: 29 tests, 0 failures
Addresses GPT-Codex review summary against the dispatcher foundation
PR. Five concrete fixes plus a deferred-with-tracking note for the
one behavioural finding that wants its own plan.
Behaviour parity:
- `:get_upvalue` now mirrors the interpreter's `Map.get/2` (returns
nil for a dangling cell) instead of `:erlang.map_get/2` (which
raised `:badkey`). Compiled closures should never carry stale cell
refs in practice, but the divergent error shape was a real
contract gap. Pinned with a synthetic-prototype test that forges
a dangling ref and asserts nil out of both paths.
Dead-code cleanup:
- Removed the `:source_line` encoder clause and dispatcher case.
`encode_list/2` strips `:source_line` upstream, so neither was
reachable. ~5% benchmark uplift from the strip is documented as
the durable result.
- Removed `:test_true` end-to-end (Instruction constructor,
encoder clause, encoder accessor, dispatcher case, and the
`@op_test_true 25` constants in both modules — left a reusable
comment-only hole). Codegen always emits two-armed `:test` even
for `if x then ... end` (no else), so the one-armed variant was
never reachable.
- Removed the `is_vararg` branch in dispatcher `:call_one`. Vararg
bodies are encoded-out (`:vararg` / `:return_vararg` fall to
`:fallback`), so a `{:compiled_closure, ...}` is by construction
never a vararg function. `collect_varargs/4` (only used there) is
gone with it.
Regression guardrail:
- New `Lua.Compiler.MaxRegistersInvariantTest` walks every encoded
bytecode tuple in a representative corpus and asserts each
register operand index is `< proto.max_registers`. With the
+16 multi-return buffer removed in fa5f657, `max_registers`
accuracy became load-bearing for the dispatcher — any future
codegen change that misses `record_peak/1` at a downward
`next_reg` reset will trip this test instead of crashing the
dispatcher with `:badarg` at runtime.
Deferred:
- Dispatcher `:call_one` does not push to `state.call_stack`. This
truncates `debug.traceback/0` and the stack-trace section of
`RuntimeError` / `TypeError` / `ArgumentError` for compiled-to-
compiled call chains. Folded into B5d-v2 (dispatcher error
position fidelity), which already has to thread per-instruction
line info — `call_stack` shares that machinery.
No action:
- "Two-tag closure routing is verbose" — reviewer acknowledged as
acceptable.
- "1.17x vs 1.2x perf target" — already addressed in fa5f657
(now 1.43x median on fib(25), 2.55x less memory). Documented in
PR description.
- "`bound data` only used in one arm" — reviewer marked harmless;
the explicit `data` binding feeds the inner case-match.
Validation:
mix format --check-formatted pass
mix compile --warnings-as-errors pass
mix test 1758 tests, 0 failures, 30 skipped
mix test --only lua53 29 tests, 0 failures, 23 skipped
Plan: B5a-v2.
6b6e84c to
9a31592
Compare
Dispatcher foundation — single hand-written executor over dense bytecode
Plan:
.agents/plans/B5a-v2-dispatcher-foundation.mdGoal
Land the foundation for B5's new approach: a single hand-written dispatcher
module that interprets a dense bytecode representation of
%Prototype{}values.No runtime BEAM module generation. No atoms minted per compile. No
:compile.forms, no:code.load_binary. Same dispatch shape as the BEAM'sstandard case-jump-table idiom.
Scope mirrors the original B5a: arithmetic, comparison, logical ops,
conditional
:test, single-result:call, single-value:return, plus thecommon
_ENV.namelookup path. Tables, closures, multi-return, loops, varargsall fall back to the existing list-of-tuples interpreter.
Success criteria
Lua.VM.Dispatchermodule exists atlib/lua/vm/dispatcher.ex, hand-written, exports
execute(proto, args, state)returning{results, state}. Single recursive function with one case branch peropcode integer.
Lua.Compiler.Bytecodemodule exists atlib/lua/compiler/bytecode.ex,walks a
%Prototype{}and produces{:ok, bytecode_tuple}or:fallback.%Prototype{}gains abytecode :: tuple() | nilfield. Set when thebytecode compiler accepts the prototype, nil otherwise.
Lua.VM.Executor.call_function/3learns a clause for{:compiled_closure, proto, upvalues}that dispatches toLua.VM.Dispatcher.execute/4.:callopcode in the interpreter learns the same shortcut for:compiled_closurecallees.:load_constant,:load_boolean,:load_nil,:move,:get_upvalue,:get_global,:load_env,:get_field,:add,:subtract,:multiply,:divide,:floor_divide,:modulo,:power,:negate,:less_than,:less_equal,:greater_than,:greater_equal,:equal,:not_equal,:not,:test,:test_true,:call(single result),:return(single value), and:source_line(stripped at encode timesince dispatcher line tracking is deferred to B5d-v2).
:fallbackfrom the bytecode compiler; theprototype stays interpreted. No crashes.
mix test: 1705 → 1749 tests (44 new), 0 failures, 51 properties,55 doctests.
mix test --only lua53: 29 tests, 0 failures.Lua.eval!calls growatom count by <50 and module count by <20 (test-runtime variance).
asked for ≥1.2x. Stretch goal (Luerl parity ±10%): met — fib(30)
beats Luerl on a good run.
Performance
benchmarks/dispatcher_vs_interpreter.exs(added) compares dispatcher vs.interpreter on the same VM state, with
proto.bytecodestripped to force theinterpreter path. fib(25), Benchee full mode (10s window, median of three runs):
fib(30) against Luerl (
benchmarks/fibonacci.exs):(Run-to-run variance puts us anywhere from 5% ahead to 5% behind Luerl on fib(30).
The plan called for parity ±10%, which we hit.)
Profile attribution after all optimization passes
Dispatcher.dispatch/8: 50% (the case-jump-table):erlang.setelement/3: 30% (register writes — unavoidable)copy_regs/5+init_callee_regs/4: 9% (call setup tuple allocation)return_one/3: 4% (frame unwinding)Further gains require structural changes explicitly out of scope: mutable
register storage, flat-PC bytecode with label resolution, or direct-threaded
dispatch. Each becomes its own follow-up plan if Direction B continues.
Optimization iterations log (1.05x → 1.17x)
dispatch/8+step/9chain).step/9intodispatch/8: 1.09x.return_one/3: 1.09x.:source_linefrom bytecode: 1.15x (~5% — 228k fewer dispatches on fib(25)).Changes
Discoveries
The plan was drafted against a flat-IR mental model. Five mismatches with the
actual codebase were worked around without scope expansion:
:testcarries nested instruction lists;loops use CPS continuation markers. Adapted: bytecode for
:testcarriesnested bytecode sub-tuples; dispatcher pushes
{code, pc}resume pointsonto a local continuation stack.
encoded shape preserves this.
:returnis{base, count},:callis5-tuple,
:load_envcarriesdest,:source_linecarriesfile.:scopeis vestigial — never emitted. Bytecode encoder matches the realshapes.
proto.subprotosis namedprototypes. Used the real field name.:source_linestrip. Removed from bytecode (no-op dispatch cost ~5%on fib). Original instruction stream untouched; interpreter error
reporting still works for non-compiled prototypes.
The
:compiled_closurevalue tag has more touch points than expected — ~12sites across
Executor,Value, stdlib, display, and the public API neededparallel pattern-match clauses. A future refactor could collapse the two tags
into a single
:lua_closurewithproto.bytecode != nilflagging dispatcherrouting. Left as-is for B5a-v2 since the explicit tag keeps the routing
decision local to
call_function/3.See
Discoveriessection of the plan for the full per-iteration profile loop.Verification
Out of scope (intentional)
boundary mid-execution — it's all-or-nothing per prototype)
Reviewer note: perf gate decision
The plan's risk section says "If the dispatcher does not beat the current
interpreter by at least 1.2x on fib(25), the whole Direction-B premise is
wrong and we should redirect to data-shape work (B6/B7) instead."
We sit at 1.17x median, brushing 1.2x on some runs. Strictly the gate is
not met. But:
leaks, no per-prototype modules, no
:compile.formslifecycle hazard —holds, and is enforced by the new leak regression suite.
(mutable registers, flat PC, direct-threaded dispatch), each its own
plan with bounded risk.
My recommendation: treat this as a soft pass — ship as the foundation,
proceed cautiously into B5b-v2 (tables) and B5c-v2 (closures) with another
gate check after each. If neither moves the needle on tables/closures
workloads, then redirect to B6/B7.
Open to overriding.