perf(vm): add dense bytecode encoding + dispatcher for compiled prototypes by davydog187 · Pull Request #237 · tv-labs/lua

davydog187 · 2026-05-23T20:07:37Z

Dispatcher foundation — single hand-written executor over dense bytecode

Plan: .agents/plans/B5a-v2-dispatcher-foundation.md

Goal

Land the foundation for B5's new approach: a single hand-written dispatcher
module that interprets a dense bytecode representation of %Prototype{} values.
No runtime BEAM module generation. No atoms minted per compile. No
:compile.forms, no :code.load_binary. Same dispatch shape as the BEAM's
standard case-jump-table idiom.
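
A minimal sketch of the dispatch shape described above, not the PR's actual module: the name `DispatchSketch`, the opcode numbering, and the three-opcode instruction set are all hypothetical. It assumes bytecode is a tuple of instruction tuples with an integer tag in position 0, and that registers live in an immutable tuple.

```elixir
defmodule DispatchSketch do
  # Hypothetical opcode numbering; the PR's real tags are not shown in this description.
  @op_load_const 0
  @op_add 1
  @op_return 2

  # Runs a tuple of instruction tuples against a register tuple.
  def run(code, regs), do: dispatch(code, 0, regs)

  # One recursive function, one case clause per opcode.
  defp dispatch(code, pc, regs) do
    case elem(code, pc) do
      {@op_load_const, dest, value} ->
        dispatch(code, pc + 1, put_elem(regs, dest, value))

      {@op_add, dest, a, b} ->
        sum = elem(regs, a) + elem(regs, b)
        dispatch(code, pc + 1, put_elem(regs, dest, sum))

      {@op_return, src} ->
        elem(regs, src)
    end
  end
end

# r0 = 40; r1 = 2; r2 = r0 + r1; return r2
code = {{0, 0, 40}, {0, 1, 2}, {1, 2, 0, 1}, {2, 2}}
IO.inspect(DispatchSketch.run(code, :erlang.make_tuple(3, nil)))
```

No module is generated and no atom is created at run time here: a new program is just a new tuple value handed to the same compiled function.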

Scope mirrors the original B5a: arithmetic, comparison, logical ops,
conditional :test, single-result :call, single-value :return, plus the
common _ENV.name lookup path. Tables, closures, multi-return, loops, varargs
all fall back to the existing list-of-tuples interpreter.
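
The all-or-nothing fallback can be sketched as an encoder that returns `nil` as soon as it meets an opcode outside its coverage. This is an illustration under assumptions: `EncodeSketch`, the tag table, and the `:new_table` opcode name used in the usage lines are hypothetical, and instructions are assumed to be flat tuples with an atom opcode first.

```elixir
defmodule EncodeSketch do
  # Hypothetical tag table covering a supported subset of opcodes.
  @tags %{load_const: 0, add: 1, return: 2}

  # Returns a dense tuple-of-tuples, or nil if any opcode is out of scope.
  def encode(instructions) do
    result =
      Enum.reduce_while(instructions, [], fn instr, acc ->
        [op | operands] = Tuple.to_list(instr)

        case Map.fetch(@tags, op) do
          {:ok, tag} -> {:cont, [List.to_tuple([tag | operands]) | acc]}
          :error -> {:halt, :fallback}
        end
      end)

    case result do
      :fallback -> nil
      encoded -> encoded |> Enum.reverse() |> List.to_tuple()
    end
  end
end

IO.inspect(EncodeSketch.encode([{:load_const, 0, 1}, {:return, 0}]))
IO.inspect(EncodeSketch.encode([{:load_const, 0, 1}, {:new_table, 1}]))
```

A `nil` result corresponds to a prototype that stays on the existing interpreter.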

Success criteria

Performance

benchmarks/dispatcher_vs_interpreter.exs (added) compares dispatcher vs.
interpreter on the same VM state, with proto.bytecode stripped to force the
interpreter path. fib(25), Benchee full mode (10s window, median of three runs):

| Path | Avg | Median | Memory |
|---|---|---|---|
| Dispatcher | ~65 ms | ~65 ms | 600 MB |
| Interpreter | ~76 ms | ~76 ms | 673 MB |
| Δ | 1.17x faster | 1.17x | 1.12x less |

fib(30) against Luerl (benchmarks/fibonacci.exs):

| Impl | Median (ms) |
|---|---|
| C Lua (luaport) | 28 |
| lua (chunk) | 748 |
| Luerl | 737 |

(Run-to-run variance puts us anywhere from 5% ahead to 5% behind Luerl on fib(30).
The plan called for parity ±10%, which we hit.)

Profile attribution after all optimization passes

- Dispatcher.dispatch/8: 50% (the case-jump-table)
- :erlang.setelement/3: 30% (register writes, unavoidable)
- copy_regs/5 + init_callee_regs/4: 9% (call setup tuple allocation)
- return_one/3: 4% (frame unwinding)

Further gains require structural changes explicitly out of scope: mutable
register storage, flat-PC bytecode with label resolution, or direct-threaded
dispatch. Each becomes its own follow-up plan if Direction B continues.

Optimization iterations log (1.05x → 1.17x)

- Initial baseline: 1.05x (two-level dispatch/8 + step/9 chain).
- Inlined step/9 into dispatch/8: 1.09x.
- Tuple frames + unboxed return_one/3: 1.09x.
- Stripped :source_line from bytecode: 1.15x (~5%; 228k fewer dispatches on fib(25)).
- Inlined int64-bounds guard + truthy check: 1.17x median.
- Tried open_upvalues empty-map elision: -3% regression, reverted.

Changes

 .agents/plans/B5a-v2-dispatcher-foundation.md |   (status: review, discoveries appended)
 benchmarks/dispatcher_vs_interpreter.exs      |  +56  (new)
 lib/lua.ex                                    |   ±14
 lib/lua/api.ex                                |   ±3
 lib/lua/compiler.ex                           |   ±13
 lib/lua/compiler/bytecode.ex                  | +213  (new — encoder)
 lib/lua/compiler/prototype.ex                 |   ±12
 lib/lua/util.ex                               |   ±1
 lib/lua/vm/dispatcher.ex                      | +652  (new — hot loop)
 lib/lua/vm/display.ex                         |   ±6
 lib/lua/vm/executor.ex                        | +147  (binop/cmp/get_field bridges, :compiled_closure clauses)
 lib/lua/vm/stdlib.ex                          |   ±15
 lib/lua/vm/stdlib/{debug,string,util}.ex      |   ±4
 lib/lua/vm/value.ex                           |   ±2
 test/lua/compiler/bytecode_test.exs           | +178  (new — 14 cascade tests)
 test/lua/vm/dispatcher_test.exs               | +330  (new — 27 opcode goldens)
 test/lua/vm/display_test.exs                  |   ±15
 test/lua/vm/leak_regression_test.exs          | +103  (new — 3 leak guards)

Discoveries

The plan was drafted against a flat-IR mental model. Five mismatches with the
actual codebase were worked around without scope expansion:

1. IR is structured, not flat. :test carries nested instruction lists;
   loops use CPS continuation markers. Adapted: bytecode for :test carries
   nested bytecode sub-tuples; dispatcher pushes {code, pc} resume points
   onto a local continuation stack.
2. No constants pool. Constants are inlined as opcode operands. The
   encoded shape preserves this.
3. Opcode signatures differ. :return is {base, count}, :call is a
   5-tuple, :load_env carries dest, :source_line carries file.
   :scope is vestigial and never emitted. Bytecode encoder matches the real
   shapes.
4. proto.subprotos is named prototypes. Used the real field name.
5. :source_line strip. Removed from bytecode (no-op dispatch cost ~5%
   on fib). Original instruction stream untouched; interpreter error
   reporting still works for non-compiled prototypes.
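
The first adaptation (nested bytecode for `:test` plus a stack of `{code, pc}` resume points) can be sketched as follows. This is a simplified illustration, not the PR's dispatcher: `NestedSketch`, the integer tags, and the four-element shape of the test instruction are assumptions made for the example.

```elixir
defmodule NestedSketch do
  # Hypothetical tags: 0 = load_const, 1 = test, 2 = return.
  def run(code, regs), do: step(code, 0, regs, [])

  # End of a nested body: pop the resume point and continue in the parent.
  defp step(code, pc, regs, [{parent, resume} | conts]) when pc >= tuple_size(code),
    do: step(parent, resume, regs, conts)

  defp step(code, pc, regs, conts) do
    case elem(code, pc) do
      {0, dest, value} ->
        step(code, pc + 1, put_elem(regs, dest, value), conts)

      # test: pick a nested sub-tuple, remember where to resume afterwards.
      {1, reg, then_code, else_code} ->
        body = if elem(regs, reg) in [nil, false], do: else_code, else: then_code
        step(body, 0, regs, [{code, pc + 1} | conts])

      {2, src} ->
        elem(regs, src)
    end
  end
end

# r0 = true; if r0 then r1 = :then else r1 = :else end; return r1
code = {{0, 0, true}, {1, 0, {{0, 1, :then}}, {{0, 1, :else}}}, {2, 1}}
IO.inspect(NestedSketch.run(code, {nil, nil}))
```

The continuation stack is an ordinary list argument, so entering a branch does not grow the Erlang call stack. The sketch does not handle top-level code that runs off the end without a return.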

The :compiled_closure value tag has more touch points than expected — ~12
sites across Executor, Value, stdlib, display, and the public API needed
parallel pattern-match clauses. A future refactor could collapse the two tags
into a single :lua_closure with proto.bytecode != nil flagging dispatcher
routing. Left as-is for B5a-v2 since the explicit tag keeps the routing
decision local to call_function/3.
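
To make the trade-off concrete, here is a hedged sketch of both routing styles. `RoutingSketch`, the two-argument arity, the `run_*` stand-ins, and the `:name` field are hypothetical; only the tuple shapes `{:lua_closure, proto, upvalues}` and `{:compiled_closure, proto, upvalues}` and the `bytecode` field come from this PR.

```elixir
defmodule RoutingSketch do
  # Two-tag routing, as the PR describes: the tag alone picks the path.
  def call_function({:compiled_closure, proto, upvalues}, args),
    do: run_dispatcher(proto, upvalues, args)

  def call_function({:lua_closure, proto, upvalues}, args),
    do: run_interpreter(proto, upvalues, args)

  # Single-tag alternative: route on whether proto.bytecode is present.
  def call_unified({:lua_closure, %{bytecode: nil} = proto, upvalues}, args),
    do: run_interpreter(proto, upvalues, args)

  def call_unified({:lua_closure, proto, upvalues}, args),
    do: run_dispatcher(proto, upvalues, args)

  # Stand-ins for the real execution paths; they only report which path ran.
  defp run_dispatcher(proto, _upvalues, args), do: {:dispatcher, proto.name, args}
  defp run_interpreter(proto, _upvalues, args), do: {:interpreter, proto.name, args}
end

IO.inspect(RoutingSketch.call_function({:compiled_closure, %{name: :fib, bytecode: {}}, %{}}, [25]))
IO.inspect(RoutingSketch.call_unified({:lua_closure, %{name: :fib, bytecode: nil}, %{}}, [25]))
```

With the unified tag, every other pattern-match site would keep a single clause, at the cost of moving the routing decision into a field check.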

See Discoveries section of the plan for the full per-iteration profile loop.

Verification

mix format
mix compile --warnings-as-errors                  ✓
mix test                                          ✓ 1749 tests, 0 failures
mix test --only lua53                             ✓ 29 tests, 0 failures
mix test test/lua/vm/dispatcher_test.exs          ✓ 27 tests
mix test test/lua/compiler/bytecode_test.exs      ✓ 14 tests
mix test test/lua/vm/leak_regression_test.exs     ✓ 3 tests

LUA_BENCH_MODE=full MIX_ENV=benchmark \
  mix run benchmarks/dispatcher_vs_interpreter.exs    # 1.17x median
MIX_ENV=benchmark mix run benchmarks/fibonacci.exs    # Luerl ±5%
MIX_ENV=benchmark mix run benchmarks/{oop,closures,table_ops,string_ops}.exs
# All within ±10% of pre-change numbers

Out of scope (intentional)

- Tables (B5b-v2)
- Closures, varargs, multi-return (B5c-v2)
- Error position fidelity in the dispatcher (B5d-v2)
- SSA / register promotion
- Direct-threaded dispatch
- Mutable register storage
- Mixed-mode mid-stream (one prototype crossing the dispatcher↔interpreter
  boundary mid-execution; it's all-or-nothing per prototype)

Reviewer note: perf gate decision

The plan's risk section says "If the dispatcher does not beat the current
interpreter by at least 1.2x on fib(25), the whole Direction-B premise is
wrong and we should redirect to data-shape work (B6/B7) instead."

We sit at 1.17x median, brushing 1.2x on some runs. Strictly the gate is
not met. But:

- The fib(30) full benchmark hits Luerl parity (the stretch goal).
- Memory is 12% lower.
- The architectural property the rewrite was meant to deliver (no atom
  leaks, no per-prototype modules, no :compile.forms lifecycle hazard)
  holds, and is enforced by the new leak regression suite.
- Further dispatcher gains are achievable via the deferred follow-ups
  (mutable registers, flat PC, direct-threaded dispatch), each its own
  plan with bounded risk.
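
For reviewers unfamiliar with how such a leak guard can work, the idea is to snapshot VM-wide counters before and after a batch of evaluations and require zero growth. The sketch below is an assumption about the approach, not the contents of leak_regression_test.exs: `LeakSketch` and the stand-in workloads are hypothetical, while `:erlang.system_info(:atom_count)` and `:code.all_loaded/0` are standard OTP calls.

```elixir
defmodule LeakSketch do
  # Snapshot of the two VM-wide counters the guard watches.
  def snapshot do
    %{atoms: :erlang.system_info(:atom_count), modules: length(:code.all_loaded())}
  end

  # Runs `fun` once per input and returns the growth in each counter.
  def growth(inputs, fun) do
    before = snapshot()
    Enum.each(inputs, fun)
    now = snapshot()
    %{atoms: now.atoms - before.atoms, modules: now.modules - before.modules}
  end
end

# Stand-in workload; a real guard would evaluate 1000 distinct Lua chunks here.
workload = fn n -> n * 2 end

# Warm up once so lazily loaded modules do not count as growth.
LeakSketch.growth(1..10, workload)
IO.inspect(LeakSketch.growth(1..1000, workload))
```

A workload that mints an atom per input (for example via `String.to_atom/1` on distinct strings) would show up as positive `atoms` growth, which is the failure this kind of test is meant to catch.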

My recommendation: treat this as a soft pass — ship as the foundation,
proceed cautiously into B5b-v2 (tables) and B5c-v2 (closures) with another
gate check after each. If neither moves the needle on tables/closures
workloads, then redirect to B6/B7.

Open to overriding.

davydog187 · 2026-05-23T20:54:48Z

Update: codegen fix lifted both perf and memory dramatically

Profiling the PR for memory (per @dave's request, using the elixir-profiling skill) revealed that :erlang.setelement/3 and :erlang.make_tuple/2 accounted for 95% of allocations, with each register-tuple allocation costing 27 words.

Investigation: Lua.Compiler.Codegen was under-counting max_registers for any function that uses temp registers above the call's base while evaluating the callee expression (e.g. string.upper(s) chains through two :get_field opcodes into r4 before resetting back to r2 for the call). The interpreter masked this by sizing register tuples with a +16 multi-return buffer.

Fix in fa5f657: record_peak/1 captures ctx.next_reg into peak_reg immediately before each downward reset in gen_expr. The dispatcher's init_regs/init_callee_regs then drop the safety cushion entirely.
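
A rough sketch of the high-water-mark idea behind the fix, under assumptions: `PeakSketch`, `new/1`, `alloc/1`, and `reset_to/2` are hypothetical helpers, and only the names `next_reg`, `peak_reg`, and `record_peak/1` come from the description above.

```elixir
defmodule PeakSketch do
  # Minimal stand-in for the codegen context; only the two counters matter here.
  defstruct next_reg: 0, peak_reg: 0

  def new(base), do: %__MODULE__{next_reg: base}

  # Claim one register and return its index with the updated context.
  def alloc(%__MODULE__{next_reg: r} = ctx), do: {r, %{ctx | next_reg: r + 1}}

  # Capture the current high-water mark before next_reg moves downward.
  def record_peak(%__MODULE__{} = ctx),
    do: %{ctx | peak_reg: max(ctx.peak_reg, ctx.next_reg)}

  # Reset to the call base, recording the peak first so temps are not forgotten.
  def reset_to(%__MODULE__{} = ctx, base), do: %{record_peak(ctx) | next_reg: base}
end

# Call base at r2; evaluating the callee touches r2, r3 and r4.
ctx = PeakSketch.new(2)
{2, ctx} = PeakSketch.alloc(ctx)
{3, ctx} = PeakSketch.alloc(ctx)
{4, ctx} = PeakSketch.alloc(ctx)
ctx = PeakSketch.reset_to(ctx, 2)
IO.inspect({ctx.next_reg, ctx.peak_reg})
```

Without the `record_peak/1` call inside the reset, `peak_reg` would never observe r3 and r4, which is the under-count the investigation describes.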

New numbers

fib(25), full Benchee mode (median of 10s runs)

| Path | Time | Memory |
|---|---|---|
| Dispatcher | 52.6 ms | 263 MB |
| Interpreter | 75.3 ms | 673 MB |
| Δ | 1.43x faster | 2.55x less mem |

Per-tuple word count: 27 → 11 (~60% reduction).
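
The 27 and 11 figures are consistent with a tuple costing one header word plus one word per element. As a small check, assuming (this is not stated in the PR) that the profiled function needs 10 registers and that the old cushion added 16 slots on top:

```elixir
# Assumption: 10 live registers, with the old +16 multi-return buffer on top.
padded = :erlang.make_tuple(10 + 16, nil)
exact = :erlang.make_tuple(10, nil)

# :erts_debug.flat_size/1 reports heap words: one header word plus one per element.
IO.inspect(:erts_debug.flat_size(padded)) # 27
IO.inspect(:erts_debug.flat_size(exact)) # 11
```

Because a register write on an immutable tuple generally produces a fresh copy, the per-tuple size is paid again on most writes, so shrinking the tuple shrinks total allocation roughly in proportion.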

fib(30) vs Luerl (full benchmark)

| Impl | Median (ms) |
|---|---|
| C Lua (luaport) | 27 |
| lua (chunk) | 601 |
| Luerl | 722 |

1.20x faster than Luerl on fib(30). Plan's stretch goal was parity ±10%; we now exceed it.

Broader benchmarks (selected, median):

| Workload | Before PR | After codegen fix | Δ |
|---|---|---|---|
| fib(30) | 748 ms | 601 ms | 1.24x |
| table_ops | 50.5 µs | 16.7 µs | 3.0x |
| string_ops | 187 µs | 37 µs | 5.0x |
| OOP | 140 µs | 140 µs | flat |
| closures | 483 µs | 528 µs | -3% (noise) |

The codegen fix benefits all paths (both dispatcher and interpreter) because honest max_registers reporting tightens register-tuple allocation everywhere.

Status update on the perf gate

The plan's hard gate (≥1.2x on fib(25)) is now comfortably cleared at 1.43x median. The earlier 'soft pass' framing is obsolete — proceed with B5b-v2 (tables) without redirection.

Addresses GPT-Codex review summary against the dispatcher foundation PR. Five concrete fixes plus a deferred-with-tracking note for the one behavioural finding that wants its own plan.

Behaviour parity:
- `:get_upvalue` now mirrors the interpreter's `Map.get/2` (returns nil for a dangling cell) instead of `:erlang.map_get/2` (which raised `:badkey`). Compiled closures should never carry stale cell refs in practice, but the divergent error shape was a real contract gap. Pinned with a synthetic-prototype test that forges a dangling ref and asserts nil out of both paths.

Dead-code cleanup:
- Removed the `:source_line` encoder clause and dispatcher case. `encode_list/2` strips `:source_line` upstream, so neither was reachable. ~5% benchmark uplift from the strip is documented as the durable result.
- Removed `:test_true` end-to-end (Instruction constructor, encoder clause, encoder accessor, dispatcher case, and the `@op_test_true 25` constants in both modules; left a reusable comment-only hole). Codegen always emits two-armed `:test` even for `if x then ... end` (no else), so the one-armed variant was never reachable.
- Removed the `is_vararg` branch in dispatcher `:call_one`. Vararg bodies are encoded-out (`:vararg` / `:return_vararg` fall to `:fallback`), so a `{:compiled_closure, ...}` is by construction never a vararg function. `collect_varargs/4` (only used there) is gone with it.

Regression guardrail:
- New `Lua.Compiler.MaxRegistersInvariantTest` walks every encoded bytecode tuple in a representative corpus and asserts each register operand index is `< proto.max_registers`. With the +16 multi-return buffer removed in fa5f657, `max_registers` accuracy became load-bearing for the dispatcher: any future codegen change that misses `record_peak/1` at a downward `next_reg` reset will trip this test instead of crashing the dispatcher with `:badarg` at runtime.

Deferred:
- Dispatcher `:call_one` does not push to `state.call_stack`. This truncates `debug.traceback/0` and the stack-trace section of `RuntimeError` / `TypeError` / `ArgumentError` for compiled-to-compiled call chains. Folded into B5d-v2 (dispatcher error position fidelity), which already has to thread per-instruction line info; `call_stack` shares that machinery.

No action:
- "Two-tag closure routing is verbose": reviewer acknowledged as acceptable.
- "1.17x vs 1.2x perf target": already addressed in fa5f657 (now 1.43x median on fib(25), 2.55x less memory). Documented in PR description.
- "`bound data` only used in one arm": reviewer marked harmless; the explicit `data` binding feeds the inner case-match.

Validation:
  mix format --check-formatted        pass
  mix compile --warnings-as-errors    pass
  mix test                            1758 tests, 0 failures, 30 skipped
  mix test --only lua53               29 tests, 0 failures, 23 skipped

Plan: B5a-v2.
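
The register-bounds guardrail described in this commit can be pictured as a walk over encoded instructions that flags any register operand at or above `max_registers`. The sketch below is illustrative only: `InvariantSketch`, the tag numbering, and the table of register positions are hypothetical, and nested bytecode sub-tuples are not traversed.

```elixir
defmodule InvariantSketch do
  # Hypothetical table: for each opcode tag, which operand positions hold registers.
  # 0 = load_const(dest, value), 1 = add(dest, a, b), 2 = return(src)
  @register_positions %{0 => [1], 1 => [1, 2, 3], 2 => [1]}

  # Returns every {instruction, register} pair that breaks reg < max_registers.
  def violations(code, max_registers) do
    for instr <- Tuple.to_list(code),
        pos <- Map.fetch!(@register_positions, elem(instr, 0)),
        elem(instr, pos) >= max_registers,
        do: {instr, elem(instr, pos)}
  end
end

code = {{0, 0, 40}, {1, 2, 0, 1}, {2, 2}}
IO.inspect(InvariantSketch.violations(code, 3))
IO.inspect(InvariantSketch.violations(code, 2))
```

An empty list means a register tuple of exactly `max_registers` slots is safe for that bytecode; a non-empty list points at the instruction that would otherwise fail at run time.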

…types

Introduces a parallel execution path for prototypes whose instructions fall within a narrow opcode coverage band: arithmetic, comparison, logical, conditional :test, single-result :call, single-value :return, plus env/upvalue/global lookups and :get_field.

The Lua.Compiler.Bytecode encoder walks each prototype's structured instruction stream and produces a dense tuple-of-tuples encoding with integer opcode tags. Sub-prototypes are encoded independently: any single prototype that contains an out-of-scope opcode keeps its `bytecode` field nil and stays on the interpreter via the cascade.

The Lua.VM.Dispatcher consumes those tuples in a single recursive function with one case branch per opcode, letting the BEAM emit a jump table on the integer tag. Calls within compiled code stay flat through a frame stack; mode boundaries (compiled → interpreted, interpreter → compiled) bridge through Executor.call_function/3, paying one Erlang stack frame at the transition.

A new `{:compiled_closure, proto, upvalues}` value tag flags closures whose body is dispatcher-executable. Every site in the codebase that pattern-matches on `{:lua_closure, _, _}` learned a parallel clause for the compiled tag.

Performance on fib(25), full Benchee mode (median of three 10s runs):

  Dispatcher fib(25):  ~65 ms/iter
  Interpreter fib(25): ~76 ms/iter
  Speedup:             1.17x (range 1.14x to 1.21x across runs)
  Memory:              -12% (600 MB vs 673 MB allocations)

The plan's hard gate was ≥1.2x; we sit on the high side of 1.14-1.21 with median around 1.17. The fib(30) full benchmark beats Luerl by ~5% on a good run (stretch goal: parity ±10%). No workload regresses.

Tests added: per-opcode dispatcher goldens, bytecode fallback cascade coverage, and a leak-regression suite that pins atom-count and loaded-module growth at zero across 1000 distinct evals, which is the test the prior :compile.forms experiment should have had.

mix test: 1705 → 1749 tests (44 new), 0 failures
mix test --only lua53: 29 tests, 0 failures

Closes nothing (no Linear issue tracked).

Plan: B5a-v2.

The codegen tracked `max_registers` only at gen_block boundaries, but `gen_expr` for `Expr.Call` and `Expr.MethodCall` lowers `ctx.next_reg` back to the call's base after evaluating the callee, and the temp registers used during that evaluation could exceed the post-reset high-water mark. The interpreter masked the off-by-one by sizing register tuples with a +16 multi-return buffer; the dispatcher trips over it once that buffer is removed.

Fix: `record_peak/1` captures the current `ctx.next_reg` into `peak_reg` immediately before each downward reset. Pre-existing end-of-statement peak tracking still picks up tail allocations.

With honest `max_registers` reporting, the dispatcher's `init_regs/2` and `init_callee_regs/4` can drop the safety cushion entirely.

fib(25) (full Benchee mode, median):

  Dispatcher: 65.5 ms / 600 MB -> 52.6 ms / 263 MB
  Speedup:    1.17x -> 1.43x (vs interpreter)
  Memory:     1.12x less -> 2.55x less (vs interpreter)

Per-tuple word count drops from 27 to 11 (60% reduction in tuple allocation size).

The codegen fix benefits the interpreter too: broader benchmarks improve across the board (table_ops 3x faster, string_ops 5x faster), and fib(30) beats Luerl by 1.20x.

mix test: 1749 tests, 0 failures
mix test --only lua53: 29 tests, 0 failures


davydog187 added a commit that referenced this pull request May 23, 2026

chore(B5a-v2): mark plan as review, record PR #237 and what-changed

1db8fb6

davydog187 mentioned this pull request May 23, 2026

chore(B5e-v2): plan to close memory gap with Luerl #238

Closed

davydog187 added 5 commits May 26, 2026 15:43

chore(B5a-v2): start plan

8402ee1

chore(B5a-v2): mark plan as review, record PR #237 and what-changed

15d5de7

davydog187 force-pushed the perf/dispatcher-foundation branch from 6b6e84c to 9a31592 Compare May 26, 2026 22:43

davydog187 merged commit 082593e into main May 26, 2026
5 checks passed

davydog187 deleted the perf/dispatcher-foundation branch May 26, 2026 23:06
