Merged
146 changes: 143 additions & 3 deletions .agents/plans/B5a-v2-dispatcher-foundation.md
@@ -2,10 +2,10 @@
id: B5a-v2
title: Dispatcher foundation — single hand-written executor over dense bytecode
issue: null
pr: null
pr: 237
branch: perf/dispatcher-foundation
base: main
status: ready
status: review
direction: B
unlocks:
- B5b-v2 (table opcodes), B5c-v2 (closures), B5d-v2 (error fidelity)
@@ -399,4 +399,144 @@ MIX_ENV=benchmark mix run benchmarks/string_ops.exs

## Discoveries

(Will be filled in during implementation.)
### IR shape diverges from plan

The plan was drafted against a mental model of a flat instruction stream
with absolute PC labels and a separate constants pool. The actual IR is
**structured**: `:test` carries nested instruction lists for then/else
branches, loops use CPS continuation markers (not PC jumps), and
constants are inlined directly into opcodes (no pool, no `k_idx`).

Adapted the design accordingly: the bytecode is a tuple of opcode tuples
where `:test` recursively carries nested bytecode sub-tuples. The
dispatcher pushes `{code, pc}` resume points onto a local continuation
stack when entering a branch body, mirroring the interpreter's pattern.
No PC label resolution machinery needed.
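
As a rough illustration of the structure described above, the sketch below dispatches over a tuple of opcode tuples in which `:test` carries nested sub-tuples, and pushes a `{code, pc}` resume point when it enters a branch body. The module name, the opcode set, and the operand layouts are assumptions made for this example; they are not the shapes used by the real dispatcher.

```elixir
defmodule DispatchSketch do
  # Hypothetical opcode shapes, for illustration only:
  #   {:load, dest, value}                  regs[dest] = value
  #   {:add, dest, a, b}                    regs[dest] = regs[a] + regs[b]
  #   {:test, reg, then_code, else_code}    nested bytecode tuples
  #   {:return, reg}                        return regs[reg]

  def run(code, regs), do: dispatch(code, 0, regs, [])

  # Ran off the end of a branch body: pop a resume point and continue there.
  defp dispatch(code, pc, regs, [{resume_code, resume_pc} | conts])
       when pc >= tuple_size(code),
       do: dispatch(resume_code, resume_pc, regs, conts)

  # Ran off the end of the top-level code without a :return.
  defp dispatch(code, pc, _regs, []) when pc >= tuple_size(code), do: nil

  defp dispatch(code, pc, regs, conts) do
    case elem(code, pc) do
      {:load, dest, value} ->
        dispatch(code, pc + 1, put_elem(regs, dest, value), conts)

      {:add, dest, a, b} ->
        sum = elem(regs, a) + elem(regs, b)
        dispatch(code, pc + 1, put_elem(regs, dest, sum), conts)

      {:test, reg, then_code, else_code} ->
        body = if elem(regs, reg), do: then_code, else: else_code
        # Push where to resume once the branch body is exhausted.
        dispatch(body, 0, regs, [{code, pc + 1} | conts])

      {:return, reg} ->
        elem(regs, reg)
    end
  end
end
```

Because each branch body is its own tuple starting at index 0, no label resolution is needed in this sketch; the cost is one list cell per branch entry.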

### Several plan opcode signatures were stale

- `:return` is `{:return, base, count}`, not `{:return_one, base}`.
- `:call` is a 5-tuple that includes `name_hint`, not a 3-tuple.
- `:load_env` carries `dest`, not zero operands.
- `:source_line` is `{:source_line, line, file}`, not just `{line}`.
- `:scope` is listed in coverage but never emitted by the current
codegen — it's vestigial in `Lua.Compiler.Instruction`.

The bytecode encoder matches the actual shapes. `:scope` was dropped
from coverage as a no-op.

### `proto.subprotos` field is named `prototypes`

The plan called it `subprotos` throughout. The actual struct field is
`prototypes`. Bytecode compilation walks `proto.prototypes` and stores
encoded children back in the same field.

### `:source_line` opcodes stripped from bytecode

Keeping them in the dense encoding cost one no-op dispatch per source
line, roughly 5% on fib(25), so they are stripped at encode time. Error
attribution for compiled prototypes is deferred to B5d-v2 anyway. The
instruction-stream `:source_line` entries (used by the interpreter for
error positions) are not affected: they survive untouched on the prototype.
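
A minimal sketch of encode-time stripping, assuming a list-based instruction stream in which `:test` carries nested lists. `EncodeSketch` and the `{:test, reg, then_body, else_body}` layout are hypothetical; only the `{:source_line, line, file}` and `{:return, base, count}` shapes come from the notes above.

```elixir
defmodule EncodeSketch do
  # Turns a (possibly nested) instruction list into a dense tuple,
  # dropping :source_line entries along the way. The input list is
  # not modified, so the original stream stays available.
  def encode(instructions) when is_list(instructions) do
    instructions
    |> Enum.reject(&match?({:source_line, _, _}, &1))
    |> Enum.map(&encode_op/1)
    |> List.to_tuple()
  end

  # Hypothetical nested-branch shape: recurse into both bodies.
  defp encode_op({:test, reg, then_body, else_body}),
    do: {:test, reg, encode(then_body), encode(else_body)}

  # Every other opcode is kept as-is in this sketch.
  defp encode_op(op), do: op
end
```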

### Perf gate is brushed, not robustly cleared

Final measurements on fib(25) (full Benchee mode, median of 10s runs):

- Dispatcher: ~65 ms/iter
- Interpreter (same VM, bytecode stripped): ~76 ms/iter
- **Speedup: 1.17x median** (range 1.14x – 1.21x across runs, ~1.5% deviation)

The plan's gate was ≥1.2x. We sit between 1.14 and 1.21, with the
median around 1.17. fib(30) full benchmark beats Luerl by ~5% on a good
run (stretch goal: parity ±10%). No workload regresses.

Why we didn't hit a clean 1.2x: the interpreter is already heavily
tuned (per-clause guards, inlined integer fast paths, dedicated
`{:return, _, 1}` fast clause). The dispatcher's wins — integer-tagged
case dispatch, tuple-encoded operands, stripped `:source_line` — are
real but bounded by the interpreter's existing optimisations.

Profile attribution after all optimization passes:

- `Dispatcher.dispatch/8`: 50% (the case-jump-table itself)
- `:erlang.setelement/3`: 30% (register writes — unavoidable)
- `copy_regs/5` + `init_callee_regs/4`: 9% (call setup tuple allocation)
- `return_one/3`: 4% (frame unwinding)

Further gains require structural changes explicitly out of scope:

- Mutable register storage (`:array`/process dict) would eliminate
`setelement/3` allocations entirely.
- Flat PC bytecode with label resolution would let `:test` skip the
continuation-stack push.
- Direct-threaded dispatch (computed-goto-equivalent) would replace
the case statement with token-driven jumps.

Each is its own follow-up plan.
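
For context on the `:erlang.setelement/3` entry in the profile: with registers held in a tuple, each register write yields a new tuple value. The sketch below shows this with `put_elem/3`, which the Elixir compiler inlines to `:erlang.setelement/3`. `RegisterSketch` is a hypothetical name, not a module in this PR.

```elixir
defmodule RegisterSketch do
  # A register write on an immutable tuple returns an updated tuple;
  # the tuple passed in keeps its old contents.
  def write(regs, index, value), do: put_elem(regs, index, value)
end

regs = {nil, nil, nil}
updated = RegisterSketch.write(regs, 1, 42)
IO.inspect({regs, updated})
```

This is why the mutable-storage option listed above is a structural change rather than a local tweak: it replaces the value-returning write with an in-place one.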

### Optimization iterations log

For reproducibility — the perf loop that got us from 1.05x to 1.17x:

1. **Initial baseline:** 1.05x (dispatch/8 + step/9 two-level chain).
2. **Inlined `step/9` into `dispatch/8`:** 1.09x (eliminated one call frame per opcode).
3. **Tuple frames + unboxed `return_one/3`:** 1.09x (skips `[v]` allocation on return).
4. **Stripped `:source_line` from bytecode:** 1.15x (~5% win — 228k dispatches saved on fib(25)).
5. **Inlined int64-bounds guard + truthy check:** 1.17x median (eliminated `Numeric.to_signed_int64` and `Value.truthy?` function calls in hot paths).
6. **Tried `open_upvalues` empty-map elision:** 3% regression, reverted.

### `:compiled_closure` plumbing has more touch points than expected

Every site in the codebase that pattern-matches on `{:lua_closure, _, _}`
needed a parallel clause for `{:compiled_closure, _, _}`:

- `Lua.VM.Executor.call_function/3`, `:call` opcode, `:closure` opcode, `invoke_metamethod`, `call_value`, `value_type`
- `Lua.VM.Value.type_name`, `to_string`
- `Lua.VM.Stdlib.lua_load`, `compile_loaded_chunk`
- `Lua.VM.Stdlib.Util.typeof`
- `Lua.VM.Stdlib.String` (gsub repl)
- `Lua.VM.Stdlib.Debug.getinfo`
- `Lua.VM.Display.wrap_value`, `wrap_closure`
- `Lua.Util.encoded?`
- `Lua.Api.is_lua_func` guard
- `Lua.do_call_function`

Tests that asserted on the specific `:lua_closure` tag (display tests,
unwrap doctest) had to learn that closures may now be either tag.

This was a real cost. A future refactor could collapse the two tags
into one (`{:lua_closure, proto, upvalues}` where `proto.bytecode != nil`
implies dispatcher routing) — but the explicit tag makes the routing
decision local to `call_function/3` and that's worth something.
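
To make the trade-off concrete, here is a hedged sketch of the two routing styles. `RoutingSketch` and its return atoms are hypothetical; plain maps stand in for the prototype struct, and only the tuple tags and the `bytecode` field name come from this PR.

```elixir
defmodule RoutingSketch do
  # Two-tag variant: the routing decision is visible from the tag alone.
  def by_tag({:compiled_closure, _proto, _upvalues}), do: :dispatcher
  def by_tag({:lua_closure, _proto, _upvalues}), do: :interpreter
  def by_tag(_other), do: :not_a_closure

  # Collapsed single-tag variant: every call site has to look inside
  # the prototype to learn which executor applies.
  def by_field({:lua_closure, %{bytecode: nil}, _upvalues}), do: :interpreter
  def by_field({:lua_closure, %{bytecode: _}, _upvalues}), do: :dispatcher
  def by_field(_other), do: :not_a_closure
end
```

The collapsed form removes the parallel clauses listed above, at the price of a map pattern match on every routing decision.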

### Tests added

- `test/lua/vm/dispatcher_test.exs` — 27 per-opcode goldens.
- `test/lua/compiler/bytecode_test.exs` — 14 fallback cascade tests.
- `test/lua/vm/leak_regression_test.exs` — 3 leak guards (atom count
growth, module load growth, bytecode-is-tuple shape).

Total: +44 tests, 1705 → 1749, 0 failures.

## What changed

- New: `lib/lua/compiler/bytecode.ex` (encoder),
`lib/lua/vm/dispatcher.ex` (hand-written executor),
`benchmarks/dispatcher_vs_interpreter.exs` (perf comparison harness),
`test/lua/compiler/bytecode_test.exs`,
`test/lua/vm/dispatcher_test.exs`,
`test/lua/vm/leak_regression_test.exs`.
- Modified: `lib/lua/compiler.ex` (wires bytecode encoder into compile
pipeline), `lib/lua/compiler/prototype.ex` (adds `bytecode` field),
`lib/lua/vm/executor.ex` (adds `:compiled_closure` clauses to
`call_function/3`, `:call` opcode, `:closure` opcode; adds
`dispatcher_*` bridge helpers for arithmetic/comparison/field access),
`lib/lua.ex`, `lib/lua/api.ex`, `lib/lua/util.ex`,
`lib/lua/vm/{display,value}.ex`,
`lib/lua/vm/stdlib/{debug,string,util}.ex`,
`lib/lua/vm/stdlib.ex` (all gain parallel `:compiled_closure` clauses).
- Tests: `test/lua/vm/display_test.exs` updated to accept either
closure tag.

PR: https://github.com/tv-labs/lua/pull/237
14 changes: 11 additions & 3 deletions .agents/plans/B5d-v2-dispatcher-errors.md
@@ -34,11 +34,19 @@ cheaper inside a single dispatcher than across a generated-module
boundary (the original B5e's plan) because no cross-module
unwinding is needed.

In addition to per-prototype line info, `:call_one` in the
dispatcher must push a `call_info` frame onto `state.call_stack`
(matching the interpreter's `executor.ex:777` semantics) and the
return path must pop it. Without this, `debug.traceback/0` from a
compiled callee and the stack-trace section of any
`RuntimeError` / `TypeError` / `ArgumentError` raised from a
compiled-to-compiled call chain will silently truncate to the
caller of `Dispatcher.execute/3-4`. The current B5a-v2 PR ships without
these frames, as a known gap, so the perf win can land first.
See [PR #237 review summary](https://github.com/tv-labs/lua/pull/237).
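
A minimal sketch of the push/pop discipline described above, assuming `call_stack` is a list held on the state. `CallStackSketch`, the plain map standing in for the VM state, and the frame fields are all hypothetical; the real `call_info` shape lives in the interpreter. The sketch covers only the normal return path, not unwinding after a raise.

```elixir
defmodule CallStackSketch do
  # Pushes a frame before running the callee and pops it on return.
  # `fun` receives the state with the frame pushed and must return
  # {result, state}.
  def with_frame(state, call_info, fun) do
    pushed = %{state | call_stack: [call_info | state.call_stack]}
    {result, after_call} = fun.(pushed)
    {result, %{after_call | call_stack: tl(after_call.call_stack)}}
  end
end
```

While the callee runs, anything that inspects `call_stack` (a traceback, for instance) sees the extra frame; after the return the stack is back to its previous depth.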

## Out of scope

- Stack-trace shape for compiled-to-compiled call chains. The
per-call shape from the interpreter survives — `call_function/3`
already carries position context.
- Source-map formats compatible with external debuggers. Not in
scope for this rewrite.

57 changes: 57 additions & 0 deletions benchmarks/dispatcher_vs_interpreter.exs
@@ -0,0 +1,57 @@
# Compares the dispatcher vs interpreter on the same fib(25) workload.
# Strips `proto.bytecode` to force the interpreter path on an otherwise-
# identical Lua VM state. Used by perf-gate verification for B5a-v2.

Code.require_file("helpers.exs", __DIR__)

fib_def = """
function fib(n)
if n < 2 then return n end
return fib(n-1) + fib(n-2)
end
"""

# Compile once, get a clean state with `fib` installed as a global.
lua_dispatcher = Lua.new() |> Lua.eval!(fib_def) |> elem(1)

strip_bytecode = fn walker, %Lua.Compiler.Prototype{} = p ->
  %{p | bytecode: nil, prototypes: Enum.map(p.prototypes, &walker.(walker, &1))}
end

# Strip bytecode from fib so the call routes through the interpreter.
strip_state = fn state ->
  case Lua.VM.State.get_global(state, "fib") do
    {:compiled_closure, proto, upvalues} ->
      stripped = strip_bytecode.(strip_bytecode, proto)
      Lua.VM.State.set_global(state, "fib", {:lua_closure, stripped, upvalues})

    {:lua_closure, proto, upvalues} ->
      stripped = strip_bytecode.(strip_bytecode, proto)
      Lua.VM.State.set_global(state, "fib", {:lua_closure, stripped, upvalues})
  end
end

lua_interpreter = %{lua_dispatcher | state: strip_state.(lua_dispatcher.state)}

IO.puts("\n--- closure tags ---")
{:compiled_closure, _, _} = Lua.VM.State.get_global(lua_dispatcher.state, "fib")
{:lua_closure, _, _} = Lua.VM.State.get_global(lua_interpreter.state, "fib")
IO.puts("dispatcher: :compiled_closure")
IO.puts("interpreter: :lua_closure")

# Correctness sanity check.
{[result_d], _} = Lua.eval!(lua_dispatcher, "return fib(20)")
{[result_i], _} = Lua.eval!(lua_interpreter, "return fib(20)")
IO.puts("\nfib(20) dispatcher=#{result_d} interpreter=#{result_i} match=#{result_d == result_i}\n")

call_fib = "return fib(25)"
{chunk_d, _} = Lua.load_chunk!(lua_dispatcher, call_fib)
{chunk_i, _} = Lua.load_chunk!(lua_interpreter, call_fib)

Benchee.run(
  %{
    "dispatcher fib(25)" => fn -> Lua.eval!(lua_dispatcher, chunk_d) end,
    "interpreter fib(25)" => fn -> Lua.eval!(lua_interpreter, chunk_i) end
  },
  Bench.opts()
)
14 changes: 12 additions & 2 deletions lib/lua.ex
@@ -9,6 +9,7 @@ defmodule Lua do
alias Lua.Util
alias Lua.VM.AssertionError
alias Lua.VM.Display
alias Lua.VM.Executor
alias Lua.VM.InternalError
alias Lua.VM.RuntimeError
alias Lua.VM.State
@@ -713,13 +714,22 @@ defmodule Lua do
end)

{results, _regs, new_state} =
Lua.VM.Executor.execute(proto.instructions, callee_regs, upvalues, proto, state)
Executor.execute(proto.instructions, callee_regs, upvalues, proto, state)

{:ok, results, new_state}
rescue
e -> {:error, Exception.message(e), state}
end

defp do_call_function({:compiled_closure, _, _} = closure, args, state) do
# Compiled callees route through the dispatcher; same observable
# contract as the interpreter branch above.
{results, new_state} = Executor.call_function(closure, args, state)
{:ok, results, new_state}
rescue
e -> {:error, Exception.message(e), state}
end

defp do_call_function(other, _args, state) do
{:error, "undefined function '#{inspect(other)}'", state}
end
@@ -757,7 +767,7 @@ defmodule Lua do
true

iex> {[c], _} = Lua.eval!(Lua.new(), "return function() end")
iex> match?({:lua_closure, _, _}, Lua.unwrap(c))
iex> match?({:lua_closure, _, _}, Lua.unwrap(c)) or match?({:compiled_closure, _, _}, Lua.unwrap(c))
true

iex> Lua.unwrap(42)
3 changes: 2 additions & 1 deletion lib/lua/api.ex
@@ -141,7 +141,8 @@ defmodule Lua.API do
Is the value a reference to a Lua function?
"""
defguard is_lua_func(value)
when is_tuple(value) and tuple_size(value) == 3 and elem(value, 0) == :lua_closure
when is_tuple(value) and tuple_size(value) == 3 and
(elem(value, 0) == :lua_closure or elem(value, 0) == :compiled_closure)

@doc """
Is the value a reference to an Erlang / Elixir function?
13 changes: 11 additions & 2 deletions lib/lua/compiler.ex
@@ -6,6 +6,7 @@ defmodule Lua.Compiler do
"""

alias Lua.AST.Chunk
alias Lua.Compiler.Bytecode
alias Lua.Compiler.Codegen
alias Lua.Compiler.Prototype
alias Lua.Compiler.Scope
@@ -16,11 +17,19 @@ defmodule Lua.Compiler do

@doc """
Compiles a Lua AST chunk into a prototype.

After codegen, the prototype is offered to `Lua.Compiler.Bytecode` for
dense encoding. Sub-prototypes are encoded independently — the dispatcher
takes over per-prototype wherever every opcode in that prototype falls
within its coverage; anything else stays on the interpreter. The
original instruction stream is preserved either way, so error reporting
and tooling continue to work unchanged.
"""
@spec compile(Chunk.t(), compile_opts()) :: {:ok, Prototype.t()} | {:error, term()}
def compile(%Chunk{} = chunk, opts \\ []) do
with {:ok, scope_state} <- Scope.resolve(chunk, opts) do
Codegen.generate(chunk, scope_state, opts)
with {:ok, scope_state} <- Scope.resolve(chunk, opts),
{:ok, prototype} <- Codegen.generate(chunk, scope_state, opts) do
{:ok, Bytecode.compile(prototype)}
end
end
