From 5705b7fc42eb4f0759179ae49dd9f7e42c772f6a Mon Sep 17 00:00:00 2001 From: Dave Lucia Date: Sat, 23 May 2026 14:26:25 -0700 Subject: [PATCH] chore(B5e-v2): add plan to close memory gap with Luerl MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Captures the follow-up identified during B5a-v2 review: dispatcher allocates 263 MB on fib(25) vs Luerl's 227 MB, with ~80% of the deficit attributable to :erlang.setelement/3 copying the register tuple on every opcode. Approach: replace immutable tuple regs with process-dict-backed mutable storage scoped to the dispatcher only. The interpreter and the bytecode format stay unchanged. Targets ≤130 MB on fib(25) (soft gate) and Luerl parity (stretch). Blocked on B5a-v2 (PR #237) landing. --- .../plans/B5e-v2-dispatcher-mutable-regs.md | 325 ++++++++++++++++++ 1 file changed, 325 insertions(+) create mode 100644 .agents/plans/B5e-v2-dispatcher-mutable-regs.md diff --git a/.agents/plans/B5e-v2-dispatcher-mutable-regs.md b/.agents/plans/B5e-v2-dispatcher-mutable-regs.md new file mode 100644 index 0000000..cfde0d7 --- /dev/null +++ b/.agents/plans/B5e-v2-dispatcher-mutable-regs.md @@ -0,0 +1,325 @@ +--- +id: B5e-v2 +title: Dispatcher mutable register storage — close the memory gap with Luerl +issue: null +pr: null +branch: perf/dispatcher-mutable-regs +base: main +status: blocked +direction: B +unlocks: + - parity with (or better than) Luerl on memory for compiled workloads + - removes ~80% of dispatcher allocations on fib(25) + - sets the stage for further dispatch-cycle wins (no tuple copy per opcode) +parent: B5-dispatcher-and-bytecode +--- + +## Blocked on + +- B5a-v2 (dispatcher foundation) — PR #237, in review. + +## Goal + +Replace the dispatcher's immutable register tuple with **mutable +process-dictionary-backed register storage**, eliminating the +`:erlang.setelement/3` allocation that currently accounts for ~80% of +dispatcher memory traffic. The dispatcher's hot path should read and +write registers without allocating a new tuple per opcode. + +Scope is **dispatcher-only**. The interpreter's tuple-based register +file is untouched — only prototypes that compile to bytecode benefit. +The interpreter remains the correctness reference; dispatcher diverges +only in *how* registers are stored, not in observable semantics. + +## Why now + +fib(25), full Benchee mode, after B5a-v2: + +| Path | Time | Memory | +|-------------|----------|----------| +| Dispatcher | 51.6 ms | 263 MB | +| Luerl | 64.5 ms | **227 MB** | +| Interpreter | 73.7 ms | 673 MB | + +We are 1.25x faster than Luerl on time but 1.16x heavier on memory. +The memory deficit traces to `:erlang.setelement/3` copying an 11-word +tuple on every register write. fib(22) executes ~600k register writes; +each copies 11 words, accounting for 16 MB of the 18 MB total dispatcher +allocations attributable to the register file. + +This is the largest single allocation source remaining in the +dispatcher, and the largest gap between us and Luerl. The plan's risk +section in B5a-v2 explicitly called out mutable register storage as +the follow-up for closing it. + +## Out of scope + +- Interpreter register file. Stays as tuple-of-tuples. Mutable storage + in the dispatcher is enough — out-of-scope opcodes still fall back to + the interpreter. +- NIF-backed mutable storage (`:atomics`, custom resource types). Too + heavy for one PR and incompatible with arbitrary Lua values + (`:atomics` is int64-only). +- ETS-backed registers. Cross-process visibility and table-creation + overhead per call wouldn't pay off. +- `:array` module. Same allocation profile as tuples — copy on write. +- Compile-time register lifetime analysis. Orthogonal — would help + *peak* register count, not per-write allocation. +- Register sharing between caller and callee (Luerl-style stack + splice). Larger structural change. + +## Success criteria + +- [ ] `Lua.VM.Dispatcher` stores its current-frame register file in + the **process dictionary** under a small fixed set of keys, + keyed by `{dispatcher_regs, reg_idx}` (or similar; the exact + key shape is a discovery during implementation). +- [ ] `:erlang.setelement/3` no longer appears in the dispatcher's + hot path. Verified via `mix profile.tprof --type memory` on + fib(22): combined `setelement` + `make_tuple` drop below 10% + of total dispatcher allocations. +- [ ] Call setup: callee register slots initialise via + `Process.put/2` (or a batched equivalent). Frame save on call + captures the *outgoing* register values into the dispatcher's + frame stack (a single map or list). Frame restore on return + writes them back. +- [ ] Closure capture (`:get_open_upvalue` / `:set_open_upvalue`) + reads the current process-dict register value when an upvalue + cell hasn't been allocated yet. Existing + `state.open_upvalues` semantics for created cells stay intact. +- [ ] All existing tests pass: `mix test` → 1749 tests + 51 properties + + 55 doctests, 0 failures. +- [ ] `mix test --only lua53` → 29 tests, 0 failures. +- [ ] Leak regression test still passes — process-dict keys are + cleared on dispatcher exit (`try/after` block). +- [ ] **Memory gate:** fib(25) dispatcher allocation drops by ≥50% + from current 263 MB → ≤130 MB. +- [ ] **Memory stretch:** fib(25) dispatcher allocation ≤ Luerl's + 227 MB. Parity with Luerl on memory. +- [ ] **Time:** fib(25) speedup vs interpreter improves to ≥1.5x + (currently 1.43x), or at minimum holds at 1.4x. No workload + regresses on time by more than 5%. +- [ ] **Concurrency safety:** every dispatcher invocation cleans + up its process-dict keys before returning, including the error + path (uncaught Lua exception bubbling out). A new test holds + this property: run 100 dispatcher invocations on the same + process, assert `Process.get_keys/0` size is unchanged before + and after. + +## Implementation notes + +### Storage shape + +The dispatcher needs: +- An array-like indexed slot store (registers). +- Cheap save/restore for the entire register file (on call frames). +- Cheap reset on error. + +The process dictionary offers `Process.put/2`, `Process.get/1`, +`Process.delete/1`. Each is O(1) hash-table access, allocation-free +for primitive values, allocation-equal-to-value for compound values +(no carrier tuple, unlike `:erlang.setelement/3`). + +Two layout options: + +**Option A — Flat key per slot:** + +```elixir +Process.put({:disp_reg, 0}, value) # write reg 0 +v = Process.get({:disp_reg, 5}) # read reg 5 +``` + +Each register slot is a separate process-dict key. Save/restore for a +frame requires reading N keys into a list, then writing them back. For +fib's 11-register file, that's 11 reads on call entry, 11 writes on +return. + +**Option B — Single carrier with `setelement`:** + +```elixir +regs = Process.get(:disp_regs) +regs = :erlang.setelement(N + 1, regs, value) +Process.put(:disp_regs, regs) +``` + +This still allocates the tuple — same problem we started with. Reject. + +**Option C — Tuple in process dict, mutate by replacing:** + +Same as B but only writes the carrier back on the final access of a +frame. Hard to know when "final" is. Reject. + +**Recommendation: Option A.** Per-slot keys. Each `setelement` becomes +`Process.put({:disp_reg, idx}, value)` — no carrier tuple at all. + +The complication: **frame save/restore** is no longer "save one tuple +pointer". On a `:call_one` we must snapshot all N caller registers into +the frame before resetting the slots for the callee. On return we +restore them. + +Mitigation: track `caller_regs_count` per call site in the bytecode +encoder. The encoder already knows `max_registers` at compile time, so +each `:call_one` can carry the exact number of slots to snapshot. + +Alternative mitigation: snapshot lazily — only save slots that the +callee actually writes. Too clever for v1; defer. + +### Frame save / restore + +The dispatcher's frame tuple (currently +`{code, pc, regs, upvalues, proto, cont, base, open_upvalues}`) +replaces `regs` with `saved_regs :: tuple()` — a one-time snapshot of +all active register slots at call time. On return we replay the snapshot +back into the process dict. + +```elixir +# On :call_one entering a compiled callee: +saved = snapshot_regs(caller_proto.max_registers) +Process.put({:disp_reg, 0}, arg0) +# ... copy args +frame = {code, pc + 1, saved, upvalues, proto, cont, base, open_upvalues} +# tail call into callee dispatch + +# On return: +restore_regs(saved) +Process.put({:disp_reg, base}, result) +dispatch(...) +``` + +`snapshot_regs/1` allocates one N-element tuple per call. That tuple +is the only per-call allocation. For fib with N=11, that's 11 words +per call frame — same as today's full register tuple. **The win is +that intra-body writes no longer allocate.** + +Net allocation count for fib(22): +- Today: 600k setelement × 11 words = 6.6M words (registers) + 57k call frames × 11 words = 0.6M words. Total: 7.2M words. +- After this PR: 57k call frames × 11 words = 0.6M words. **~92% reduction.** + +### Closure interaction — `:get_open_upvalue` / `:set_open_upvalue` + +These opcodes are currently `:fallback` in the bytecode encoder, so the +dispatcher doesn't handle them yet. **This PR does not change their +fallback status.** The interpreter handles them via its tuple-backed +register file as today. If a prototype touches open upvalues, it falls +back regardless of register storage. + +A future plan can extend dispatcher coverage to open upvalues, +which would require the dispatcher's `make_ref()`-cell creation logic +to read from the process-dict slot, then write the cell ref into +`state.open_upvalues`. Out of scope here. + +### Concurrency / reentry safety + +The process dict is shared across the calling Erlang process. If +`Lua.eval!` is called from a function the dispatcher invokes via +`:native_func` (an Elixir callback running `Lua.eval!` on the same +state), the nested dispatcher invocation would clobber the outer +frame's registers. + +Mitigation: at every dispatcher entry, save the current `{:disp_reg, *}` +key set (one tuple snapshot) and restore on exit: + +```elixir +def execute(proto, args, upvalues, state) do + saved = snapshot_all_disp_regs() + try do + do_execute_top(proto, args, upvalues, state) + after + restore_all_disp_regs(saved) + end +end +``` + +`snapshot_all_disp_regs/0` walks `Process.get_keys/0` filtering for +`{:disp_reg, _}` shape and reads each. One-time O(N) cost per +dispatcher entry. Acceptable because dispatcher entries are +order-of-magnitude rarer than per-opcode writes. + +Alternative: nested dispatcher invocations use a depth-prefixed key +(`{:disp_reg, depth, idx}`) with `depth` tracked in process dict. More +mechanism, no allocation savings — reject for v1. + +### Bytecode changes + +None. The bytecode tuple format is unchanged. Only `Lua.VM.Dispatcher` +changes internally. + +### Files + +- `lib/lua/vm/dispatcher.ex` — main rewrite. The `dispatch/8` case + arms change from `:erlang.setelement(dest + 1, regs, value)` and + `:erlang.element(src + 1, regs)` to `Process.put({:disp_reg, dest}, + value)` and `Process.get({:disp_reg, src})`. The `regs` parameter + is dropped from `dispatch/8` (down to `dispatch/7`). +- `test/lua/vm/dispatcher_test.exs` — existing per-opcode goldens + should pass unchanged. Add a reentry / process-dict-cleanup test. +- `test/lua/vm/leak_regression_test.exs` — extend with the + process-dict-key leak guard: assert `Process.get_keys/0` size is + unchanged across 1000 dispatcher invocations. + +### Verification + +```bash +mix format +mix compile --warnings-as-errors +mix test # 1749 tests pass +mix test --only lua53 # 29 tests pass + +# Memory gate +LUA_BENCH_MODE=full MIX_ENV=benchmark \ + mix run benchmarks/dispatcher_vs_interpreter.exs + +# Three-way memory comparison (custom script during dev) +# Dispatcher should be ≤227 MB on fib(25) + +# Smoke other workloads for regression +MIX_ENV=benchmark mix run benchmarks/fibonacci.exs +MIX_ENV=benchmark mix run benchmarks/{oop,closures,table_ops,string_ops}.exs + +# Memory attribution +MIX_ENV=benchmark mix profile.tprof --type memory -e '...' +# Confirm setelement + make_tuple combined < 10% of dispatcher allocations +``` + +## Risks + +- **Process dict throughput is not free.** Each `Process.put/2` does a + hash lookup and insert into the process's internal dict. The BEAM + implements this as an open-addressed hash table; for small dicts + (~16 entries) it's effectively constant-time, but it's not as fast + as `setelement` on a hot pre-existing tuple. **First measurement + could show time regression** even with memory wins. If time drops + below 1.3x vs interpreter on fib(25), this plan is the wrong bet + and we should revisit (likely back to tuple-based with smarter + in-place update hints, or accept the memory deficit). + +- **Reentry edge cases.** If a `:native_func` called from compiled + code recursively calls `Lua.eval!` (or any code path that re-enters + the dispatcher), the snapshot/restore at dispatcher entry must + cover it. A pcall/error-during-snapshot-restore could leak keys. + Mitigated by `try/after` at every entry point, but the test + surface for this is non-trivial. + +- **Process dict size growth on long-running processes.** If + snapshot/restore has a bug that leaves keys behind, the dict grows + unboundedly. Leak regression test guards this; the property test + in B5a-v2 already runs 1000 distinct evals — extend it to assert + `Process.get_keys/0` size is stable. + +- **No win for short-running workloads.** Programs that compile to + 10 bytecode opcodes and execute once won't see any meaningful + memory difference — the per-call snapshot now costs *more* + per call than the old per-write `setelement`. The break-even is + around 5-10 opcodes per call. Long workloads (fib, table ops, + string parsing) benefit; one-shot evals slightly regress on memory. + Acceptable tradeoff. + +- **Compilers / static analyzers.** Some Elixir style tools complain + about process-dict usage. Add a `# credo:disable-for-this-file` + or equivalent comment with rationale. The dispatcher is one of the + few places in the codebase where process-dict is the right tool — + this is a deliberate exception, not unidiomatic code. + +## Discoveries + +(Will be filled in during implementation.)