Refactor: clean up callable glue layer (4 commits) by poursoul · Pull Request #768 · hw-native-sys/simpler

poursoul · 2026-05-13T07:23:34Z

Four self-contained refactors of the ChipCallable prepare/run plumbing. No
user-visible behavior change; all four are pure cleanup of duplicated code
or sidecar state.

Summary

Extract upload_and_collect_child_addrs helper. The kernel-upload
prologue of prepare_callable_impl was byte-identical between
host_build_graph and tensormap_and_ringbuffer. Move it to
src/common/task_interface/prepare_callable_common.h; the helper takes
the upload function pointer directly so it has no Runtime dependency.
Extract chip_callable_layout helpers.
upload_chip_callable_buffer in the onboard and sim DeviceRunners
shared the layout math (storage_used, total_size) and FNV-1a dedup
hash. Move them to chip_callable_layout.h. Onboard additionally
reuses the patch_chip_callable_scratch_for_device helper that
rewrites each child's resolved_addr_ to a device offset; sim keeps
its own dlopen+register_hooks loop.
Merge _chip_process_loop and the bootstrap variant. Both had
identical TASK_READY / CONTROL_REQUEST / SHUTDOWN state machines.
Extract _run_chip_main_loop and inject the bootstrap-only
store_to_host flush via an on_task_done_success hook. The bootstrap
path was also reaching into cw._impl.malloc / _impl.copy_to where
the non-bootstrap path used the public cw.* methods — both now use
the public path (thin int-cast forwarders, so unchanged behavior).
Replace Runtime::pending_* sidecar with PreparedCallableArtifacts.
prepare_callable_impl used to park host_dlopen_handle,
host_orch_func_ptr, orch_so_data, and orch_so_size on the
Runtime because its extern C signature had only one Runtime*
slot. Replace with an out-parameter struct. On run_prepared,
bind_prepared_callable_to_runtime now returns the hbg
host_orch_func_ptr via an out-pointer rather than restoring it onto
Runtime::pending_host_orch_func_ptr_; bind_prepared_to_runtime_impl
gains a void *host_orch_func_ptr arg (trb asserts it must be null).
All four pending_* fields drop from both a2a3 and a5 runtime.h,
and the dead cid < 0 fallback in prepare_orch_so is replaced with
a LOG_ERROR (the prepared-callable flow is the only supported path
since Feat: prepared callable — register + run(cid) on a unified ABI #710).

a5 caught up to the a2a3 helper-using shape in PR4 so the two stay in
sync.

Testing

All 8 variants build cleanly (a2a3sim, a2a3, a5sim, a5 x
host_build_graph, tensormap_and_ringbuffer)
pytest tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable --platform a2a3sim — 6/6 pass (dlopen_count monotonicity + dedup
hits + same-cid replay)
pytest tests/st/a2a3/tensormap_and_ringbuffer --platform a2a3sim --device 0-1 — L2 trb 23/23 + L3 dependency + L3 group all pass
pytest tests/st/a2a3/host_build_graph --platform a2a3sim --device 0-1 — 9/9 pass
pytest tests/ut/py/test_worker/test_ensure_prepared.py — 4/4
pass

The kernel-upload prologue of prepare_callable_impl was byte-for-byte identical in host_build_graph and tensormap_and_ringbuffer: upload the ChipCallable buffer via HostApi, then walk child_offsets to compute each child's device address and write it to func_id_to_addr_[]. Move that arithmetic into src/common/task_interface/prepare_callable_common.h as upload_and_collect_child_addrs(). The helper takes the upload function pointer directly (HostApi types diverge per-runtime) and returns a vector of {func_id, device_addr}, leaving RUNTIME_MAX_FUNC_ID range validation in the caller where the constant lives. No behavior change. Both runtimes still hit the same DeviceRunner upload path and write the same addresses to func_id_to_addr_[].

upload_chip_callable_buffer in the onboard and sim DeviceRunners both opened with byte-for-byte identical math: walk child_offsets to compute storage_used / total_size, then FNV-1a hash the buffer for dedup. The onboard variant additionally rewrites each child's resolved_addr_ in a host scratch to a device offset before rtMemcpy; sim writes dlopen'd function pointers instead, so that step stays platform-specific. Move the shared pieces to src/common/task_interface/chip_callable_layout.h: - compute_chip_callable_layout(callable) -> {header_size, total_size, content_hash}, mirroring make_callable<>()'s layout. - patch_chip_callable_scratch_for_device(callable, layout, target_base, scratch), the onboard device-offset rewrite. Onboard now calls both helpers; sim calls compute_chip_callable_layout and keeps its own dlopen+register_hooks loop. The fnv1a_64.h direct include drops out of both device_runner.cpp files. No behavior change.

…in loop _chip_process_loop and _chip_process_loop_with_bootstrap had identical TASK_READY / CONTROL_REQUEST / SHUTDOWN state machines: same mailbox decoding, same per-cid prepared set, same _ensure_prepared() lazy path, same run_prepared_from_blob call, byte-identical CONTROL_REQUEST switch. The only behavioral difference was that the bootstrap variant flushed store_to_host buffers after a successful kernel run, before publishing TASK_DONE. Pull the state machine into _run_chip_main_loop and inject the bootstrap hook via an on_task_done_success callback. Each entrypoint now only owns its own setup/teardown (init / bootstrap_context, finalize / shutdown_bootstrap+finalize). SHUTDOWN finalize moves from the loop body into a try/finally in the caller, matching what the bootstrap variant was already doing. The bootstrap variant was reaching into cw._impl.malloc/free/copy_to/ copy_from where the non-bootstrap path used the public cw methods; both now share the cw.* methods (thin int()-cast forwarders to _impl, so behavior is unchanged).

…facts prepare_callable_impl previously parked its outputs on four Runtime fields (pending_orch_so_data_/size_, pending_host_dlopen_handle_, pending_host_orch_func_ptr_) because its extern "C" signature only had one Runtime* slot. The c_api then drained those fields, re-extracted kernel_addrs from the Runtime's registered_kernels_ table, and called DeviceRunner::register_prepared_callable*. On run_prepared, DeviceRunner::bind_prepared_callable_to_runtime wrote the saved hbg dlopen handle + entry-symbol pointer back into the same pending_* slots so bind_prepared_to_runtime_impl could read them. Two problems: Runtime carried four short-lived staging fields that had no purpose at run time, and prepare_orch_so kept a dead fallback path that read pending_orch_so_data_ in case cid < 0 — impossible under the prepared-callable flow that has been the only supported path since c3c1ce0. Replace the sidecar with a single PreparedCallableArtifacts out-parameter on prepare_callable_impl, and add a host_orch_func_ptr arg to bind_prepared_to_runtime_impl (trb asserts it must be null). The c_api now drives prepare_callable / run_prepared without touching Runtime staging fields, and bind_prepared_callable_to_runtime returns the hbg host_orch_func_ptr via an out-pointer. Drops all four pending_* fields from both a2a3 and a5 runtime.h, and removes the cid < 0 fallback from prepare_orch_so (now LOG_ERRORs since prepared-callable is the only supported flow). a5 was on the pre-PR1/PR2 helper-free shape before this commit; the prepare_callable_impl rewrite for a5 also picks up the upload_and_collect_child_addrs helper so a2a3 and a5 stay in sync. No user-visible behavior change. All 8 variants (a2a3sim/a2a3/a5sim/a5 x hbg/trb) build cleanly.

gemini-code-assist

Code Review

This pull request refactors the prepared callable workflow across the a2a3 and a5 platforms by introducing a PreparedCallableArtifacts struct for data transfer, which allows for the removal of temporary staging fields from the Runtime class. It also consolidates ChipCallable layout math and kernel upload logic into shared headers to reduce duplication between onboard and simulation runners. Additionally, the Python worker's main loop was refactored to support post-task success hooks, such as flushing buffers. I have no feedback to provide.

poursoul added 4 commits May 13, 2026 15:19

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: clean up callable glue layer (4 commits)#768

Refactor: clean up callable glue layer (4 commits)#768
poursoul wants to merge 4 commits into
mainfrom
refactor/callable-prepare-helper

poursoul commented May 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

poursoul commented May 13, 2026

Summary

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant