Refactor: clean up callable glue layer (4 commits)#768

Open
poursoul wants to merge 4 commits into main from refactor/callable-prepare-helper
@poursoul
Collaborator

Four self-contained refactors of the ChipCallable prepare/run plumbing. No
user-visible behavior change; all four are pure cleanup of duplicated code
or sidecar state.

Summary

  1. Extract upload_and_collect_child_addrs helper. The kernel-upload
    prologue of prepare_callable_impl was byte-identical between
    host_build_graph and tensormap_and_ringbuffer. Move it to
    src/common/task_interface/prepare_callable_common.h; the helper takes
    the upload function pointer directly so it has no Runtime dependency.

  2. Extract chip_callable_layout helpers.
    upload_chip_callable_buffer in the onboard and sim DeviceRunners
    shared the layout math (storage_used, total_size) and FNV-1a dedup
    hash. Move them to chip_callable_layout.h. Onboard additionally
    reuses the patch_chip_callable_scratch_for_device helper that
    rewrites each child's resolved_addr_ to a device offset; sim keeps
    its own dlopen+register_hooks loop.

  3. Merge _chip_process_loop and the bootstrap variant. Both had
    identical TASK_READY / CONTROL_REQUEST / SHUTDOWN state machines.
    Extract _run_chip_main_loop and inject the bootstrap-only
    store_to_host flush via an on_task_done_success hook. The bootstrap
    path was also reaching into cw._impl.malloc / _impl.copy_to where
    the non-bootstrap path used the public cw.* methods — both now use
    the public path (thin int-cast forwarders, so unchanged behavior).

  4. Replace Runtime::pending_* sidecar with PreparedCallableArtifacts.
    prepare_callable_impl used to park host_dlopen_handle,
    host_orch_func_ptr, orch_so_data, and orch_so_size on the
    Runtime because its extern C signature had only one Runtime*
    slot. Replace with an out-parameter struct. On run_prepared,
    bind_prepared_callable_to_runtime now returns the hbg
    host_orch_func_ptr via an out-pointer rather than restoring it onto
    Runtime::pending_host_orch_func_ptr_; bind_prepared_to_runtime_impl
    gains a void *host_orch_func_ptr arg (trb asserts it must be null).
    All four pending_* fields drop from both a2a3 and a5 runtime.h,
    and the dead cid < 0 fallback in prepare_orch_so is replaced with
    a LOG_ERROR (the prepared-callable flow is the only supported path
    since #710, "Feat: prepared callable - register + run(cid) on a unified ABI").

a5 catches up to the a2a3 helper-using shape in the fourth commit (PR4), so
the two stay in sync.

Testing

  • All 8 variants build cleanly: {a2a3sim, a2a3, a5sim, a5} x
    {host_build_graph, tensormap_and_ringbuffer}
  • pytest tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable --platform a2a3sim — 6/6 pass (dlopen_count monotonicity + dedup
    hits + same-cid replay)
  • pytest tests/st/a2a3/tensormap_and_ringbuffer --platform a2a3sim --device 0-1 — L2 trb 23/23 + L3 dependency + L3 group all pass
  • pytest tests/st/a2a3/host_build_graph --platform a2a3sim --device 0-1 — 9/9 pass
  • pytest tests/ut/py/test_worker/test_ensure_prepared.py — 4/4
    pass

poursoul added 4 commits May 13, 2026 15:19
The kernel-upload prologue of prepare_callable_impl was byte-for-byte
identical in host_build_graph and tensormap_and_ringbuffer: upload the
ChipCallable buffer via HostApi, then walk child_offsets to compute each
child's device address and write it to func_id_to_addr_[].

Move that arithmetic into src/common/task_interface/prepare_callable_common.h
as upload_and_collect_child_addrs(). The helper takes the upload function
pointer directly (HostApi types diverge per-runtime) and returns a vector
of {func_id, device_addr}, leaving RUNTIME_MAX_FUNC_ID range validation
in the caller where the constant lives.
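As a rough illustration of the split described above, here is a Python model of the helper and its caller. The real helper is C++; `ChipCallableModel`, `record_child_addrs`, the value of `RUNTIME_MAX_FUNC_ID`, the zero-means-failure convention, and the assumption that child offsets are measured from the start of the uploaded buffer are all hypothetical and only serve the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

RUNTIME_MAX_FUNC_ID = 1024  # hypothetical value; the real constant lives in each runtime


@dataclass
class ChipCallableModel:
    """Hypothetical stand-in for a ChipCallable buffer and its child table."""
    buffer: bytes
    child_func_ids: List[int]
    child_offsets: List[int]  # assumed: byte offsets from the start of the uploaded buffer


def upload_and_collect_child_addrs(
    callable_: ChipCallableModel,
    upload: Callable[[bytes], int],
) -> List[Tuple[int, int]]:
    """Upload the buffer once, then derive each child's device address."""
    device_base = upload(callable_.buffer)
    if device_base == 0:  # assumed failure convention: a null device address
        return []
    return [
        (func_id, device_base + offset)
        for func_id, offset in zip(callable_.child_func_ids, callable_.child_offsets)
    ]


def record_child_addrs(func_id_to_addr: List[int], pairs: List[Tuple[int, int]]) -> None:
    """Caller-side step: range-check each func_id, then fill the table."""
    for func_id, addr in pairs:
        if not 0 <= func_id < RUNTIME_MAX_FUNC_ID:
            raise ValueError(f"func_id {func_id} out of range")
        func_id_to_addr[func_id] = addr


# Usage with a fake uploader that always places the buffer at 0x1000.
demo = ChipCallableModel(buffer=bytes(128), child_func_ids=[3, 7], child_offsets=[0, 64])
pairs = upload_and_collect_child_addrs(demo, upload=lambda buf: 0x1000)
table = [0] * RUNTIME_MAX_FUNC_ID
record_child_addrs(table, pairs)
```

Passing the upload function as a plain callable is what keeps the helper free of any Runtime or HostApi type; the range check stays next to the constant it depends on.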

No behavior change. Both runtimes still hit the same DeviceRunner upload
path and write the same addresses to func_id_to_addr_[].

upload_chip_callable_buffer in the onboard and sim DeviceRunners both
opened with byte-for-byte identical math: walk child_offsets to compute
storage_used / total_size, then FNV-1a hash the buffer for dedup. The
onboard variant additionally rewrites each child's resolved_addr_ in a
host scratch to a device offset before rtMemcpy; sim writes dlopen'd
function pointers instead, so that step stays platform-specific.

Move the shared pieces to src/common/task_interface/chip_callable_layout.h:

  - compute_chip_callable_layout(callable) -> {header_size, total_size,
    content_hash}, mirroring make_callable<>()'s layout.
  - patch_chip_callable_scratch_for_device(callable, layout, target_base,
    scratch), the onboard device-offset rewrite.

Onboard now calls both helpers; sim calls compute_chip_callable_layout
and keeps its own dlopen+register_hooks loop. The fnv1a_64.h direct
include drops out of both device_runner.cpp files.
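The shared layout-plus-hash step can be modeled in a few lines of Python. This is a sketch only: the real helper is C++ and takes the callable itself, while the version here takes an explicit `header_size` and a list of `(offset, size)` pairs, both of which are assumptions about how the child table is laid out. The FNV-1a 64-bit constants are the standard published ones.

```python
from dataclasses import dataclass
from typing import List, Tuple

FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = (1 << 64) - 1


def fnv1a_64(data: bytes) -> int:
    """Standard FNV-1a: XOR each byte in, then multiply by the prime (mod 2**64)."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


@dataclass(frozen=True)
class ChipCallableLayout:
    header_size: int
    total_size: int
    content_hash: int


def compute_chip_callable_layout(
    buffer: bytes,
    header_size: int,                 # assumed to be known up front
    children: List[Tuple[int, int]],  # assumed (offset, size) pairs within child storage
) -> ChipCallableLayout:
    """Walk the children to find storage_used, then hash the occupied bytes."""
    storage_used = max((offset + size for offset, size in children), default=0)
    total_size = header_size + storage_used
    if total_size > len(buffer):
        raise ValueError("child table extends past the end of the buffer")
    return ChipCallableLayout(header_size, total_size, fnv1a_64(buffer[:total_size]))


# Two children: 16 bytes at offset 0 and 24 bytes at offset 64, after a 32-byte header.
layout = compute_chip_callable_layout(bytes(200), header_size=32, children=[(0, 16), (64, 24)])
```

Because the hash covers only the first `total_size` bytes, two buffers that differ only in trailing padding would dedup to the same entry in this model.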

No behavior change.
…in loop

_chip_process_loop and _chip_process_loop_with_bootstrap had identical
TASK_READY / CONTROL_REQUEST / SHUTDOWN state machines: same mailbox
decoding, same per-cid prepared set, same _ensure_prepared() lazy path,
same run_prepared_from_blob call, byte-identical CONTROL_REQUEST switch.
The only behavioral difference was that the bootstrap variant flushed
store_to_host buffers after a successful kernel run, before publishing
TASK_DONE.

Pull the state machine into _run_chip_main_loop and inject the bootstrap
hook via an on_task_done_success callback. Each entrypoint now only owns
its own setup/teardown (init / bootstrap_context, finalize /
shutdown_bootstrap+finalize). SHUTDOWN finalize moves from the loop body
into a try/finally in the caller, matching what the bootstrap variant
was already doing.
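A minimal sketch of that shape, assuming a mailbox that can be iterated as `(state, payload)` pairs and tasks that always succeed. The function names without leading underscores, the `log` list, and the `flush:` entries are hypothetical stand-ins for the real worker calls.

```python
from enum import Enum, auto
from typing import Callable, Iterable, List, Optional, Tuple


class State(Enum):
    TASK_READY = auto()
    CONTROL_REQUEST = auto()
    SHUTDOWN = auto()


Message = Tuple[State, object]


def run_chip_main_loop(
    mailbox: Iterable[Message],
    log: List[str],
    on_task_done_success: Optional[Callable[[object], None]] = None,
) -> None:
    """Shared state machine; setup and teardown belong to the caller."""
    for state, payload in mailbox:
        if state is State.SHUTDOWN:
            return  # finalize happens in the caller's try/finally
        if state is State.CONTROL_REQUEST:
            log.append(f"control:{payload}")
            continue
        log.append(f"run:{payload}")  # the task is assumed to succeed in this sketch
        if on_task_done_success is not None:
            on_task_done_success(payload)  # runs before TASK_DONE is published
        log.append(f"TASK_DONE:{payload}")


def chip_process_loop(mailbox: Iterable[Message], log: List[str]) -> None:
    log.append("init")
    try:
        run_chip_main_loop(mailbox, log)
    finally:
        log.append("finalize")


def chip_process_loop_with_bootstrap(mailbox: Iterable[Message], log: List[str]) -> None:
    log.append("bootstrap_context")
    try:
        run_chip_main_loop(
            mailbox, log, on_task_done_success=lambda cid: log.append(f"flush:{cid}")
        )
    finally:
        log.append("shutdown_bootstrap")
        log.append("finalize")


messages = [
    (State.TASK_READY, 1),
    (State.CONTROL_REQUEST, "ping"),
    (State.SHUTDOWN, None),
]
plain_log: List[str] = []
bootstrap_log: List[str] = []
chip_process_loop(messages, plain_log)
chip_process_loop_with_bootstrap(messages, bootstrap_log)
```

The hook keeps the bootstrap-only flush out of the shared loop, and the `try/finally` means finalize runs even if the loop raises.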

The bootstrap variant was reaching into cw._impl.malloc/free/copy_to/
copy_from where the non-bootstrap path used the public cw methods; both
now share the cw.* methods (thin int()-cast forwarders to _impl, so
behavior is unchanged).
…facts

prepare_callable_impl previously parked its outputs on four Runtime
fields (pending_orch_so_data_/size_, pending_host_dlopen_handle_,
pending_host_orch_func_ptr_) because its extern "C" signature only had
one Runtime* slot. The c_api then drained those fields, re-extracted
kernel_addrs from the Runtime's registered_kernels_ table, and called
DeviceRunner::register_prepared_callable*. On run_prepared,
DeviceRunner::bind_prepared_callable_to_runtime wrote the saved hbg
dlopen handle + entry-symbol pointer back into the same pending_* slots
so bind_prepared_to_runtime_impl could read them.

Two problems: Runtime carried four short-lived staging fields that had
no purpose at run time, and prepare_orch_so kept a dead fallback path
that read pending_orch_so_data_ in case cid < 0 — impossible under the
prepared-callable flow that has been the only supported path since
c3c1ce0.

Replace the sidecar with a single PreparedCallableArtifacts out-parameter
on prepare_callable_impl, and add a host_orch_func_ptr arg to
bind_prepared_to_runtime_impl (trb asserts it must be null). The c_api
now drives prepare_callable / run_prepared without touching Runtime
staging fields, and bind_prepared_callable_to_runtime returns the hbg
host_orch_func_ptr via an out-pointer.
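The data flow can be modeled in Python as below. The actual code is C++ behind an extern "C" boundary, so this only illustrates where the values travel. The four field names come from the description above; the `host_side` switch, the placeholder handle values, the `orch_entry` attribute, and the guess about which runtime fills which fields are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreparedCallableArtifacts:
    """Out-parameter struct that replaces the four pending_* Runtime fields."""
    host_dlopen_handle: Optional[int] = None
    host_orch_func_ptr: Optional[int] = None
    orch_so_data: bytes = b""
    orch_so_size: int = 0


class Runtime:
    """Model runtime: note the absence of any pending_* staging attribute."""

    def __init__(self) -> None:
        self.orch_entry: Optional[int] = None  # hypothetical run-time slot


def prepare_callable_impl(
    runtime: Runtime, orch_so: bytes, out: PreparedCallableArtifacts, host_side: bool
) -> int:
    """Write outputs into `out` instead of parking them on the runtime."""
    if host_side:  # assumed hbg-like path: host dlopen handle plus entry symbol
        out.host_dlopen_handle = 0xD1   # placeholder value
        out.host_orch_func_ptr = 0xF00  # placeholder value
    else:          # assumed trb-like path: raw shared-object bytes
        out.orch_so_data = orch_so
        out.orch_so_size = len(orch_so)
    return 0


def bind_prepared_to_runtime_impl_hbg(runtime: Runtime, host_orch_func_ptr: Optional[int]) -> None:
    runtime.orch_entry = host_orch_func_ptr


def bind_prepared_to_runtime_impl_trb(runtime: Runtime, host_orch_func_ptr: Optional[int]) -> None:
    assert host_orch_func_ptr is None, "trb expects a null host_orch_func_ptr"


hbg_rt, hbg_out = Runtime(), PreparedCallableArtifacts()
prepare_callable_impl(hbg_rt, b"", hbg_out, host_side=True)
bind_prepared_to_runtime_impl_hbg(hbg_rt, hbg_out.host_orch_func_ptr)

trb_rt, trb_out = Runtime(), PreparedCallableArtifacts()
prepare_callable_impl(trb_rt, b"\x7fELF", trb_out, host_side=False)
bind_prepared_to_runtime_impl_trb(trb_rt, trb_out.host_orch_func_ptr)
```

In this model the pointer reaches the bind step as an explicit argument, so nothing short-lived has to be stored on the runtime between prepare and run.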

Drops all four pending_* fields from both a2a3 and a5 runtime.h, and
removes the cid < 0 fallback from prepare_orch_so (now LOG_ERRORs since
prepared-callable is the only supported flow).

a5 was on the pre-PR1/PR2 helper-free shape before this commit; the
prepare_callable_impl rewrite for a5 also picks up the
upload_and_collect_child_addrs helper so a2a3 and a5 stay in sync.

No user-visible behavior change. All 8 variants
(a2a3sim/a2a3/a5sim/a5 x hbg/trb) build cleanly.

@gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the prepared callable workflow across the a2a3 and a5 platforms by introducing a PreparedCallableArtifacts struct for data transfer, which allows for the removal of temporary staging fields from the Runtime class. It also consolidates ChipCallable layout math and kernel upload logic into shared headers to reduce duplication between onboard and simulation runners. Additionally, the Python worker's main loop was refactored to support post-task success hooks, such as flushing buffers. I have no feedback to provide.
