Refactor: clean up callable glue layer (4 commits)#768
Open
poursoul wants to merge 4 commits into
Open
Conversation
The kernel-upload prologue of prepare_callable_impl was byte-for-byte
identical in host_build_graph and tensormap_and_ringbuffer: upload the
ChipCallable buffer via HostApi, then walk child_offsets to compute each
child's device address and write it to func_id_to_addr_[].
Move that arithmetic into src/common/task_interface/prepare_callable_common.h
as upload_and_collect_child_addrs(). The helper takes the upload function
pointer directly (HostApi types diverge per-runtime) and returns a vector
of {func_id, device_addr}, leaving RUNTIME_MAX_FUNC_ID range validation
in the caller where the constant lives.
No behavior change. Both runtimes still hit the same DeviceRunner upload
path and write the same addresses to func_id_to_addr_[].
upload_chip_callable_buffer in the onboard and sim DeviceRunners both
opened with byte-for-byte identical math: walk child_offsets to compute
storage_used / total_size, then FNV-1a hash the buffer for dedup. The
onboard variant additionally rewrites each child's resolved_addr_ in a
host scratch to a device offset before rtMemcpy; sim writes dlopen'd
function pointers instead, so that step stays platform-specific.
Move the shared pieces to src/common/task_interface/chip_callable_layout.h:
- compute_chip_callable_layout(callable) -> {header_size, total_size,
content_hash}, mirroring make_callable<>()'s layout.
- patch_chip_callable_scratch_for_device(callable, layout, target_base,
scratch), the onboard device-offset rewrite.
Onboard now calls both helpers; sim calls compute_chip_callable_layout
and keeps its own dlopen+register_hooks loop. The fnv1a_64.h direct
include drops out of both device_runner.cpp files.
No behavior change.
…in loop _chip_process_loop and _chip_process_loop_with_bootstrap had identical TASK_READY / CONTROL_REQUEST / SHUTDOWN state machines: same mailbox decoding, same per-cid prepared set, same _ensure_prepared() lazy path, same run_prepared_from_blob call, byte-identical CONTROL_REQUEST switch. The only behavioral difference was that the bootstrap variant flushed store_to_host buffers after a successful kernel run, before publishing TASK_DONE. Pull the state machine into _run_chip_main_loop and inject the bootstrap hook via an on_task_done_success callback. Each entrypoint now only owns its own setup/teardown (init / bootstrap_context, finalize / shutdown_bootstrap+finalize). SHUTDOWN finalize moves from the loop body into a try/finally in the caller, matching what the bootstrap variant was already doing. The bootstrap variant was reaching into cw._impl.malloc/free/copy_to/ copy_from where the non-bootstrap path used the public cw methods; both now share the cw.* methods (thin int()-cast forwarders to _impl, so behavior is unchanged).
…facts prepare_callable_impl previously parked its outputs on four Runtime fields (pending_orch_so_data_/size_, pending_host_dlopen_handle_, pending_host_orch_func_ptr_) because its extern "C" signature only had one Runtime* slot. The c_api then drained those fields, re-extracted kernel_addrs from the Runtime's registered_kernels_ table, and called DeviceRunner::register_prepared_callable*. On run_prepared, DeviceRunner::bind_prepared_callable_to_runtime wrote the saved hbg dlopen handle + entry-symbol pointer back into the same pending_* slots so bind_prepared_to_runtime_impl could read them. Two problems: Runtime carried four short-lived staging fields that had no purpose at run time, and prepare_orch_so kept a dead fallback path that read pending_orch_so_data_ in case cid < 0 — impossible under the prepared-callable flow that has been the only supported path since c3c1ce0. Replace the sidecar with a single PreparedCallableArtifacts out-parameter on prepare_callable_impl, and add a host_orch_func_ptr arg to bind_prepared_to_runtime_impl (trb asserts it must be null). The c_api now drives prepare_callable / run_prepared without touching Runtime staging fields, and bind_prepared_callable_to_runtime returns the hbg host_orch_func_ptr via an out-pointer. Drops all four pending_* fields from both a2a3 and a5 runtime.h, and removes the cid < 0 fallback from prepare_orch_so (now LOG_ERRORs since prepared-callable is the only supported flow). a5 was on the pre-PR1/PR2 helper-free shape before this commit; the prepare_callable_impl rewrite for a5 also picks up the upload_and_collect_child_addrs helper so a2a3 and a5 stay in sync. No user-visible behavior change. All 8 variants (a2a3sim/a2a3/a5sim/a5 x hbg/trb) build cleanly.
There was a problem hiding this comment.
Code Review
This pull request refactors the prepared callable workflow across the a2a3 and a5 platforms by introducing a PreparedCallableArtifacts struct for data transfer, which allows for the removal of temporary staging fields from the Runtime class. It also consolidates ChipCallable layout math and kernel upload logic into shared headers to reduce duplication between onboard and simulation runners. Additionally, the Python worker's main loop was refactored to support post-task success hooks, such as flushing buffers. I have no feedback to provide.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Four self-contained refactors of the ChipCallable prepare/run plumbing. No
user-visible behavior change; all four are pure cleanup of duplicated code
or sidecar state.
Summary
Extract
upload_and_collect_child_addrshelper. The kernel-uploadprologue of
prepare_callable_implwas byte-identical betweenhost_build_graphandtensormap_and_ringbuffer. Move it tosrc/common/task_interface/prepare_callable_common.h; the helper takesthe upload function pointer directly so it has no Runtime dependency.
Extract
chip_callable_layouthelpers.upload_chip_callable_bufferin the onboard and simDeviceRunnersshared the layout math (
storage_used,total_size) and FNV-1a deduphash. Move them to
chip_callable_layout.h. Onboard additionallyreuses the
patch_chip_callable_scratch_for_devicehelper thatrewrites each child's
resolved_addr_to a device offset; sim keepsits own dlopen+register_hooks loop.
Merge
_chip_process_loopand the bootstrap variant. Both hadidentical TASK_READY / CONTROL_REQUEST / SHUTDOWN state machines.
Extract
_run_chip_main_loopand inject the bootstrap-onlystore_to_host flush via an
on_task_done_successhook. The bootstrappath was also reaching into
cw._impl.malloc/_impl.copy_towherethe non-bootstrap path used the public
cw.*methods — both now usethe public path (thin int-cast forwarders, so unchanged behavior).
Replace
Runtime::pending_*sidecar withPreparedCallableArtifacts.prepare_callable_implused to parkhost_dlopen_handle,host_orch_func_ptr,orch_so_data, andorch_so_sizeon theRuntimebecause itsextern Csignature had only oneRuntime*slot. Replace with an out-parameter struct. On
run_prepared,bind_prepared_callable_to_runtimenow returns the hbghost_orch_func_ptrvia an out-pointer rather than restoring it ontoRuntime::pending_host_orch_func_ptr_;bind_prepared_to_runtime_implgains a
void *host_orch_func_ptrarg (trb asserts it must be null).All four
pending_*fields drop from both a2a3 and a5runtime.h,and the dead
cid < 0fallback inprepare_orch_sois replaced witha
LOG_ERROR(the prepared-callable flow is the only supported pathsince Feat: prepared callable — register + run(cid) on a unified ABI #710).
a5 caught up to the a2a3 helper-using shape in PR4 so the two stay in
sync.
Testing
a2a3sim,a2a3,a5sim,a5xhost_build_graph,tensormap_and_ringbuffer)pytest tests/st/a2a3/tensormap_and_ringbuffer/prepared_callable --platform a2a3sim— 6/6 pass (dlopen_count monotonicity + deduphits + same-cid replay)
pytest tests/st/a2a3/tensormap_and_ringbuffer --platform a2a3sim --device 0-1— L2 trb 23/23 + L3 dependency + L3 group all passpytest tests/st/a2a3/host_build_graph --platform a2a3sim --device 0-1— 9/9 passpytest tests/ut/py/test_worker/test_ensure_prepared.py— 4/4pass