Refactor: decouple logger SO + collapse simpler_init/bind_executors → device_init #735
Merged
ChaoWao merged 1 commit on May 12, 2026
Code Review
This pull request refactors the PTO Runtime to move the binding of AICPU and AICore executor binaries from a per-run operation to a one-time initialization step. It introduces a new bind_executors C-API and updates the DeviceRunner class to store these binaries as members, significantly simplifying the run method signatures across all supported platforms. Feedback focuses on ensuring the documentation accurately reflects the simulation platform's binary loading lifecycle and addressing a potential resource leak in the initialization path where exceptions during file reading could bypass cleanup logic.
5de7363 to
694c1b6
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request on May 11, 2026
…e/run_prepared

Rebase of hw-native-sys#735 on top of hw-native-sys#710 (prepared-callable framework). Two strands:

(1) Logger ownership moves entirely to libsimpler_log.so.

Before: simpler_init in host_runtime.so reached cross-SO into libsimpler_log for HostLogger setup, then host_runtime.so cached log_level / log_info_v on every DeviceRunner so run_prepared could later forward them to AICPU. Log state lived in three places (HostLogger, runner member, KernelArgs), all seeded off the same C-ABI argument.

After: libsimpler_log.so exports its own simpler_log_init(level, info_v) C entry, called from ChipWorker::init BEFORE host_runtime.so is even dlopened. HostLogger gains level()/info_v() raw getters. Every consumer (host_runtime, AICPU forwarding, CANN dlog sync) reads from HostLogger::get_instance() directly. Log state is owned in exactly one place; no log argument ever travels through the host_runtime.so C ABI.

(2) Executor binaries hoisted out of every prepare_callable / run_prepared call. They were already conceptually one-shot (hw-native-sys#710's prepare_callable accepted and ignored them; run_prepared still threaded them per-launch via the C ABI just to reach DeviceRunner::run). Now:

Before:
    prepare_callable(ctx, cid, callable, device_id, aicpu*, aicpu_size, aicore*, aicore_size)
    run_prepared(ctx, runtime, cid, args, block_dim, aicpu_thread_num, device_id, aicpu*, aicpu_size, aicore*, aicore_size, ...)

After:
    simpler_init(ctx, device_id, aicpu*, aicpu_size, aicore*, aicore_size)
        attach + executor takeover + (onboard) dlog sync
    prepare_callable(ctx, cid, callable)
    run_prepared(ctx, runtime, cid, args, block_dim, aicpu_thread_num, enable_l2_swimlane, enable_dump_tensor, enable_pmu, output_prefix)

ChipWorker no longer caches aicpu_binary_ / aicore_binary_ members; the bytes are read once in init() and transferred into DeviceRunner-owned vectors via simpler_init.

### Changes

libsimpler_log.so:
- HostLogger gains `int level() const` / `int info_v() const` raw getters.
- New C export `simpler_log_init(int log_level, int log_info_v)` validates and forwards to HostLogger setters.

host_runtime.so C ABI (`pto_runtime_c_api.h` + 4 platform impls):
- `simpler_init` signature: drops `log_level`/`log_info_v`; adds `aicpu_binary*, aicpu_size, aicore_binary*, aicore_size`. Impl: attach + runner->set_executors() + (onboard) `dlog_setlevel(HostLogger.level())`.
- `prepare_callable` signature: drops `device_id` + binary pointers (the upstream impl already ignored them).
- `run_prepared` signature: drops `device_id` + binary pointers. Body reads `runner->device_id()` for the `prepare_run_context` call; runner binaries are already loaded.

4 × DeviceRunner.{h,cpp}:
- New `aicpu_so_binary_` member alongside existing `aicore_kernel_binary_`.
- New `set_executors(aicpu, aicore)` setter (move-in, called from simpler_init); new `device_id() const` getter.
- `run()` signature loses `device_id` + binary vectors; reads from members.
- `ensure_device_initialized()` / `ensure_binaries_loaded()` become argless.
- Drop `log_level_` / `log_info_v_` members + `set_log_level` / `set_log_info_v` setters. Onboard `run()` reads `HostLogger::get_instance().level() / .info_v()` directly when populating KernelArgs. Sim drops the dlsym'd `set_log_level_func_` / `set_log_info_v_func_` from the AICPU sim SO entirely; HostLogger is RTLD_GLOBAL, so the AICPU sim SO resolves it the same way.

ChipWorker:
- Dlsym `simpler_log_init` from libsimpler_log.so (now stashed in the existing `g_simpler_log_handle`); call it BEFORE host_runtime.so is opened so any LOG_* macro firing during host_runtime's dlopen-time constructors already sees the right level.
- Adapt the existing `simpler_init` dlsym to the new 6-arg signature; read the binary bytes into local vectors that simpler_init moves into the runner, then drop the locals (no per-ChipWorker binary cache).
- `prepare_callable` / `run_prepared` calls drop the binary args.
- init()'s rollback path absorbs read_binary_file failures inside the same try/catch so partial state can't leak.

Docs (chip-level-arch, dynamic-linking, logging, testing, python/simpler/__init__.py, python/simpler/_log.py): updated ABI listings, init-flow ASCII art, and "configuration flow" table to reflect new shape.

Verified locally on a2a3sim + a5sim:
- pip install --no-build-isolation -e . (all 4 host_runtime.so + libsimpler_log compile)
- pytest tests/ut/py (212 passed, 7 skipped; 4 pre-existing torch-missing failures unrelated)
- examples/workers/l2/{hello_worker, worker_malloc} on both sims

Onboard ut + st coverage runs in CI (Linux).
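The executor hoisting and the rollback rule described in this commit message can be modelled in a few lines. This is a sketch in Python rather than the project's C++, and everything in it is an illustrative assumption: the toy `DeviceRunner` and `ChipWorker` classes only borrow the names from the commit message, and `Path.read_bytes` stands in for `read_binary_file`. It shows the binaries being read once at init, handed to the runner, and never passed again at run time, with a failed read undoing the partially built state.

```python
from pathlib import Path
from typing import Any, Optional, Tuple


class DeviceRunner:
    """Toy stand-in: owns the executor bytes after init (hypothetical model)."""

    def __init__(self, device_id: int) -> None:
        self.device_id = device_id
        self.aicpu_so_binary = b""
        self.aicore_kernel_binary = b""

    def set_executors(self, aicpu: bytes, aicore: bytes) -> None:
        # One-time takeover; run() reads these members instead of arguments.
        self.aicpu_so_binary = aicpu
        self.aicore_kernel_binary = aicore

    def run(self, args: Any) -> Tuple[int, Any]:
        if not self.aicpu_so_binary or not self.aicore_kernel_binary:
            raise RuntimeError("executors not bound; call init first")
        return (self.device_id, args)


class ChipWorker:
    """Toy stand-in: reads binaries once, keeps no per-worker copy."""

    def __init__(self) -> None:
        self.runner: Optional[DeviceRunner] = None

    def init(self, aicpu_path: str, aicore_path: str, device_id: int) -> None:
        self.runner = DeviceRunner(device_id)  # partial state exists from here
        try:
            # File reads sit inside the same try block as the rest of init,
            # so a missing file triggers the same rollback as any other error.
            aicpu = Path(aicpu_path).read_bytes()
            aicore = Path(aicore_path).read_bytes()
            self.runner.set_executors(aicpu, aicore)
        except Exception:
            self.runner = None  # roll back so no half-initialised runner leaks
            raise

    def run_prepared(self, args: Any) -> Tuple[int, Any]:
        if self.runner is None:
            raise RuntimeError("worker not initialised")
        return self.runner.run(args)  # no device_id or binary arguments
```

In this model a caller passes the executor paths exactly once, and a failed `init` leaves the worker in the same state as before the call.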
694c1b6 to
7b7c60d
7b7c60d to
14ad620
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request on May 12, 2026
14ad620 to
4162d49
4162d49 to
26a4684
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request on May 12, 2026
…pWorker::init to Python

Continues the API-narrowing theme of hw-native-sys#723 / hw-native-sys#735. ChipWorker::init was the last place in C++ doing process-wide SO bootstrap (dlopen libsimpler_log.so and, on sim, libcpu_sim_context.so with RTLD_GLOBAL, plus calling libsimpler_log.so's simpler_log_init to seed HostLogger). That work moves up into the Python `ChipWorker` wrapper, shrinking the C++ init signature from 8 args to 4.

Before:
    void ChipWorker::init(host_lib, aicpu, aicore, simpler_log_lib, device_id, sim_context_lib = "", log_level = 1, log_info_v = 5);

After:
    void ChipWorker::init(host_lib, aicpu, aicore, device_id);

### Why this is safe

`_task_interface.so` (the nanobind module that contains chip_worker.cpp) has no undefined HostLogger / unified_log_* symbols: chip_worker.cpp reaches host_runtime.so purely via dlsym, and the binding code itself doesn't log. So the RTLD_GLOBAL preload only has to precede the `_ChipWorker.init` dlopen of host_runtime.so, not module import. The Python wrapper does exactly that:

1. ctypes.CDLL(bins.simpler_log_path, mode=RTLD_GLOBAL)   # once per process
2. <handle>.simpler_log_init(log_level, log_info_v)       # seed HostLogger
3. if bins.sim_context_path:                              # sim only
       ctypes.CDLL(bins.sim_context_path, mode=RTLD_GLOBAL)
4. self._impl.init(host_path, aicpu_path, aicore_path, device_id)

A module-level `_preloaded_globals: dict[str, ctypes.CDLL]` makes the loads idempotent per path, the Python counterpart of the C++ side's old std::once_flag.

### Changes

src/common/worker/chip_worker.{h,cpp}:
- init() drops simpler_log_lib_path, sim_context_lib_path, log_level, log_info_v params.
- Remove the g_simpler_log_* / g_sim_context_* file-scope globals, ensure_simpler_log_loaded(), ensure_sim_context_loaded(), the SimplerLogInitFn typedef + simpler_log_init_fn_ member, and the simpler_log_init call. Drop the now-unused <mutex> include.
- init()'s body is just: dlopen host_runtime.so RTLD_LOCAL → dlsym → create device ctx → read executor binaries → simpler_init.

python/bindings/task_interface.cpp:
- `_ChipWorker.init` nanobind def: 4 args (host_lib_path, aicpu_path, aicore_path, device_id).

python/simpler/task_interface.py:
- New module-level `_preloaded_globals` registry + `_preload_global(path)` helper (ctypes.CDLL RTLD_GLOBAL, one per path).
- ChipWorker.init: preload libsimpler_log.so + call simpler_log_init via the ctypes handle, preload libcpu_sim_context.so when bins.sim_context_path is set, then call the 4-arg _impl.init. Wrapper's public signature (device_id, bins, log_level=None, log_info_v=None) is unchanged, so no caller updates needed.

tests/ut/py/test_chip_worker.py:
- The three `_ChipWorker.init(...)` fault-path tests drop the `/nonexistent/libsimpler_log.so` argument (no longer a parameter).

Docs (chip-level-arch, dynamic-linking, logging, python/simpler/__init__.py, python/simpler/_log.py): updated the init-flow ASCII art / load-order section / configuration-flow table to show the preload happening in the Python wrapper before the C++ _ChipWorker.init.

Verified locally on a2a3sim + a5sim:
- pip install --no-build-isolation -e .
- pytest tests/ut/py (119 passed, 7 skipped; torch-missing tests excluded as before)
- examples/workers/l2/{hello_worker, worker_malloc} on both sims

Onboard ut + st coverage runs in CI (Linux).
poursoul added a commit to poursoul/simpler that referenced this pull request on May 12, 2026
Hoist `_ensure_prepared` out of the two chip-child loops into a single module-level helper so the lazy/eager prepare branches stay in sync. Add UT coverage for the lazy-prewarm fallback path, which previously had none. Unify the `prepared_callable_path_used_` comment across all four device_runner.h headers and note the legacy-path assumption it rests on. Clears the remaining follow-up items from PR hw-native-sys#710 / hw-native-sys#735.
ChaoWao added a commit that referenced this pull request on May 12, 2026
…pWorker::init to Python (#746)
Summary
Continues the API-narrowing theme of #723. Two strands woven together — both
amount to "one piece of state lives in one place; no redundant C-ABI
plumbing".
(1) Logger ownership moves entirely to `libsimpler_log.so`

Before: `simpler_init` lived on `host_runtime.so` but reached cross-SO into `libsimpler_log.so` to mutate `HostLogger`, then cached `log_level` / `log_info_v` on every `DeviceRunner` so `run_runtime` could forward them to AICPU via `KernelArgs`. Log state lived in three places (`HostLogger`, runner member, `KernelArgs`), all seeded off the same C-ABI argument.

After: `libsimpler_log.so` exports its own `simpler_log_init(level, info_v)` C entry, called from `ChipWorker::init` before `host_runtime.so` is even dlopen'd. `HostLogger` gains `level()` / `info_v()` raw getters. Every consumer (host_runtime populating KernelArgs, AICPU sim SO setters, onboard CANN dlog sync) reads `HostLogger::get_instance()` directly. Log state lives in exactly one place; no log argument ever travels through the `host_runtime.so` C ABI.

(2) `host_runtime.so`'s init surface collapses to one entry: `device_init`

Before: `simpler_init(ctx, device_id, log_level, log_info_v)` + `bind_executors(ctx, aicpu_*, aicore_*)`, two adjacent init-time entries always called back-to-back.

After: `device_init(ctx, device_id, aicpu_*, aicore_*)`, a single entry that attaches the calling thread, takes ownership of executor binaries, and (onboard) syncs CANN dlog from `HostLogger`. The log args are gone because (1) put them on a separate SO.
What changed

- `libsimpler_log.so` (`src/common/log/host_log.{h,cpp}`): `HostLogger::level()` / `info_v()` getters; new `simpler_log_init(int, int)` C export
- `pto_runtime_c_api.h`: `simpler_init` + `bind_executors` removed; new `device_init(ctx, device_id, aicpu_*, aicore_*)`; dlsym list updated
- `pto_runtime_c_api.cpp`: `device_init`; onboard's `dlog_setlevel` reads `HostLogger::get_instance().level()`
- `DeviceRunner.{h,cpp}`: drop `log_level_` / `log_info_v_` members + `set_log_level` / `set_log_info_v` setters; `run()` reads `HostLogger::get_instance()` directly
- `ChipWorker.{h,cpp}`: dlsym `simpler_log_init` from `libsimpler_log.so`, call it before `host_runtime.so` dlopen; replace `simpler_init_fn_` + `bind_executors_fn_` with single `device_init_fn_`; one rc check + one rollback
- Docs: `chip-level-arch.md`, `dynamic-linking.md`, `logging.md`, `testing.md`, `python/simpler/{__init__.py,_log.py}`, `worker_malloc/README.md` all refreshed

End-state diagram
Test plan

- `pip install --no-build-isolation -e .`: all 4 host_runtime.so + libsimpler_log compile clean on macOS (a2a3sim + a5sim)
- `pytest tests/ut/py`: 116 passed, 7 skipped (unrelated torch-gated)
- `examples/workers/l2/{hello_worker, worker_malloc}/main.py` on a2a3sim + a5sim
- `tests/ut/py/test_worker/test_bootstrap_context_sim.py` (5 passed, exercises `init → device_init → run` end-to-end on sim)
- `ut-a2a3` / `ut-a5` / `st-onboard-*` (Linux + hardware)
- `ctest --test-dir build/ut_cpp` (Linux)