Refactor: decouple logger SO + collapse simpler_init/bind_executors → device_init #735

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoWao:refactor/hoist-executors-out-of-run-runtime
May 12, 2026

@ChaoWao ChaoWao commented May 11, 2026

Summary

Continues the API-narrowing theme of #723. Two strands woven together — both
amount to "one piece of state lives in one place; no redundant C-ABI
plumbing".

(1) Logger ownership moves entirely to libsimpler_log.so

Before: simpler_init lived on host_runtime.so but reached cross-SO into
libsimpler_log.so to mutate HostLogger, then cached log_level /
log_info_v on every DeviceRunner so run_runtime could forward them to
AICPU via KernelArgs. Log state lived in three places (HostLogger, runner
member, KernelArgs) all seeded off the same C-ABI argument.

After: libsimpler_log.so exports its own simpler_log_init(level, info_v) C entry, called from ChipWorker::init before host_runtime.so
is even dlopen'd. HostLogger gains level()/info_v() raw getters.
Every consumer (host_runtime populating KernelArgs, AICPU sim SO setters,
onboard CANN dlog sync) reads HostLogger::get_instance() directly. Log
state lives in exactly one place; no log argument ever travels through
the host_runtime.so C ABI.
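
The single-owner arrangement can be sketched roughly as follows. This is a minimal illustration, not the code in `host_log.{h,cpp}`: the setter names, the `int` return code, the accepted ranges, the default values, and the two-field `KernelArgs` struct are all assumptions made for the example. Only `HostLogger::get_instance()`, `level()`, `info_v()`, and `simpler_log_init(int, int)` are named in the PR.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for HostLogger: the only place log state is stored.
class HostLogger {
public:
    static HostLogger &get_instance() {
        static HostLogger instance;  // function-local static: initialized once
        return instance;
    }

    // Raw getters that any consumer may call.
    int level() const { return level_.load(std::memory_order_relaxed); }
    int info_v() const { return info_v_.load(std::memory_order_relaxed); }

    // Assumed setter names; they expect input already validated by the C entry.
    void set_level(int level) {
        assert(level >= 0);
        level_.store(level, std::memory_order_relaxed);
    }
    void set_info_v(int info_v) {
        assert(info_v >= 0);
        info_v_.store(info_v, std::memory_order_relaxed);
    }

private:
    HostLogger() = default;
    std::atomic<int> level_{0};   // assumed default
    std::atomic<int> info_v_{0};  // assumed default
};

// Assumed upper bounds, chosen only so the sketch has something to validate.
constexpr int kMaxLogLevel = 4;
constexpr int kMaxInfoV = 9;

// The only writer: validates, then forwards to the setters.
// Returning 0 / -1 is an assumption about the real entry's contract.
extern "C" int simpler_log_init(int log_level, int log_info_v) {
    if (log_level < 0 || log_level > kMaxLogLevel) return -1;
    if (log_info_v < 0 || log_info_v > kMaxInfoV) return -1;
    HostLogger &logger = HostLogger::get_instance();
    logger.set_level(log_level);
    logger.set_info_v(log_info_v);
    return 0;
}

// Hypothetical two-field view of KernelArgs, filled at launch time.
struct KernelArgs {
    int log_level = 0;
    int log_info_v = 0;
};

// A consumer reads the logger directly instead of receiving log arguments.
KernelArgs make_kernel_args() {
    KernelArgs args;
    args.log_level = HostLogger::get_instance().level();
    args.log_info_v = HostLogger::get_instance().info_v();
    return args;
}
```

Because `make_kernel_args()` takes no parameters, there is no second copy of the level that could drift from what `simpler_log_init` set.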

(2) host_runtime.so's init surface collapses to one entry: device_init

Before: simpler_init(ctx, device_id, log_level, log_info_v) +
bind_executors(ctx, aicpu_*, aicore_*) — two adjacent init-time entries
always called back-to-back.

After: device_init(ctx, device_id, aicpu_*, aicore_*) is the single entry.
It attaches the calling thread, takes ownership of the executor binaries, and
(onboard) syncs CANN dlog from HostLogger. The log arguments are gone because
(1) moved them to a separate SO.
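
A rough sketch of what a collapsed entry of this shape might look like. The `DeviceRunner` below is a hypothetical stand-in, the `void *` context handle, the pointer-plus-size parameter layout, and the 0 / -1 return convention are assumptions, and the onboard `dlog_setlevel` call is left as a comment because it needs the CANN headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-platform DeviceRunner.
class DeviceRunner {
public:
    int attach_current_thread(int device_id) {
        if (device_id < 0) return -1;  // assumed validity rule
        device_id_ = device_id;
        return 0;
    }

    // Move-in setter: the runner becomes the only owner of the bytes.
    void set_executors(std::vector<std::uint8_t> aicpu, std::vector<std::uint8_t> aicore) {
        aicpu_binary_ = std::move(aicpu);
        aicore_binary_ = std::move(aicore);
    }

    int device_id() const { return device_id_; }
    std::size_t aicpu_size() const { return aicpu_binary_.size(); }
    std::size_t aicore_size() const { return aicore_binary_.size(); }

private:
    int device_id_ = -1;
    std::vector<std::uint8_t> aicpu_binary_;
    std::vector<std::uint8_t> aicore_binary_;
};

using DeviceContextHandle = void *;  // assumed opaque handle type

// Single init-time entry: attach, then take over the executor binaries.
// No exception is allowed to cross the C ABI boundary.
extern "C" int device_init(DeviceContextHandle ctx, int device_id,
                           const std::uint8_t *aicpu_binary, std::size_t aicpu_size,
                           const std::uint8_t *aicore_binary, std::size_t aicore_size) {
    if (ctx == nullptr || aicpu_binary == nullptr || aicore_binary == nullptr) return -1;
    auto *runner = static_cast<DeviceRunner *>(ctx);
    try {
        if (runner->attach_current_thread(device_id) != 0) return -1;
        // Copy out of the caller's buffers once, then move into the runner.
        std::vector<std::uint8_t> aicpu(aicpu_binary, aicpu_binary + aicpu_size);
        std::vector<std::uint8_t> aicore(aicore_binary, aicore_binary + aicore_size);
        runner->set_executors(std::move(aicpu), std::move(aicore));
        // Onboard builds would sync CANN dlog from HostLogger at this point.
    } catch (...) {
        return -1;
    }
    return 0;
}
```

With one entry, the caller has one return code to check and one rollback to perform, which matches the ChipWorker change described in the table.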

What changed

| File group | Change |
| --- | --- |
| libsimpler_log.so (src/common/log/host_log.{h,cpp}) | HostLogger::level() / info_v() getters; new simpler_log_init(int, int) C export |
| pto_runtime_c_api.h | simpler_init + bind_executors removed; new device_init(ctx, device_id, aicpu_*, aicore_*); dlsym list updated |
| 4 × pto_runtime_c_api.cpp | Replace old entries with single device_init; onboard's dlog_setlevel reads HostLogger::get_instance().level() |
| 4 × DeviceRunner.{h,cpp} | Drop log_level_ / log_info_v_ members + set_log_level / set_log_info_v setters; run() reads HostLogger::get_instance() directly |
| ChipWorker.{h,cpp} | dlsym simpler_log_init from libsimpler_log.so, call it before host_runtime.so dlopen; replace simpler_init_fn_ + bind_executors_fn_ with single device_init_fn_; one rc check + one rollback |
| Docs | chip-level-arch.md, dynamic-linking.md, logging.md, testing.md, python/simpler/{__init__.py,_log.py}, worker_malloc/README.md all refreshed |

End-state diagram

libsimpler_log.so      ← single owner of log state
├── simpler_log_init(level, info_v)            ← only writer
└── HostLogger::get_instance().level()/.info_v() ← anyone can read

host_runtime.so       ← log args gone from every entry
├── create_device_context()
├── device_init(ctx, device_id, aicpu_*, aicore_*)
│   ├── attach_current_thread(device_id)
│   ├── set_executors(move aicpu, move aicore)
│   └── (onboard) dlog_setlevel(-1, HostLogger::get_instance().level(), 0)
├── run_runtime(ctx, runtime, callable, args, block_dim, aicpu_thread_num,
│               enable_l2_swimlane, enable_dump_tensor, enable_pmu, output_prefix)
│   └── KernelArgs.log_level = HostLogger::get_instance().level();
├── finalize_device(ctx) / destroy_device_context(ctx)
└── (ACL / comm extension surface, unchanged)

Test plan

  • pip install --no-build-isolation -e . — all 4 host_runtime.so + libsimpler_log compile clean on macOS (a2a3sim + a5sim)
  • pytest tests/ut/py — 116 passed, 7 skipped (unrelated torch-gated)
  • examples/workers/l2/{hello_worker, worker_malloc}/main.py on a2a3sim + a5sim
  • tests/ut/py/test_worker/test_bootstrap_context_sim.py (5 passed, exercises init → device_init → run end-to-end on sim)
  • Onboard CI: ut-a2a3 / ut-a5 / st-onboard-* (Linux + hardware)
  • ctest --test-dir build/ut_cpp (Linux)

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the PTO Runtime to move the binding of AICPU and AICore executor binaries from a per-run operation to a one-time initialization step. It introduces a new bind_executors C-API and updates the DeviceRunner class to store these binaries as members, significantly simplifying the run method signatures across all supported platforms. Feedback focuses on ensuring the documentation accurately reflects the simulation platform's binary loading lifecycle and addressing a potential resource leak in the initialization path where exceptions during file reading could bypass cleanup logic.

Comment thread docs/dynamic-linking.md Outdated
Comment thread src/common/worker/chip_worker.cpp Outdated
@ChaoWao ChaoWao force-pushed the refactor/hoist-executors-out-of-run-runtime branch 3 times, most recently from 5de7363 to 694c1b6 Compare May 11, 2026 05:03
@ChaoWao ChaoWao changed the title from "Refactor: hoist AICPU/AICore binaries + device_id out of run_runtime" to "Refactor: decouple logger SO + collapse simpler_init/bind_executors → device_init" May 11, 2026
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 11, 2026
…e/run_prepared

Rebase of hw-native-sys#735 on top of hw-native-sys#710 (prepared-callable framework). Two strands:

(1) Logger ownership moves entirely to libsimpler_log.so.

Before: simpler_init in host_runtime.so reached cross-SO into libsimpler_log
        for HostLogger setup, then host_runtime.so cached log_level /
        log_info_v on every DeviceRunner so run_prepared could later forward
        them to AICPU. Log state lived in three places (HostLogger, runner
        member, KernelArgs) all seeded off the same C-ABI argument.
After:  libsimpler_log.so exports its own simpler_log_init(level, info_v) C
        entry, called from ChipWorker::init BEFORE host_runtime.so is even
        dlopened. HostLogger gains level()/info_v() raw getters. Every
        consumer (host_runtime, AICPU forwarding, CANN dlog sync) reads
        from HostLogger::get_instance() directly. Log state is owned in
        exactly one place; no log argument ever travels through the
        host_runtime.so C ABI.

(2) Executor binaries hoisted out of every prepare_callable / run_prepared
    call. They were already conceptually one-shot (hw-native-sys#710's prepare_callable
    accepted and ignored them; run_prepared still threaded them per-launch
    via the C ABI just to reach DeviceRunner::run). Now:

Before: prepare_callable(ctx, cid, callable, device_id, aicpu*, aicpu_size,
                         aicore*, aicore_size)
        run_prepared(ctx, runtime, cid, args, block_dim, aicpu_thread_num,
                     device_id, aicpu*, aicpu_size, aicore*, aicore_size, ...)
After:  simpler_init(ctx, device_id, aicpu*, aicpu_size, aicore*, aicore_size)
          attach + executor takeover + (onboard) dlog sync
        prepare_callable(ctx, cid, callable)
        run_prepared(ctx, runtime, cid, args, block_dim, aicpu_thread_num,
                     enable_l2_swimlane, enable_dump_tensor, enable_pmu,
                     output_prefix)

ChipWorker no longer caches aicpu_binary_ / aicore_binary_ members; the
bytes are read once in init() and transferred into DeviceRunner-owned
vectors via simpler_init.

### Changes

libsimpler_log.so:
- HostLogger gains `int level() const` / `int info_v() const` raw getters.
- New C export `simpler_log_init(int log_level, int log_info_v)` validates
  and forwards to HostLogger setters.

host_runtime.so C ABI (`pto_runtime_c_api.h` + 4 platform impls):
- `simpler_init` signature: drops `log_level`/`log_info_v`; adds
  `aicpu_binary*, aicpu_size, aicore_binary*, aicore_size`. Impl: attach +
  runner->set_executors() + (onboard) `dlog_setlevel(HostLogger.level())`.
- `prepare_callable` signature: drops `device_id` + binary pointers (the
  upstream impl already ignored them).
- `run_prepared` signature: drops `device_id` + binary pointers. Body
  reads `runner->device_id()` for the `prepare_run_context` call; runner
  binaries are already loaded.

4 × DeviceRunner.{h,cpp}:
- New `aicpu_so_binary_` member alongside existing `aicore_kernel_binary_`.
- New `set_executors(aicpu, aicore)` setter (move-in, called from
  simpler_init); new `device_id() const` getter.
- `run()` signature loses `device_id` + binary vectors; reads from members.
- `ensure_device_initialized()` / `ensure_binaries_loaded()` argless.
- Drop `log_level_` / `log_info_v_` members + `set_log_level` /
  `set_log_info_v` setters. Onboard `run()` reads
  `HostLogger::get_instance().level() / .info_v()` directly when
  populating KernelArgs. Sim drops the dlsym'd `set_log_level_func_` /
  `set_log_info_v_func_` from the AICPU sim SO entirely — HostLogger is
  RTLD_GLOBAL so the AICPU sim SO resolves it the same way.

ChipWorker:
- Dlsym `simpler_log_init` from libsimpler_log.so (now stashed in the
  existing `g_simpler_log_handle`); call it BEFORE host_runtime.so is
  opened so any LOG_* macro firing during host_runtime's dlopen-time
  constructors already sees the right level.
- Adapt the existing `simpler_init` dlsym to the new 6-arg signature; read
  the binary bytes into local vectors that simpler_init moves into the
  runner, then drop the locals (no per-ChipWorker binary cache).
- `prepare_callable` / `run_prepared` calls drop the binary args.
- init()'s rollback path absorbs read_binary_file failures inside the same
  try/catch so partial state can't leak.
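
The rollback point in the last bullet can be illustrated with a small sketch. Everything here is hypothetical scaffolding: the `ChipWorkerSketch` class, its boolean flags standing in for the dlopen handle and the device context, and this particular `read_binary_file` body are assumptions. The only idea taken from the commit message is that the file reads sit inside the same try/catch as the rest of init, so a read failure triggers the same cleanup.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed helper: reads a whole file or throws if it cannot be opened.
std::vector<std::uint8_t> read_binary_file(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    return std::vector<std::uint8_t>(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>());
}

// Hypothetical worker; the flags stand in for real handles.
class ChipWorkerSketch {
public:
    void init(const std::string &aicpu_path, const std::string &aicore_path) {
        if (initialized_) throw std::runtime_error("already initialized");

        // Stand-ins for dlopen(host_runtime.so) and create_device_context().
        lib_open_ = true;
        ctx_created_ = true;

        try {
            // File reads are INSIDE the guarded region, so an I/O failure
            // takes the same rollback path as a failing init call would.
            std::vector<std::uint8_t> aicpu = read_binary_file(aicpu_path);
            std::vector<std::uint8_t> aicore = read_binary_file(aicore_path);
            // Real code would hand both buffers to the init entry here and
            // throw on a nonzero return code.
            executor_bytes_ = aicpu.size() + aicore.size();
        } catch (...) {
            rollback();
            throw;  // the caller still sees the original error
        }
        initialized_ = true;
    }

    bool initialized() const { return initialized_; }
    bool holds_resources() const { return lib_open_ || ctx_created_; }
    std::size_t executor_bytes() const { return executor_bytes_; }

private:
    // Release in reverse order of acquisition.
    void rollback() {
        ctx_created_ = false;
        lib_open_ = false;
        executor_bytes_ = 0;
    }

    bool lib_open_ = false;
    bool ctx_created_ = false;
    bool initialized_ = false;
    std::size_t executor_bytes_ = 0;
};
```

If the reads happened before the try block, an exception from a missing file would leave the library handle and device context allocated with nothing to release them.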

Docs (chip-level-arch, dynamic-linking, logging, testing,
python/simpler/__init__.py, python/simpler/_log.py): updated ABI listings,
init-flow ASCII art, and "configuration flow" table to reflect new shape.

Verified locally on a2a3sim + a5sim:
- pip install --no-build-isolation -e .  (all 4 host_runtime.so + libsimpler_log compile)
- pytest tests/ut/py  (212 passed, 7 skipped; 4 pre-existing torch-missing failures unrelated)
- examples/workers/l2/{hello_worker, worker_malloc} on both sims

Onboard ut + st coverage runs in CI (Linux).
@ChaoWao ChaoWao force-pushed the refactor/hoist-executors-out-of-run-runtime branch from 694c1b6 to 7b7c60d Compare May 11, 2026 09:01
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 11, 2026
…e/run_prepared

@ChaoWao ChaoWao force-pushed the refactor/hoist-executors-out-of-run-runtime branch from 7b7c60d to 14ad620 Compare May 11, 2026 12:38
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 12, 2026
…e/run_prepared

@ChaoWao ChaoWao force-pushed the refactor/hoist-executors-out-of-run-runtime branch from 14ad620 to 4162d49 Compare May 12, 2026 00:41
@ChaoWao ChaoWao force-pushed the refactor/hoist-executors-out-of-run-runtime branch from 4162d49 to 26a4684 Compare May 12, 2026 00:55
@ChaoWao ChaoWao merged commit 6273dd3 into hw-native-sys:main May 12, 2026
27 of 28 checks passed
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request May 12, 2026
…pWorker::init to Python

Continues the API-narrowing theme of hw-native-sys#723 / hw-native-sys#735. ChipWorker::init was the
last place in C++ doing process-wide SO bootstrap (dlopen libsimpler_log.so
and, on sim, libcpu_sim_context.so with RTLD_GLOBAL, plus calling
libsimpler_log.so's simpler_log_init to seed HostLogger). That work moves up
into the Python `ChipWorker` wrapper, shrinking the C++ init signature from
8 args to 4.

Before:
  void ChipWorker::init(host_lib, aicpu, aicore, simpler_log_lib,
                        device_id, sim_context_lib = "",
                        log_level = 1, log_info_v = 5);
After:
  void ChipWorker::init(host_lib, aicpu, aicore, device_id);

### Why this is safe

`_task_interface.so` (the nanobind module that contains chip_worker.cpp) has
no undefined HostLogger / unified_log_* symbols — chip_worker.cpp reaches
host_runtime.so purely via dlsym, and the binding code itself doesn't log. So
the RTLD_GLOBAL preload only has to precede the `_ChipWorker.init` dlopen of
host_runtime.so, not module import. The Python wrapper does exactly that:

  1. ctypes.CDLL(bins.simpler_log_path, mode=RTLD_GLOBAL)   # once per process
  2. <handle>.simpler_log_init(log_level, log_info_v)       # seed HostLogger
  3. if bins.sim_context_path:                              # sim only
       ctypes.CDLL(bins.sim_context_path, mode=RTLD_GLOBAL)
  4. self._impl.init(host_path, aicpu_path, aicore_path, device_id)

A module-level `_preloaded_globals: dict[str, ctypes.CDLL]` makes the loads
idempotent per path — the Python counterpart of the C++ side's old
std::once_flag.
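
For comparison, a per-path idempotent preload can be sketched in C++ as well. This is not the removed `ensure_simpler_log_loaded()` code and not the Python helper. `PreloadRegistry`, `LibHandle`, and the injected `Loader` are hypothetical names, and the loader is passed in so that the sketch does not depend on any real shared object being present. A real loader on POSIX would presumably wrap `dlopen(path.c_str(), RTLD_NOW | RTLD_GLOBAL)`.

```cpp
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

using LibHandle = void *;  // what dlopen would return
using Loader = std::function<LibHandle(const std::string &)>;

// Hypothetical registry: each distinct path is loaded at most once.
class PreloadRegistry {
public:
    explicit PreloadRegistry(Loader loader) : loader_(std::move(loader)) {}

    // Returns the cached handle for a path, loading it on first use.
    LibHandle preload(const std::string &path) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = handles_.find(path);
        if (it != handles_.end()) return it->second;  // already loaded
        LibHandle handle = loader_(path);
        // Failed loads are not cached, so a later call may retry.
        if (handle != nullptr) handles_.emplace(path, handle);
        return handle;
    }

private:
    Loader loader_;
    std::mutex mutex_;
    std::map<std::string, LibHandle> handles_;
};
```

Keying on the path rather than using a single once-flag lets the same helper cover both libsimpler_log.so and libcpu_sim_context.so.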

### Changes

src/common/worker/chip_worker.{h,cpp}:
- init() drops simpler_log_lib_path, sim_context_lib_path, log_level,
  log_info_v params.
- Remove the g_simpler_log_* / g_sim_context_* file-scope globals,
  ensure_simpler_log_loaded(), ensure_sim_context_loaded(), the
  SimplerLogInitFn typedef + simpler_log_init_fn_ member, and the
  simpler_log_init call. Drop the now-unused <mutex> include.
- init()'s body is just: dlopen host_runtime.so RTLD_LOCAL → dlsym → create
  device ctx → read executor binaries → simpler_init.

python/bindings/task_interface.cpp:
- `_ChipWorker.init` nanobind def: 4 args (host_lib_path, aicpu_path,
  aicore_path, device_id).

python/simpler/task_interface.py:
- New module-level `_preloaded_globals` registry + `_preload_global(path)`
  helper (ctypes.CDLL RTLD_GLOBAL, one per path).
- ChipWorker.init: preload libsimpler_log.so + call simpler_log_init via the
  ctypes handle, preload libcpu_sim_context.so when bins.sim_context_path is
  set, then call the 4-arg _impl.init. Wrapper's public signature
  (device_id, bins, log_level=None, log_info_v=None) is unchanged, so no
  caller updates needed.

tests/ut/py/test_chip_worker.py:
- The three `_ChipWorker.init(...)` fault-path tests drop the
  `/nonexistent/libsimpler_log.so` argument (no longer a parameter).

Docs (chip-level-arch, dynamic-linking, logging, python/simpler/__init__.py,
python/simpler/_log.py): updated the init-flow ASCII art / load-order section
/ configuration-flow table to show the preload happening in the Python
wrapper before the C++ _ChipWorker.init.

Verified locally on a2a3sim + a5sim:
- pip install --no-build-isolation -e .
- pytest tests/ut/py  (119 passed, 7 skipped; torch-missing tests excluded as before)
- examples/workers/l2/{hello_worker, worker_malloc} on both sims

Onboard ut + st coverage runs in CI (Linux).
poursoul added a commit to poursoul/simpler that referenced this pull request May 12, 2026
Hoist `_ensure_prepared` out of the two chip-child loops into a single
module-level helper so the lazy/eager prepare branches stay in sync.
Add UT coverage for the lazy-prewarm fallback path, which previously
had none. Unify the `prepared_callable_path_used_` comment across all
four device_runner.h headers and note the legacy-path assumption it
rests on.

Clears the remaining follow-up items from PR hw-native-sys#710 / hw-native-sys#735.
ChaoWao added a commit that referenced this pull request May 12, 2026
…pWorker::init to Python (#746)
