Refactor: Arg dep API — primitive set_dependencies + ArgWithDeps<N> convenience layer #761

Merged
poursoul merged 2 commits into hw-native-sys:main from jvjhfhg:refactor/length-limited-api on May 13, 2026

Conversation

@jvjhfhg
Collaborator

@jvjhfhg jvjhfhg commented May 12, 2026

Reworks the explicit-dependency API into two layers and lifts the previous hard cap on dependency count.

Primitive layer — Arg::set_dependencies(const PTO2TaskId*, uint32_t)

  • Takes a caller-owned dependency array (ptr + count) instead of variadic PTO2TaskIds, lifting the hard PTO2_MAX_EXPLICIT_DEPS = 16 runtime cap
  • Arg stores (ptr, count) without copying, matching add_input / add_output lifetime semantics — the caller's array must outlive the submit
  • count == 0 explicitly clears any stored deps, so conditionally-built dep arrays can pass through unguarded; count > 0 is single-shot to preserve the no-accumulation
    invariant
  • Drops ExplicitDepStorage struct, PTO2_MAX_EXPLICIT_DEPS macro, and the a2a3 runtime/dep_gen static_assert (DEP_GEN_MAX_EXPLICIT_DEPS = 16 is now a diagnostic-only
    truncation cap, unchanged)
  • Updates docs/manual-scope.md (API, rules, examples)
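
The rules in the list above can be sketched with a small mock. This is not the runtime's Arg: MockArg, dep_count, dep_ptr, bind_conditional, and the error string are hypothetical names chosen for illustration, and reporting the second non-empty call through has_error is an assumption about how the single-shot check surfaces.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in; the real PTO2TaskId lives in the runtime headers.
struct PTO2TaskId { std::uint64_t raw; };

// Mock of the primitive layer, modelling only the dependency rules.
class MockArg {
public:
    // Stores (ptr, count) without copying, so the caller's array
    // must stay alive until the task has been submitted.
    void set_dependencies(const PTO2TaskId* deps, std::uint32_t count) {
        if (count == 0) {        // explicit clear, always allowed
            deps_ = nullptr;
            count_ = 0;
            return;
        }
        if (count_ != 0) {       // single-shot: no accumulation across calls
            error_ = "set_dependencies called twice with count > 0";
            return;
        }
        deps_ = deps;
        count_ = count;
    }

    std::uint32_t dep_count() const { return count_; }
    const PTO2TaskId* dep_ptr() const { return deps_; }
    bool has_error() const { return error_ != nullptr; }
    const char* error_msg() const { return error_; }

private:
    const PTO2TaskId* deps_ = nullptr;
    std::uint32_t count_ = 0;
    const char* error_ = nullptr;
};

// A conditionally built dep list passes through unguarded:
// an empty vector simply clears, a non-empty one binds once.
inline std::uint32_t bind_conditional(MockArg& arg,
                                      const std::vector<PTO2TaskId>& deps) {
    arg.set_dependencies(deps.data(),
                         static_cast<std::uint32_t>(deps.size()));
    return arg.dep_count();
}
```

In this sketch a vector of 32 ids binds without error, since nothing in the pointer-and-count form depends on the old limit of 16. The vector has to outlive the submit because the mock, like the described Arg, keeps only the pointer.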

Convenience layer — ArgWithDeps<N> (default N=16)

A thin wrapper on top of the primitive layer that revives the previous add_dep(...) ergonomics for hand-written orchestration.

  • New header pto_arg_with_deps.h, auto-included at the bottom of pto_orchestration_api.h so orchestration sources still need only one #include
  • Private inheritance from Arg with selective using-declarations exposes the Arg setter surface (add_input / add_output / add_inout / add_no_dep / add_scalar* /
    has_error / error_msg / launch_spec) while keeping set_dependencies and the explicit_dep* accessors unreachable on a wrapper instance — users cannot accidentally
    mix the two dep APIs on the same object
  • Variadic add_dep(...) accumulates into a stack-sized buffer of capacity N; overflow reports an error on the underlying Arg ("bump the template arg")
  • reset() clears both layers; finalize_for_submit() is idempotent so a wrapper can be re-submitted without tripping the primitive layer's single-shot check
  • New rt_submit_task / rt_submit_aic_task / rt_submit_aiv_task overloads accept ArgWithDeps<N>& and call finalize_for_submit() transparently — no caller-visible
    finalize step
  • pypto-generated orchestration can ignore the convenience layer entirely and target the primitive set_dependencies(ptr, count) directly
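
A minimal sketch of how such a wrapper could be shaped, under stated assumptions. MockArg, MockArgWithDeps, mock_submit, and bound_count are hypothetical names, not the contents of pto_arg_with_deps.h. The clear-then-rebind approach in finalize_for_submit is one way to get idempotence on top of the count == 0 rule; the PR does not say how the real header does it.

```cpp
#include <cstddef>
#include <cstdint>

struct PTO2TaskId { std::uint64_t raw; };  // hypothetical stand-in

// Minimal mock of the primitive layer (not the real Arg).
class MockArg {
public:
    void set_dependencies(const PTO2TaskId* deps, std::uint32_t count) {
        if (count == 0) { deps_ = nullptr; count_ = 0; return; }
        if (count_ != 0) { set_error("set_dependencies is single-shot"); return; }
        deps_ = deps;
        count_ = count;
    }
    std::uint32_t dep_count() const { return count_; }
    bool has_error() const { return error_ != nullptr; }
    const char* error_msg() const { return error_; }
    void reset() { deps_ = nullptr; count_ = 0; error_ = nullptr; }
protected:
    void set_error(const char* msg) { if (!error_) error_ = msg; }
private:
    const PTO2TaskId* deps_ = nullptr;
    std::uint32_t count_ = 0;
    const char* error_ = nullptr;
};

// Private inheritance hides everything; using-declarations re-export
// only the parts a wrapper user should see.
template <std::size_t N = 16>
class MockArgWithDeps : private MockArg {
public:
    using MockArg::has_error;
    using MockArg::error_msg;
    // set_dependencies and dep_count are deliberately not re-exported.

    // Variadic add_dep accumulates into the fixed-capacity buffer.
    template <typename... Ids>
    void add_dep(Ids... ids) {
        int expand[] = {0, (push(ids), 0)...};  // C++11 pack expansion
        (void)expand;
    }

    // Clears both the wrapper buffer and the primitive layer.
    void reset() { MockArg::reset(); used_ = 0; }

    // Clear, then rebind: repeated calls never trip the single-shot check.
    void finalize_for_submit() {
        MockArg::set_dependencies(nullptr, 0);
        MockArg::set_dependencies(buf_, static_cast<std::uint32_t>(used_));
    }

    // Inspection hook for this sketch only.
    std::uint32_t bound_count() const { return MockArg::dep_count(); }

private:
    void push(PTO2TaskId id) {
        if (used_ == N) { set_error("too many deps: bump the template arg"); return; }
        buf_[used_++] = id;
    }
    PTO2TaskId buf_[N];
    std::size_t used_ = 0;
};

// Stand-in for a submit overload: finalizes transparently.
template <std::size_t N>
std::uint32_t mock_submit(MockArgWithDeps<N>& arg) {
    arg.finalize_for_submit();
    return arg.bound_count();
}
```

Because the base is private, calling set_dependencies on a MockArgWithDeps instance fails to compile, which is the property the description relies on to keep the two dep APIs from being mixed on one object.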

Example migration

  • 4 paged_attention orchestrations (a2a3 / a5 × manual_scope / unroll_manual_scope) migrated to the primitive layer
  • paged_attention_manual_scope (both a2a3 and a5) further demonstrates both layers side-by-side: params_sf keeps Arg + set_dependencies, params_up switches to
    ArgWithDeps + add_dep, each with a comment marking its intended use case
  • tests/st/{a2a3,a5}/.../dummy_task (introduced by Feat: dummy_task — dep-only task that bypasses AICore dispatch #754) migrated to set_dependencies as part of the refactor

Verification

  • a5sim: paged_attention manual_scope + unroll — PASS
  • a2a3 hardware: paged_attention manual_scope (Case1, CaseSmall1), dummy_task (SingleDummyAutoDep, LongDummyChain, DummyExplicitDepBarrier) — PASS


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request replaces the Arg.add_dep method with Arg.set_dependencies, moving from a variadic addition model to a pointer-and-count array model for explicit task dependencies. This change removes the previous hard runtime limit on the number of dependencies per task and shifts storage ownership to the caller, requiring the dependency array to remain valid until the task is submitted. The update includes comprehensive changes to documentation, orchestration examples, and the runtime implementation for both a2a3 and a5 platforms. I have no feedback to provide.

@uv-xiao
Contributor

uv-xiao commented May 13, 2026

@jvjhfhg What is the purpose of this API change? Is it intended to remove the PTO2_MAX_EXPLICIT_DEPS=16 runtime cap? I noticed that example code became longer with the new API—do you think this tradeoff is worthwhile?

@jvjhfhg
Collaborator Author

jvjhfhg commented May 13, 2026

@uv-xiao The pypto team reported that the dependency count limit was too low when managing dependencies manually. So yes, the purpose IS to remove the PTO2_MAX_EXPLICIT_DEPS limit.

In terms of flexibility, it's a pure gain. But I do admit it could hurt convenience when hand-writing orchestration code. I'm considering providing the following helper struct and reviving the add_dep API within it. What do you think?

template <size_t MAX_DEP_COUNT = 16>
struct ArgWithDeps {
    PTO2TaskId deps[MAX_DEP_COUNT];
    int count;
    Arg arg;
};

@uv-xiao
Contributor

uv-xiao commented May 13, 2026

I see. I agree that if the orchestration code will mainly be generated by pypto, hand-writing convenience doesn't matter much. And the proposed ArgWithDeps also looks good to me.

Thanks!

jvjhfhg added 2 commits May 13, 2026 15:50
- Take a caller-owned dependency array (ptr + count) instead of variadic
  PTO2TaskIds; lifts the hard PTO2_MAX_EXPLICIT_DEPS=16 runtime cap
- Arg stores (ptr, count) without copying, matching add_input/add_output
  lifetime semantics — the caller's array must outlive the submit
- count == 0 explicitly clears any stored deps, so conditionally-built
  dep arrays can pass through unguarded; count > 0 is single-shot to
  preserve the no-accumulation invariant
- Drop ExplicitDepStorage struct, PTO2_MAX_EXPLICIT_DEPS macro, and the
  a2a3 runtime/dep_gen static_assert (DEP_GEN_MAX_EXPLICIT_DEPS=16 is now
  a diagnostic-only truncation cap, unchanged)
- Migrate the four paged_attention orchestration examples to build the
  dep set on the stack and call set_dependencies once
- Update docs/manual-scope.md API, rules, and examples
- New header pto_arg_with_deps.h defines ArgWithDeps<N> (default N=16):
  private inheritance from Arg so set_dependencies/explicit_dep* stay
  hidden, with selective using-declarations exposing the Arg setter
  surface plus a variadic add_dep(...) that accumulates into a stack
  buffer; reset() clears both layers; finalize_for_submit() binds the
  buffer back via set_dependencies(ptr, count) and is idempotent so a
  wrapper can be re-submitted without tripping the single-shot check
- pto_orchestration_api.h auto-includes the wrapper header at the
  bottom so orchestration sources keep a single include
- rt_submit_task / rt_submit_aic_task / rt_submit_aiv_task gain
  overloads that accept ArgWithDeps<N>& and call finalize_for_submit()
  transparently, no caller-visible finalize step
- Demonstrate both layers side-by-side in paged_attention_manual_scope
  (a2a3 and a5): params_sf keeps the primitive Arg+set_dependencies
  form, params_up switches to ArgWithDeps+add_dep, with comments
  marking each pattern's intended use case
@jvjhfhg jvjhfhg force-pushed the refactor/length-limited-api branch from 1cda8c8 to 8ac1c83 Compare May 13, 2026 07:54
@jvjhfhg jvjhfhg changed the title from "Refactor: replace Arg.add_dep with set_dependencies(deps, count)" to "Refactor: Arg dep API — primitive set_dependencies + ArgWithDeps<N> convenience layer" May 13, 2026
@poursoul poursoul merged commit 9e6e5e2 into hw-native-sys:main May 13, 2026
14 checks passed
@jvjhfhg jvjhfhg deleted the refactor/length-limited-api branch May 13, 2026 08:43