[Prompt registry] Phase 2 — Port static-tier sections behind the snap…#3659
[Prompt registry] Phase 2 — Port static-tier sections behind the snap…#3659VascoSch92 wants to merge 2 commits into
Conversation
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
all-hands-bot
left a comment
There was a problem hiding this comment.
⚠️ QA Report: PASS WITH ISSUES
The new static prompt registry works as an additive SDK API and matches legacy section content after documented gap normalization, but literal byte-for-byte parity with Agent.static_system_message is not true.
Does this PR achieve its stated goal?
Partially. I exercised the SDK as a user would by importing build_default_registry(), constructing real Agent/LLM objects, building prompt contexts, and comparing the registry output to the existing static prompt path across default, analyzer, sandbox-tier, browser-enabled, and Windows-simulated scenarios. Those scenarios matched after the PR's documented inter-section gap normalization and produced no dynamic block, so the static-tier port is functionally usable behind the snapshot. However, exact equality was False for the real default prompt (15187 vs 15191 bytes), so the PR does not literally satisfy the repeated byte-for-byte parity claim.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully. |
| CI Status | 🟡 21 checks passing, 8 pending, 2 skipped at review time; no checks were rerun. |
| Functional Verification |
Functional Verification
Test 1: Establish base behavior without the PR registry
Step 1 — Reproduce / establish baseline (without the fix):
Ran git worktree add --detach /tmp/qa-pr3659-base origin/main and then executed the SDK from that base worktree:
base_has_default_registry= False
base_static_len= 15191
base_starts= You are OpenHands agent, a helpful AI assistant that can interact with a computer to solve tasks.
base_has_role= True
base_has_browser= False
base_has_security_policy= True
This shows the base branch has the legacy Agent.static_system_message path but no openhands.sdk.context.prompts.default_registry entry point.
Step 2 — Apply the PR's changes:
Returned to PR head 3fd47bf95719a408dfc5132369a20d811c655da1 on vasco/static-part after bootstrapping with make build.
Step 3 — Re-run with the PR in place:
Executed real SDK agents and the new registry API:
CASE=default_gpt5
registry_static_len= 15187
legacy_static_len= 15191
exact_equal= False
canonical_equal= True
dynamic_is_none= True
has_security= True
has_security_risk= True
has_important= True
CASE=anthropic_with_analyzer
registry_static_len= 14489
legacy_static_len= 14493
exact_equal= False
canonical_equal= True
dynamic_is_none= True
has_security= True
has_security_risk= True
has_important= True
CASE=sandbox_risk_tiers
registry_static_len= 15139
legacy_static_len= 15143
exact_equal= False
canonical_equal= True
dynamic_is_none= True
has_security= True
has_security_risk= True
has_important= True
This shows the new API is usable and the static bodies match after gap normalization, but exact byte-for-byte parity with the live prompt path is not achieved.
Test 2: Browser, Windows, and edge-path behavior
Step 1 — Reproduce / establish baseline (without the fix):
On base, the new registry import was absent, so these sectionized paths could not be exercised there.
Step 2 — Apply the PR's changes:
Used the PR branch's new registry and section classes directly with SDK PromptContext/Agent objects.
Step 3 — Re-run with the PR in place:
Ran browser-enabled, Windows-simulated, and disabled-security-policy cases:
CASE=browser_enabled_agent
ctx_enable_browser= True
registry_has_browser= True
legacy_has_browser= True
canonical_equal= True
CASE=browser_guard_direct_context
disabled= False
enabled= True
CASE=windows_agent
ctx_platform= windows
registry_contains_powershell= True
registry_contains_execute_powershell= False
canonical_equal= True
CASE=security_policy_disabled
registry_static_len= 13674
legacy_static_len= 13703
exact_equal= False
canonical_equal= False
dynamic_is_none= True
has_security= False
has_security_risk= True
has_important= True
This confirms browser and Windows paths behave as expected after normalization. The disabled security policy divergence matches the PR notes and is not runtime-wired yet, but it reinforces that the new registry is not a literal byte-for-byte replacement for all legacy static prompt outputs.
Exact mismatch evidence
Ran a compact diff summary on the default PR prompt:
exact_equal= False
registry_len= 15187 legacy_len= 15191 delta= 4
first_diff_index= 9476
registry_excerpt= 'ser has explicitly requested and would expect\n\n</SECURITY>\n\n<SECURITY_RISK_ASSESSMENT>\n# Security Risk Policy\nWhen using tools that support '
legacy_excerpt= 'ser has explicitly requested and would expect\n\n</SECURITY>\n\n\n<SECURITY_RISK_ASSESSMENT>\n# Security Risk Policy\nWhen using tools that support'
This pinpoints the observed difference as inter-section blank-line spacing.
Issues Found
- 🟡 Minor: The new registry is functionally usable and section bodies match after normalization, but
registry.build(ctx).staticis not literally byte-for-byte equal toAgent.static_system_messagedespite the docstring/PR wording implying that level of parity.
This review was created by an AI agent (OpenHands) on behalf of the user.
|
|
||
| ``build_default_registry()`` registers the static-tier sections in the exact order | ||
| ``agent/prompts/system_prompt.j2`` emits them, so ``registry.build(ctx).static`` | ||
| reproduces ``AgentBase.static_system_message``. The dynamic-tier sections |
There was a problem hiding this comment.
🟡 Minor: QA execution against a real Agent showed registry.build(ctx).static is not byte-for-byte equal to Agent.static_system_message (default gpt-5: 15187 vs 15191 bytes; first difference is an extra legacy blank line before <SECURITY_RISK_ASSESSMENT>). Since this docstring says the registry reproduces the legacy message and the PR description emphasizes byte-for-byte parity, please either preserve exact inter-section spacing here or clarify that this is canonicalized/inter-section-normalized parity until the Phase 3 cutover.
This comment was created by an AI agent (OpenHands) on behalf of the user.
HUMAN:
This is pahse 2 of #3606. In this PR, we just port the static part of the system prompt inside the new structure.
AGENT:
Why
Part of the prompt-registry roadmap (#3606; proposal #2827), which replaces the monolithic
system_prompt.j2+ post-renderrefine()with a typed section registry whose static/dynamic cache split is declared per-section andunit-testable.
This PR is the static-tier half of Phase 2 (#3610): it ports the ~17 static blocks of
system_prompt.j2into typedPromptSectionclasses assembled by a default registry, soregistry.build(ctx).staticreproduces today'sprompt byte-for-byte. It is purely additive and lands behind the Phase 0 snapshot — nothing is wired into the runtime prompt yet (that is the Phase 3 cutover).
Summary
PromptSectionclasses (no Jinja) incontext/prompts/sections/static.py, ported verbatim fromsystem_prompt.j2, plusbuild_default_registry()(context/prompts/default_registry.py) registering them intemplate order.
registry.build(ctx).staticreproducesAgentBase.static_system_messagebyte-for-byte across the full Phase 0 matrix (model family × browser × security-analyzer × cli_mode) and the win32 cell; every section also has astandalone unit test (no Jinja environment).
system_prompt.j2, the snapshot oracle, or any runtime path.refine()(win32 shell-term swap) is applied only in the two blocks that actually contain shell terms.Issue Number
#3610 (roadmap #3606, proposal #2827)
How to Test
Library-only change with no runtime wiring yet (behind the snapshot), so the end-to-end evidence is the byte-for-byte equivalence test — it builds real
Agentinstances (not mocks), renders the realstatic_system_message,and asserts the registry reproduces it:
Regression (legacy render path untouched) + Phase 1:
Full lint/type gate:
Video/Screenshots
N/A — no UI or runtime surface (the registry is not wired into the prompt yet). Evidence is the test output above; the equivalence test exercises real
Agent/LLMobjects, not mocks.Type
Notes
static_system_messageis not yet routed through the registry — that is the Phase 3 cutover. This PR only proves equivalence.{% if %}blocks. The equivalence test normalizes only those</TAG>…3+ blanks…<TAG>boundaries (onetag-anchored regex), so every section body is asserted byte-for-byte. The literal whitespace shift lands at the Phase 3 cutover.
static.pycarries a file-level# ruff: noqa: E501for its verbatim long prompt lines;pyproject.tomlis untouched.<SECURITY>is guarded onsecurity_policy_filename(omitted when empty, vs. the template's empty tags); a customsecurity_policy_filenamewould resolve its contentinto the context (follow-up).
DateTime/RepoContext/AvailableSkills/CustomSuffix/CustomSecrets) fromsystem_message_suffix.j2.