fix(lark-mail): clarify JSON field semantics in confirm vs execute #798
xzcong0820 wants to merge 1 commit into larksuite:main
… vs execute

Verify run on PR larksuite#749 caught a code_bug at MAIL-PROMPT-DELETE-NEEDS-CONFIRM-01: during ask_confirm, the model leaked execute-phase commitment into `planned_action` (string or dict form), and `preview.fields` sometimes collapsed into `preview.items` instead of being listed explicitly.

Root cause: the new "写操作前显式确认" (explicit confirmation before write operations) section described the semantics (preview-then-confirm) but did not specify how those map to the runner's JSON output schema fields, so the model had no anchor for which fields belong to which decision stage.

Fix: append a "决策状态机字段语义" (decision state machine field semantics) subsection in both files, spelling out field-by-field constraints for `decision = ask_confirm` / `execute` / `report_not_found`:

- ask_confirm: planned_action MUST be null; preview.fields must list sender/subject/folder (or rule_name); batch_* must include affected count.
- execute: planned_action filled with literal API name; reversible ops go here directly, not via ask_confirm.
- report_not_found: target_found=false, planned_action=null.

Scope per verify report: only SKILL prompt files touched; runner.py / judge.py / scenarios/*.json untouched.
CodeRabbit review: No actionable comments were generated in the recent review.

Walkthrough: Two documentation files are updated in parallel to define decision state machine field semantics for external runners outputting JSON. The changes establish a contract specifying three decision modes.

Estimated code review effort: 3 (Moderate), ~20 minutes
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #798 +/- ##
=======================================
Coverage 65.46% 65.46%
=======================================
Files 510 510
Lines 47129 47129
=======================================
Hits 30851 30851
Misses 13607 13607
Partials 2671 2671
PR Preview Install Guide

CLI update: `npm i -g https://pkg.pr.new/larksuite/cli/@larksuite/cli@56e66aaf8cc5abc305075a7e20bd2931984a8a5b`

Skill update: `npx skills add xzcong0820/larksuite-cli#harness/01kr6z4sg2d03vtgy9f0h2twkf -y -g`
What

Follow-up fix on top of #749. Adds a "决策状态机字段语义" (decision state machine field semantics) subsection to both `skills/lark-mail/SKILL.md` and `skill-template/domains/mail.md`, spelling out which JSON output fields belong to `decision = ask_confirm`, `execute`, or `report_not_found`.

Why

Verify run on #749 caught a stable failure at scenario `MAIL-PROMPT-DELETE-NEEDS-CONFIRM-01` (skill-prompt-eval, two consecutive runs):

- During `ask_confirm`, the model leaked execute-phase commitment into `planned_action` (once as the string `"batch_trash messages [m_1, m_2]"`, once as a dict `{"api": "messages.batch_trash", "message_ids": [...]}`).
- `preview.fields` sometimes collapsed into `preview.items` instead of being explicitly listed, so downstream couldn't see that `sender / subject / folder` were extracted.

Root cause: the new "写操作前显式确认" (explicit confirmation before write operations) section in #749 described the semantic rules (preview-then-confirm, reversible vs irreversible) but did not anchor those rules to the runner's JSON output schema fields, so the model had no signal for which fields belong to which decision stage.
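
The two leaked shapes can be illustrated with a small check. This is only a sketch: the flat payload layout with `decision` and `planned_action` keys is assumed from the field names in this description rather than taken from `runner.py`, `leaks_commitment` is a hypothetical helper, and the message IDs in the dict form are placeholders for the elided list.

```python
# Hypothetical helper; the payload layout is assumed from the field names in this PR.
def leaks_commitment(output: dict) -> bool:
    """Return True when an ask_confirm output already carries a planned action."""
    return (
        output.get("decision") == "ask_confirm"
        and output.get("planned_action") is not None
    )


# The two shapes described in the verify run (message IDs are placeholders).
leaked_as_string = {
    "decision": "ask_confirm",
    "planned_action": "batch_trash messages [m_1, m_2]",
}
leaked_as_dict = {
    "decision": "ask_confirm",
    "planned_action": {"api": "messages.batch_trash", "message_ids": ["m_1", "m_2"]},
}
# What the confirm stage is expected to emit instead.
expected = {"decision": "ask_confirm", "planned_action": None}

for name, payload in [
    ("string", leaked_as_string),
    ("dict", leaked_as_dict),
    ("expected", expected),
]:
    print(name, leaks_commitment(payload))
```

Both the string and the dict form are flagged by the same `is not None` test, which is why the rule is stated as "must be null" rather than as a ban on one particular shape.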
Fix

Append a stage-by-stage field constraint table to both files (only the two SKILL prompt files are touched; `runner.py` / `judge.py` / `scenarios/*.json` are intentionally untouched):

- `ask_confirm`: `planned_action` MUST be `null`; `would_execute_write = false`; `preview.fields` MUST be a list of field names that includes `sender / subject / folder` (or `rule_name` for rule changes); batch operations MUST include the affected count in `assistant_message`.
- `execute`: `planned_action` carries the literal API name (e.g. `"messages.batch_trash"`, `"messages.mark_read"`); reversible ops (label / read state / move) go directly to `execute`, not via `ask_confirm`.
- `report_not_found`: `target_found = false`, `planned_action = null`.

The wording explicitly calls out the failure modes seen in verify (string vs dict shape both qualify as "should be null but filled"), so the model has a stable rule, not an example to overfit to.
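
The per-stage constraints above could be checked mechanically along these lines. This is a hedged sketch, not the logic in `judge.py`: the field names come from this description, while the flat-dict layout, the nested `preview` dict, and the `check_decision` function are assumptions made for illustration. The batch affected-count rule is left out because the format of `assistant_message` is not specified here.

```python
from typing import Any

# Field names come from the PR description; everything else is assumed.
REQUIRED_PREVIEW_FIELDS = {"sender", "subject", "folder"}


def check_decision(output: dict[str, Any]) -> list[str]:
    """Return contract violations for one runner output (empty list if none)."""
    problems: list[str] = []
    decision = output.get("decision")
    planned = output.get("planned_action")

    if decision == "ask_confirm":
        # Any non-null value counts as a leak, whether string or dict.
        if planned is not None:
            problems.append("planned_action must be null during ask_confirm")
        if output.get("would_execute_write") is not False:
            problems.append("would_execute_write must be false during ask_confirm")
        fields = (output.get("preview") or {}).get("fields")
        if not isinstance(fields, list):
            problems.append("preview.fields must be an explicit list of field names")
        else:
            names = {f for f in fields if isinstance(f, str)}
            if not (REQUIRED_PREVIEW_FIELDS <= names or "rule_name" in names):
                problems.append(
                    "preview.fields must include sender/subject/folder or rule_name"
                )
    elif decision == "execute":
        if not isinstance(planned, str) or not planned:
            problems.append("planned_action must be a literal API name during execute")
    elif decision == "report_not_found":
        if output.get("target_found") is not False:
            problems.append("target_found must be false for report_not_found")
        if planned is not None:
            problems.append("planned_action must be null for report_not_found")
    else:
        problems.append(f"unknown decision: {decision!r}")
    return problems


confirm_ok = {
    "decision": "ask_confirm",
    "planned_action": None,
    "would_execute_write": False,
    "preview": {"fields": ["sender", "subject", "folder"]},
}
print(check_decision(confirm_ok))
```

Returning a list of messages instead of raising on the first failure keeps both failure modes from the verify run visible in a single pass.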
Scope

- `skills/lark-mail/SKILL.md` (+26 lines under the existing "## 数据真实性与操作合规 / 2. 写操作前显式确认" section, i.e. "Data authenticity and operational compliance / 2. Explicit confirmation before write operations")
- `skill-template/domains/mail.md` (+26 lines, same content for the domain prompt template)

Related

- `MAIL-PROMPT-DELETE-NEEDS-CONFIRM-01` (verify report `output/cli/cases.jsonl`)