Skip to content

docs(patrol): 2026-06-09 文档维护#694

Closed
aaronwong1989 wants to merge 1 commit into
hrygo:mainfrom
aaronwong1989:docs/patrol-2026-06-09
Closed

docs(patrol): 2026-06-09 文档维护#694
aaronwong1989 wants to merge 1 commit into
hrygo:mainfrom
aaronwong1989:docs/patrol-2026-06-09

Conversation

@aaronwong1989

Copy link
Copy Markdown
Collaborator

Summary

  • cron-design.md: 补充 Delivery 投递重试机制(内存队列、指数退避、临时/永久错误分类),更新权衡§5 Gateway 投递可靠性
  • metrics.md: 新增 hotplex.cron.delivery.result 指标(status: success/exhausted/permanent)

变更窗口

af9bc31f 以来 3 个提交,核心变更 5a23485e(feat(cron,ocs): OCS SystemPromptUpdater + cron delivery retry)

Closes #693

🤖 Generated with Claude Code

- cron-design.md: document in-memory retry queue (max 100, 3 attempts,
  30s→1m→2m backoff), update reliability tradeoff §5
- metrics.md: add hotplex.cron.delivery.result counter

🤖 Generated with [Claude Code](https://claude.com/claude-code)

@hotplex-ai hotplex-ai left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — hotplex-ai

Verdict: REQUEST_CHANGES | P0:0 P1:1 P2:3 P3:3

P1 — 重试参数描述与代码不一致

docs/explanation/cron-design.md:209 — "最多重试 3 次,指数退避 30s → 1m → 2m(上限 5m)" 与代码不符。

delivery.go:97 的条件是 attempt < maxRetryAttempts(3),实际流程:

  • attempt=1(首次投递)失败 → 排队重试,退避 30s
  • attempt=2(重试 1)失败 → 排队重试,退避 60s
  • attempt=3(重试 2)失败 → exhausted3 < 3 为 false)

代码注释 delivery.go:268 也明确写了:"actual retry window is ~90s (30s + 60s)"

建议改为:

最多 3 次投递尝试(含首次,即最多重试 2 次),指数退避 30s → 60s

P2 — 需改进

  1. docs/reference/metrics.md:169hotplex.cron.delivery.result 实际携带 {status, platform} 双标签(delivery.go:114-115,127-129),文档仅描述了 status,遗漏 platform。用户无法按 platform 维度做告警分析。

  2. docs/explanation/cron-design.md:209 — 临时性错误写 "5xx" 过于宽泛。errors.gohttpStatusRe 仅匹配 500/502/503/504,其他 5xx(如 501、507)被视为永久错误。建议精确列出或注明范围。

  3. docs/explanation/cron-design.md:342 — 权衡 #5 更新后与 #4 表述重叠,未明确区分 execution retry(scheduleRetry 修改 nextRunAtMs)和 delivery retry(内存重试队列)。读者可能混淆两类机制。

P3 — 建议

  1. docs/explanation/cron-design.md:209 — 优雅关闭时 flushPending 有 60s 共享超时预算(delivery.go:252-259),文档未提及此关闭延迟来源。

  2. docs/reference/metrics.md:169delivery.result 仅在 attempt > 1 时记录(delivery.go:124-125),首次投递成功不产生指标。建议注明"仅记录重试后的结果"。

  3. docs/reference/metrics.md:166hotplex.cron.errors 描述误写为"执行时长"(copy-paste 错误),实际应为"执行错误数,label: job_name, error_type"。非本 PR 引入,但建议顺手修正。

@aaronwong1989 aaronwong1989 dismissed hotplex-ai’s stale review June 8, 2026 22:43

Admin override: docs-only PR, merging into #695 batch

@aaronwong1989

Copy link
Copy Markdown
Collaborator Author

并入 PR #695,已 cherry-pick 至 batch 分支。

hotplex-ai pushed a commit that referenced this pull request Jun 8, 2026
* refactor(messaging/stt): extract trackPID/untrackPID helpers to eliminate cleanup duplication

- Extract trackPID() and untrackPID() from 3 inline copy-paste sites
- start(), shutdownProcess(), kill() now call shared helpers
- Eliminates risk of new cleanup paths forgetting PID tracking

Fixes #504

* refactor(worker/base): extract withProc/withProcCode helpers for Terminate/Kill/Wait lock pattern

- Extract withProc (error) and withProcCode (int, error) to eliminate
  repeated lock-get-nil-unlock pattern across 3 methods
- Terminate/Kill use withProc, Wait uses withProcCode
- Reduces 18 Lock/Unlock calls to 6, single source of truth for lock protocol

Fixes #511

* refactor(session/store): remove Compact from Store interface (ISP) + extract Upsert JSON helpers

- Remove Compact() from Store interface — it was never called via the
  interface; SQLiteStore retains it as a concrete method
- Remove pgStore Compact no-op
- Remove mockStore Compact stub
- Extract marshalSessionJSON() and upsertTimeout() to eliminate Upsert
  JSON serialization duplication between SQLite and PG backends

Fixes #508

* refactor(admin): extract apiKeyStoreBase to eliminate dual-store CRUD duplication

- Add apiKeyStoreBase with shared list/get/delete/Invalidator/SetInvalidator
- Add prepareQuery() for dialect rebind and ensureAPIKey() for key generation
- Add withLock func field to abstract write serialization (writeMu vs identity)
- SQLite apiKeyUserStore embeds base, keeps create/update (LastInsertId + app timestamps)
- PG pgStore embeds base, keeps create/update (RETURNING id + NOW())
- Eliminates ~50 lines of copy-pasted CRUD logic between backends

Fixes #516

* fix(admin): address P3 review findings from PR #695

P3-1: Remove dead Compact() from e2e mockStore (interface removed in 4623dbb)
P3-2: Remove dead writeMu field from apiKeyUserStore (absorbed into apiKeyStoreBase.withLock)
P3-3: Wrap PG create/update in withLock for consistency with delete

* docs(patrol): add cron delivery retry mechanism and metric (#694)

Closes #693

* refactor: DRY architecture follow-up improvements

admin/apikey: replace rebind/withLock function fields with dbutil.Dialect
and *sqlutil.WriteMu — eliminates nil-sentinel checks and uses existing
dialect-aware types from the codebase.

gateway/tests: remove dead Compact() and DeleteExpiredEvents() from
ctrl_test.go and conn_test.go mock stores (removed from Store interface
in 4623dbb).

messaging/tts: extract trackPID/untrackPID helpers on MossProcess,
consistent with stt.PersistentSTT pattern established in b7bb9871.

* style: gofmt

* refactor: inline prepareQuery, extract PID helpers, generify withProc

P3 code review optimizations:
- admin: remove prepareQuery wrapper, use dialect.Rebind() directly
  (consistent with session/cron/eventstore stores)
- proc: extract TrackPID/UntrackPID shared helpers from stt/tts/manager
- base: unify withProc/withProcCode into generic withProcResult[T]

Net: 6 files, -29 lines

---------

Co-authored-by: Sisyphus 🏔️ <sisyphus@mountain.forever>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

docs(patrol): 2026-06-09 文档维护

2 participants