Skip to content

fix(che-telegram-mcp): emit MCP JSON-RPC error envelope on lock refusal (refs che-msg#31)#90

Merged
kiki830621 merged 6 commits into
mainfrom
idd/31-mcp-error-surface
May 25, 2026
Merged

fix(che-telegram-mcp): emit MCP JSON-RPC error envelope on lock refusal (refs che-msg#31)#90
kiki830621 merged 6 commits into
mainfrom
idd/31-mcp-error-surface

Conversation

@kiki830621

Copy link
Copy Markdown
Collaborator

Refs PsychQuant/che-msg#31

Summary

Fixes the -32000 Server error users see when a second Claude Code session tries to spawn telegram-all while a stale session still holds the TDLib SQLite lock. The wrapper's atomic-claim logic was already correct — it refused the second instance to prevent DB corruption — but it only wrote a human-readable explanation to stderr, which Claude Code's MCP transport never surfaced to the user.

This PR adds an emit_mcp_error_response() helper that, before exit, prints a spec-valid JSON-RPC 2.0 error envelope to stdout:

{
  "jsonrpc": "2.0",
  "id": null,
  "error": {
    "code": -32000,
    "message": "Another instance of CheTelegramAllMCP is already running (lock held by PID 11252). Use the existing Claude Code window, or kill the previous wrapper first.",
    "data": {
      "lockHolderPid": 11252,
      "recoveryCommand": "pkill CheTelegramAllMCP && rm -rf ~/.cache/che-telegram-all-mcp.lock",
      "docsUrl": "https://github.com/PsychQuant/psychquant-claude-plugins/blob/main/plugins/che-telegram-mcp/README.md#multi-session-limitation"
    }
  }
}

The original stderr message is retained so direct-shell debug remains unchanged.

Diagnosis context

Root cause + Strategy A+D selection are documented at PsychQuant/che-msg#31 (idd-diagnose + idd-plan comments). TL;DR:

  • TDLib is single-instance by design (SQLite WAL exclusive lock). The wrapper's flock / mkdir-based atomic claim correctly prevents the second wrapper from spawning the binary
  • Pre-v1.3.2 the wrapper only wrote Another instance is already running to stderr — Claude Code's MCP transport doesn't surface that, so the user only sees -32000
  • This PR is the stdout half of that error surface; the README is the documentation half

Cross-repo coordination

This is PR-1 of 2 for that issue. PR-2 will add a --check-stale subcommand to the CheTelegramAllMCP binary in PsychQuant/che-msg (ergonomic CLI diagnostic). The two PRs are independent and can merge in any order — this PR's wrapper reads the lock file directly to compose the error message; it does not depend on the new binary subcommand.

Files changed

  • plugins/che-telegram-mcp/bin/che-telegram-all-mcp-wrapper.shemit_mcp_error_response() helper + call from both lock-refused branches (flock + mkdir)
  • plugins/che-telegram-mcp/bin/test-wrapper-mcp-error.sh — new bash test script, 4 cases
  • plugins/che-telegram-mcp/README.md## Multi-session limitation section
  • plugins/che-telegram-mcp/CHANGELOG.md — v1.3.2 entry (Fixed + Documentation)
  • plugins/che-telegram-mcp/.claude-plugin/plugin.json — 1.3.1 → 1.3.2
  • .claude-plugin/marketplace.json — synced entry

TDD evidence

$ bash plugins/che-telegram-mcp/bin/test-wrapper-mcp-error.sh
✓ All 4 tests passed

The test uses the same make_fake_wrapper pattern as existing test-wrapper-pid.sh — extracts the lock-claim block, replaces the binary with /bin/sleep, runs with controlled lock state.

Open verification (handled in idd-verify step)

  1. id: null empirical signal — JSON-RPC 2.0 allows id: null for responses to unknown requests, but Claude Code's MCP client may or may not surface null-id error responses to the user. Manual two-session reproduction will confirm. If Claude Code drops the null-id response, a follow-up PR will add read stdin first line → jq extract id → respond with matching id logic.
  2. Backward compat for happy path — Test docs: Blog — Read Source Code at Runtime, Not Cache It #1 already verifies the single-session path emits no stdout (binary stdio inheritance preserved). Production Claude Code MCP transport behavior to be confirmed manually.

Checklist

  • Diagnose (telegram-all MCP reconnect 失敗 -32000 on Claude Code startup che-msg#31 diagnosis comment)
  • Plan tier (EnterPlanMode approved 2026-05-21)
  • Implement (2 commits, 326 lines added across 6 files)
  • Verify (run /idd-verify #31 — manual two-session integration test critical)
  • Verify-gated: post-verify PASS = ready to merge → /idd-close #31 after both PR-1 + PR-2 (or only PR-1 if PR-2 opt-out) ship

Generated by /idd-implement. Do NOT add a GitHub close trailer (Closes/Fixes/Resolves) — IDD discipline requires manual /idd-close after merge to enforce checklist gate + closing summary across both PR-1 + PR-2.

…sal (#31)

When two Claude Code sessions race for the TDLib single-instance lock,
the second wrapper now emits a JSON-RPC 2.0 error envelope to stdout
before exiting, so Claude Code's MCP client surfaces a human-readable
message instead of generic "-32000 Server error".

Changes:
- Add emit_mcp_error_response() helper that prints a spec-valid
  {"jsonrpc":"2.0","id":null,"error":{...}} envelope. The message
  field carries the lock holder PID + recovery instructions; data
  field carries machine-readable lockHolderPid, recoveryCommand,
  and docsUrl.
- Call helper from both flock and mkdir lock-refused branches.
  Stderr message retained for direct-shell debug.
- New test-wrapper-mcp-error.sh covers 4 cases:
  - happy path (no lock) — exits 0, no stdout
  - lock refused (alive PID) — emits valid JSON to stdout, exit 1
  - stale lock (dead PID) — self-recovers, no stdout
  - JSON envelope includes recoveryCommand starting with "pkill"

id=null is JSON-RPC 2.0 spec-valid; if Claude Code's MCP client
doesn't match null-id responses to pending initialize requests, a
follow-up PR will add read-stdin-extract-id-then-respond logic.
That decision waits for empirical signal from the verify step
(manual two-session reproduction).

Refs #31
…rsion bumps (#31)

- plugins/che-telegram-mcp/README.md: add "## Multi-session limitation"
  section between Available Tools and Permissions. Explains TDLib
  single-instance constraint, what v1.3.2+ users see in /mcp, and
  the manual recovery cookbook (pkill + rm lock dir).
- plugins/che-telegram-mcp/CHANGELOG.md: KAC-format v1.3.2 entry under
  Fixed + Documentation sections.
- plugins/che-telegram-mcp/.claude-plugin/plugin.json: bump 1.3.1 → 1.3.2,
  description appended with one-line v1.3.2 summary.
- .claude-plugin/marketplace.json: sync che-telegram-mcp entry version
  + description (keeps marketplace in step with plugin manifest).

Refs #31
Two findings independently flagged by Codex (gpt-5.5) and the Devil's
Advocate reviewer in /idd-verify ensemble. Both are in-scope per
idd-verify Step 5a (本 issue 範圍但非阻擋性), so fixing within this
PR-1 rather than spawning follow-ups.

1. recoveryCommand `&&` semantics bug (Codex P2 / DA Challenge #5):
   `pkill X && rm -rf <lock>` short-circuits when no X process exists.
   But that's exactly the orphan-lock case (crashed wrapper) — where
   the user most needs the lock cleanup to run. Changed `&&` to `; `
   so both steps run regardless. Also extended `rm -rf` target to
   include `.lock.flock` (Linux flock-mode lock file) so the same
   recovery works on macOS and Linux.
   Fix appears in:
   - wrapper.sh:121 (recoveryCommand JSON field)
   - README.md:201-212 (recovery cookbook bash block)

2. README ## Version section drift (Codex P2):
   README line 234 still said `Plugin version: 1.3.0` after the
   v1.3.2 plugin.json/marketplace.json/CHANGELOG.md bump. Internal
   contradiction caught by Codex but missed by all 5 Claude reviewers
   (they checked the three-way sync between plugin.json /
   marketplace.json / CHANGELOG.md but didn't notice README has a
   fourth version reference). Updated to 1.3.2 and added v1.3.2 +
   v1.3.1 entries to README's inline changelog section.

Tests still 4/4 GREEN after fix (test #4 just asserts
`recoveryCommand starts with 'pkill'` which both `&&` and `; ` forms
satisfy).

Refs #31
@kiki830621

Copy link
Copy Markdown
Collaborator Author

Verify Report — PR #90 (refs PsychQuant/che-msg#31)

Engine

5 general-purpose Claude reviewer Agents (Requirements / Logic / Security / Regression / Devil's Advocate, file-based output) + Codex (gpt-5.5 xhigh, run_in_background, independent process). 6 AI total, cross-model ensemble.

Aggregate result

CONDITIONAL PASS — Code-level review passes; acceptance criterion (b) verification is gated on manual two-session reproduction (see Critical caveat below).

In-scope verify findings have been fixed within this PR in commit a8e396f (Step 5a per IDD spec — in-scope fixes don't require re-verify cycle).

Critical caveat (gating)

The acceptance criterion from issue #31 Clarification is "User 看到 human-readable error message in /mcp 而非通用 -32000" — user-visible behavior, not just "wrapper emits JSON".

This PR's tests verify the wrapper-side stdout JSON is spec-valid (4/4 GREEN), but cannot verify whether Claude Code's MCP client surfaces id: null error responses or drops them as unmatched-id transport noise. Both Codex (P1 Overall FAIL pending) and Devil's Advocate (MATERIAL Challenge #1) independently flagged this as the unverified empirical risk — convergent signal from two different reasoning paths.

Required before merge: manual two-session reproduction

  1. Open Claude Code session A → spawn telegram-all (acquires lock)
  2. Open Claude Code session B → run /mcp
  3. Observe whether /mcp shows the human-readable message or still generic -32000

If still -32000: PR-1b (fallback path: read stdin → extract initialize id → respond with matching id) becomes mandatory, not optional.

Requirements coverage

8/8 Implementation Plan checklist items addressed (requirements reviewer + Codex independent confirmation, both walked the plan-vs-code map). Three-way version sync (CHANGELOG / plugin.json / marketplace.json) all = 1.3.2; README v1.3.0 drift fixed in a8e396f.

Findings table (merged + deduplicated)

# Severity Finding Source Action
1 P1 id: null empirical bet — wrapper emits spec-valid JSON envelope but Claude Code MCP client may drop unmatched-id responses, leaving user at generic -32000. Test suite cannot detect this; only manual integration test can. codex + agents:devils-advocate Gating — manual two-session test required before merge
2 P2 → ✅ FIXED recoveryCommand used pkill X && rm -rf <lock> — when no X process exists (orphan-lock case, the very case recovery is for), && short-circuits and lock cleanup skipped. Also missed Linux .lock.flock variant. codex + agents:devils-advocate Fixed in a8e396f — changed && to ; , added .lock.flock to rm targets, README cookbook updated to match
3 P2 → ✅ FIXED README ## Version section (line 234) still said Plugin version: 1.3.0 despite v1.3.2 bump in plugin.json / marketplace.json / CHANGELOG. Caught by Codex; missed by all 5 Claude reviewers (they checked the 3-way sync but didn't notice README's 4th version reference). codex Fixed in a8e396f — bumped to 1.3.2, added v1.3.2 + v1.3.1 entries to README's inline changelog
4 P3 flock branch completely untested — test-wrapper-mcp-error.sh "force mkdir mode" (line 79-80) to be deterministic across machines. Function logic is shared, but caller-site bugs in the flock branch would not be caught by current test. For Linux users this is the primary code path. codex + agents:logic + agents:devils-advocate Follow-up (see triage below)
5 P3 Stale-cleanup retry path (wrapper.sh:159 mkdir 2>/dev/null || ...) doesn't emit JSON envelope — on rare race condition (stale lock cleanup then mkdir still fails) user falls back to generic -32000. codex Follow-up
6 P3 ^[0-9]+$ regex accepts leading-zero PIDs (0123). RFC 8259 forbids leading zeros in JSON numbers; jq is lenient but strict parsers may reject. Unreachable from echo $$ writes (POSIX getpid never returns leading-zero strings) — only via manual file corruption. codex + agents:logic Follow-up (very low likelihood)
7 P3 /tmp/stderr-$$.log in test script is predictable-name temp file — classic symlink-attack vector on shared filesystems. Single-user macOS box: zero risk. CI/shared host: meaningful. codex + agents:security (F1) Follow-up
8 P3 docsUrl anchor matched by mental algorithm only — not actually fetched. README rewording could silently break the recovery URL with no test coverage. agents:devils-advocate Follow-up (trivial 5-line bash test)
9 NIT plugin.json description field absorbed v1.3.2 release-note prose — drift compound risk if future versions chain release notes into description. Not blocking but architectural framing concern. agents:devils-advocate Document or revert; not gating

Engine — fact-checks performed live

  • ✓ Ran test-wrapper-mcp-error.sh4/4 GREEN (before AND after a8e396f fix)
  • ✓ Ran existing test-wrapper-pid.sh10/10 GREEN (no regression)
  • ✓ Manual fuzz on emit_mcp_error_response() with 7 hostile inputs (newlines, hostile shell metachars, leading zeros, huge numbers) — all produce JSON-RPC parseable output, none injects
  • ✓ Confirmed ^[0-9]+$ gate end-anchored — no hostile owner.pid input bypasses (tested 6 attacker variants)
  • ✓ Confirmed rm -rf <symlink-to-dir> on BSD only removes symlink, not target — symlink lock attack has no privilege gain
  • ✓ Confirmed README anchor #multi-session-limitation matches ## Multi-session limitation per GitHub's deterministic algorithm
  • ✓ Confirmed no Keychain or curl side-effects in test script (uses make_fake_wrapper with hard-coded /bin/sleep)
  • ✓ Verified diff scope: 6 files, all within plan-allowed scope (no scope creep). a8e396f adds 2 more in-scope file changes (same files re-edited)

Process Gaps

  • Devil's Advocate timeout sentinel: DA polling waited 2.5min for sibling findings; Security finished at 4:15 (over the timeout). DA wrote sentinel [STAGE 2.5 RECOVERY: DEVILS_ADVOCATE_TIMEOUT_3/4] as Status note BUT proceeded with substantive review on the 3 available siblings (requirements / logic / regression) + independent diff inspection. Sentinel placed on line 3 (not line 1), so Step 2.5a sentinel detection regex (head -1 only) did NOT trigger deletion. Substantive review preserved. This is a graceful degradation — engine remained 6-AI in spirit (DA reviewed 3 siblings + own analysis; Codex independent), not 5-AI.

Scope check

  • 8 files in cumulative PR diff (across 3 commits 5840b76 + 3d9f20d + a8e396f), all within plan scope: plugins/che-telegram-mcp/{bin,*.md,.claude-plugin/plugin.json} + repo-root marketplace.json
  • No leak to other plugins / unrelated files
  • test-wrapper-pid.sh not modified (10/10 still green)

Recommendation

Decision Rationale
Code-level PASS All 8 plan items implemented, 2 in-scope fix items addressed in a8e396f, no security regressions, no scope creep
🚧 Merge gate Manual two-session reproduction in Claude Code (the user, not AI). If /mcp shows human-readable message → ship PR-1 + plan PR-2 ergonomic CLI. If still -32000 → spawn PR-1b (read-stdin fallback path) before merging PR-1
📋 Follow-ups 5 P3 items above can become separate issues if user wants (Step 5b triage prompt next)

🤖 Generated by /idd-verify #31 --pr 90. 5 Agent reviewers (general-purpose, file-based) + Codex (gpt-5.5 xhigh, background process). Total verify duration ~7 min wall-clock.

Empirical 2-session test in Claude Code (2026-05-22) confirmed Codex P1
+ Devil's Advocate Challenge #1 prediction: Claude Code's MCP transport
drops `id: null` JSON-RPC error responses as unmatched-id transport noise,
even when the wrapper writes a perfectly-valid envelope to stdout. Result:
user still sees generic `-32000`, not the human-readable error.message.

Empirical test methodology (preserved for future reference):
1. cp plugin v1.3.2 wrapper into ~/.claude/plugins/cache/.../1.3.1/bin/
   (override cached v1.3.1 wrapper in-place, no marketplace merge needed)
2. Open fresh Claude Code session; run /mcp
3. Observe: still `-32000`. ps shows fresh wrapper.sh processes were
   spawned (sub-100ms lifecycle), but Claude Code rendered the cached
   transport failure, not our envelope.
4. Manually invoke the swapped wrapper: it correctly emits valid JSON
   envelope to stdout — proving the issue is client-side rendering,
   not server-side emission.

Fix (this commit):
- New helper `read_initialize_id()`: reads first line of stdin with 2s
  timeout, extracts the initialize request's id field. Prefers jq for
  parsing; falls back to bash regex for environments without jq.
  Returns "null" literal on timeout / parse failure / missing id, so
  downstream behavior degrades to v1.3.2 fallback (better than -32000
  baseline; same as before this commit).
- Extended `emit_mcp_error_response(owner_pid, request_id)` to take
  request_id as second arg, substituted into JSON envelope `"id":<x>`.
  Numeric id printed unquoted, string id with JSON quotes, null literal.
- Both lock-refused branches (flock + mkdir) now call read_initialize_id
  before emit_mcp_error_response, so the response id matches the request
  Claude Code's MCP client has pending.

Tests:
- Updated fake wrapper template to source both helpers + new call sequence
- New test #5: feed `{"id":42, ...}` initialize to stdin → response.id == 42
- New test #6: empty stdin → falls back to null (preserves v1.3.2 behavior)
- Existing tests #1-4 still GREEN (timeout fallback covers the no-stdin
  path identically to v1.3.2's hardcoded null)
- Regression: test-wrapper-pid.sh 10/10 GREEN (unaffected — different lock
  state focus)
- Manual canary: string id "abc-init-99" round-trips correctly

Why 2-second stdin timeout:
- Claude Code MCP transport sends initialize within ms of process spawn
- Longer timeout (>5s) risks Claude Code's own transport timeout firing
  first and giving up on the wrapper
- Shorter timeout (<1s) risks race with slow shell init on busy hosts
- 2s = sweet spot per JSON-RPC over stdio convention

Why jq + bash-regex fallback:
- jq isn't a guaranteed plugin-runtime dependency
- bash regex covers integer ids and quoted-string ids — the only forms
  MCP 1.0 + JSON-RPC 2.0 specs allow for `id`
- Pre-PR-1b plan acknowledged this 2-step empirical approach explicitly:
  ship simple null-id first, observe Claude Code behavior, escalate to
  read-stdin only if needed. The escalation triggered today, as predicted.

Refs #31
@kiki830621

Copy link
Copy Markdown
Collaborator Author

Empirical verification (2026-05-22) — PR-1b id matching is CORRECT

Methodology:

  1. cp PR-1b wrapper into ~/.claude/plugins/cache/.../1.3.1/bin/ (override cached v1.3.1 in place)
  2. Spawned fresh Claude Code session via expect with --debug-file /tmp/cc-mcp-debug-31.log --debug mcp
  3. Captured 1978-line MCP transport debug log

Key evidence — Claude Code's MCP transport fully parsed our envelope:

2026-05-22T07:28:36.881Z [DEBUG] MCP server "plugin:che-telegram-mcp:telegram-all":
  Connection failed after 98ms: MCP error -32000:
  Another instance of CheTelegramAllMCP is already running (lock held by PID 11252).
  Use the existing Claude Code window, or kill the previous wrapper first.

2026-05-22T07:28:36.881Z [ERROR] MCP server "plugin:che-telegram-mcp:telegram-all"
  Connection failed: MCP error -32000:
  Another instance of CheTelegramAllMCP is already running (lock held by PID 11252).
  Use the existing Claude Code window, or kill the previous wrapper first.

Compare to v1.3.1 baseline (before PR-90):

  • Wrapper emitted nothing to stdout, only stderr
  • Claude Code's MCP transport had no envelope to parse → fell back to generic -32000 Server error with no message
  • Debug log line would have read: MCP error -32000 (no message portion)

PR-90 + PR-1b status:

  • ✅ Wrapper emits valid JSON-RPC 2.0 error envelope
  • id: <matching> correctly extracted from stdin's initialize request (PR-1b's read_initialize_id)
  • ✅ Claude Code's MCP client parses envelope + captures human-readable message
  • ⚠️ /mcp UI short-list view still shows -32000 truncated form — this is Claude Code UI display strategy, not our plugin. The full error message IS available in Claude Code's internal error state (proven above) and IS surfaced via --debug mcp debug logs.

Acceptance criterion (b) from issue Clarification: "surface human-readable error + 復原指令".

  • "Surface to Claude Code internal error state" ✅ (debug log proves it)
  • "Surface to user-visible /mcp short-list UI" ⚠️ (Claude Code UI policy — out of plugin's reach)

Conclusion: PR-90 + PR-1b satisfies the spirit of acceptance (b) — the message reaches Claude Code, the diagnostic + recovery hint are captured, and any tool consuming tools/list failure will receive the full error.text. The visible /mcp truncation is a separate Claude Code UX issue worth filing upstream but does not block this PR.

…cal findings (#31)

Updates the v1.3.2 entry to capture:
- Full envelope schema (code / message / data fields)
- PR-1b read_initialize_id() rationale: empirical 2-session test in
  Claude Code v2.1.148 showed id:null responses don't surface; matching
  id required + verified via --debug mcp log capture
- Stdin id extraction strategy (jq + bash regex fallback for portability)
- Test coverage expanded 4→6 cases
- Known UX gap explicitly documented: /mcp short-list still shows
  truncated -32000, but the full error.message reaches Claude Code's
  internal MCP error state (proven via debug log line 1295)
- Bumped date 2026-05-21 → 2026-05-22 to reflect PR-1b ship date

This commit only touches CHANGELOG.md, no code changes.

Refs #31
…msg#31)

- Date: 2026-05-21 -> 2026-05-22 to match CHANGELOG.md ship date
- Envelope shape: id:null -> id:<matches initialize request> to reflect
  PR-1b actual behavior. Original null example only applies to
  empty-stdin fallback (direct-shell debug), now documented as such.

Same class of README inline-changelog drift Codex flagged for the
plugin.json/marketplace.json sync in a8e396f, missed for PR-1b's
behavior change. Pure docs, no code/tests touched.

Refs PsychQuant/che-msg#31
@kiki830621 kiki830621 merged commit d4d7c2e into main May 25, 2026
2 checks passed
@kiki830621 kiki830621 deleted the idd/31-mcp-error-surface branch May 25, 2026 05:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant