fix(che-telegram-mcp): emit MCP JSON-RPC error envelope on lock refusal (refs che-msg#31) #90
…sal (#31)

When two Claude Code sessions race for the TDLib single-instance lock, the second wrapper now emits a JSON-RPC 2.0 error envelope to stdout before exiting, so Claude Code's MCP client surfaces a human-readable message instead of the generic "-32000 Server error".

Changes:
- Add emit_mcp_error_response() helper that prints a spec-valid {"jsonrpc":"2.0","id":null,"error":{...}} envelope. The message field carries the lock holder PID and recovery instructions; the data field carries machine-readable lockHolderPid, recoveryCommand, and docsUrl.
- Call the helper from both the flock and mkdir lock-refused branches. The stderr message is retained for direct-shell debugging.
- New test-wrapper-mcp-error.sh covers 4 cases:
  - happy path (no lock): exits 0, no stdout
  - lock refused (alive PID): emits valid JSON to stdout, exit 1
  - stale lock (dead PID): self-recovers, no stdout
  - JSON envelope includes recoveryCommand starting with "pkill"

id=null is JSON-RPC 2.0 spec-valid; if Claude Code's MCP client doesn't match null-id responses to pending initialize requests, a follow-up PR will add read-stdin-extract-id-then-respond logic. That decision waits for empirical signal from the verify step (manual two-session reproduction).

Refs #31
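To make the envelope shape concrete, here is a minimal sketch of what a helper like this could look like. It is an illustration under stated assumptions, not the wrapper's actual code: the shortened message text, the numeric-PID guard, and the `<lock-dir>` placeholder in the recovery string are all hypothetical, and `docsUrl` is omitted for brevity.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: the real emit_mcp_error_response() may differ.

emit_mcp_error_response() {
  local owner_pid="$1"
  local pid_json="null"

  # Assumption: only a purely numeric PID is emitted as a JSON number;
  # anything else degrades to null so the envelope stays valid JSON.
  case "$owner_pid" in
    ''|*[!0-9]*) owner_pid="unknown" ;;
    *) pid_json="$owner_pid" ;;
  esac

  # Placeholder recovery string; the real command names the actual lock path.
  local recovery='pkill CheTelegramAllMCP; rm -rf <lock-dir>'

  # A single line on stdout: MCP's stdio transport is newline-delimited JSON.
  printf '{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Another instance of CheTelegramAllMCP is already running (lock held by PID %s).","data":{"lockHolderPid":%s,"recoveryCommand":"%s"}}}\n' \
    "$owner_pid" "$pid_json" "$recovery"
}
```

In a lock-refused branch, the call would be something like `emit_mcp_error_response "$owner_pid"` followed by `exit 1`, with the existing stderr message left in place.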
…rsion bumps (#31)

- plugins/che-telegram-mcp/README.md: add "## Multi-session limitation" section between Available Tools and Permissions. Explains the TDLib single-instance constraint, what v1.3.2+ users see in /mcp, and the manual recovery cookbook (pkill + rm lock dir).
- plugins/che-telegram-mcp/CHANGELOG.md: KAC-format v1.3.2 entry under Fixed + Documentation sections.
- plugins/che-telegram-mcp/.claude-plugin/plugin.json: bump 1.3.1 → 1.3.2; description appended with a one-line v1.3.2 summary.
- .claude-plugin/marketplace.json: sync che-telegram-mcp entry version + description (keeps marketplace in step with plugin manifest).

Refs #31
Two findings were independently flagged by Codex (gpt-5.5) and the Devil's Advocate reviewer in the /idd-verify ensemble. Both are in-scope per idd-verify Step 5a (within this issue's scope but non-blocking), so they are fixed within this PR-1 rather than spawned as follow-ups.

1. recoveryCommand `&&` semantics bug (Codex P2 / DA Challenge #5): `pkill X && rm -rf <lock>` short-circuits when no X process exists. But that's exactly the orphan-lock case (crashed wrapper), where the user most needs the lock cleanup to run. Changed `&&` to `; ` so both steps run regardless. Also extended the `rm -rf` target to include `.lock.flock` (Linux flock-mode lock file) so the same recovery works on macOS and Linux. Fix appears in:
   - wrapper.sh:121 (recoveryCommand JSON field)
   - README.md:201-212 (recovery cookbook bash block)

2. README ## Version section drift (Codex P2): README line 234 still said `Plugin version: 1.3.0` after the v1.3.2 plugin.json/marketplace.json/CHANGELOG.md bump. Internal contradiction caught by Codex but missed by all 5 Claude reviewers (they checked the three-way sync between plugin.json / marketplace.json / CHANGELOG.md but didn't notice README has a fourth version reference). Updated to 1.3.2 and added v1.3.2 + v1.3.1 entries to README's inline changelog section.

Tests still 4/4 GREEN after fix (test #4 just asserts `recoveryCommand starts with 'pkill'`, which both the `&&` and `; ` forms satisfy).

Refs #31
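The short-circuit behaviour behind finding 1 can be reproduced without touching any real process. In the sketch below, `false` stands in for a `pkill` that matched nothing (pkill exits non-zero in that case), and a throwaway temp directory stands in for the lock; the function name `recover_demo` is hypothetical.

```bash
#!/usr/bin/env bash
# Illustrative demo: `false` stands in for a pkill that matched no process.

recover_demo() {
  local mode="$1" lock_dir
  lock_dir="$(mktemp -d)"   # throwaway directory standing in for the lock

  if [ "$mode" = "and" ]; then
    false && rm -rf "$lock_dir"   # rm is skipped because the left side failed
  else
    false; rm -rf "$lock_dir"     # rm runs regardless of the left side
  fi

  if [ -d "$lock_dir" ]; then
    echo "lock still present"
    rm -rf "$lock_dir"            # tidy up the demo directory
  else
    echo "lock removed"
  fi
}
```

Calling `recover_demo and` reports that the lock is still present, while `recover_demo semicolon` reports it removed, which is the orphan-lock difference the commit describes.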
Verify Report: PR #90 (refs PsychQuant/che-msg#31)

Engine
5 general-purpose Claude reviewer Agents (Requirements / Logic / Security / Regression / Devil's Advocate, file-based output) + Codex (gpt-5.5 xhigh, run_in_background, independent process). 6 AI total, cross-model ensemble.

Aggregate result
CONDITIONAL PASS: Code-level review passes; acceptance criterion (b) verification is gated on manual two-session reproduction (see Critical caveat below). In-scope verify findings have been fixed within this PR in commit …

Critical caveat (gating)
The acceptance criterion from issue #31 Clarification is "User sees human-readable error message in …

This PR's tests verify the wrapper-side stdout JSON is spec-valid (4/4 GREEN), but cannot verify whether Claude Code's MCP client surfaces …

Required before merge: manual two-session reproduction

If still …

Requirements coverage
8/8 Implementation Plan checklist items addressed (requirements reviewer + Codex independent confirmation, both walked the plan-vs-code map). Three-way version sync (CHANGELOG / plugin.json / marketplace.json) all = 1.3.2; README v1.3.0 drift fixed in …

Findings table (merged + deduplicated)
Engine — fact-checks performed live
Process Gaps
Scope check
Recommendation
Empirical 2-session test in Claude Code (2026-05-22) confirmed the Codex P1 + Devil's Advocate Challenge #1 prediction: Claude Code's MCP transport drops `id: null` JSON-RPC error responses as unmatched-id transport noise, even when the wrapper writes a perfectly valid envelope to stdout. Result: the user still sees the generic `-32000`, not the human-readable error.message.

Empirical test methodology (preserved for future reference):
1. cp plugin v1.3.2 wrapper into ~/.claude/plugins/cache/.../1.3.1/bin/ (override cached v1.3.1 wrapper in-place, no marketplace merge needed)
2. Open fresh Claude Code session; run /mcp
3. Observe: still `-32000`. ps shows fresh wrapper.sh processes were spawned (sub-100ms lifecycle), but Claude Code rendered the cached transport failure, not our envelope.
4. Manually invoke the swapped wrapper: it correctly emits a valid JSON envelope to stdout, proving the issue is client-side rendering, not server-side emission.

Fix (this commit):
- New helper `read_initialize_id()`: reads the first line of stdin with a 2s timeout and extracts the initialize request's id field. Prefers jq for parsing; falls back to bash regex for environments without jq. Returns the "null" literal on timeout / parse failure / missing id, so downstream behavior degrades to the v1.3.2 fallback (better than the -32000 baseline; same as before this commit).
- Extended `emit_mcp_error_response(owner_pid, request_id)` to take request_id as a second arg, substituted into the JSON envelope as `"id":<x>`. Numeric id printed unquoted, string id with JSON quotes, null as a literal.
- Both lock-refused branches (flock + mkdir) now call read_initialize_id before emit_mcp_error_response, so the response id matches the request Claude Code's MCP client has pending.

Tests:
- Updated fake wrapper template to source both helpers + new call sequence
- New test #5: feed `{"id":42, ...}` initialize to stdin → response.id == 42
- New test #6: empty stdin → falls back to null (preserves v1.3.2 behavior)
- Existing tests #1-4 still GREEN (timeout fallback covers the no-stdin path identically to v1.3.2's hardcoded null)
- Regression: test-wrapper-pid.sh 10/10 GREEN (unaffected: different lock state focus)
- Manual canary: string id "abc-init-99" round-trips correctly

Why 2-second stdin timeout:
- Claude Code MCP transport sends initialize within ms of process spawn
- Longer timeout (>5s) risks Claude Code's own transport timeout firing first and giving up on the wrapper
- Shorter timeout (<1s) risks a race with slow shell init on busy hosts
- 2s = sweet spot per JSON-RPC over stdio convention

Why jq + bash-regex fallback:
- jq isn't a guaranteed plugin-runtime dependency
- bash regex covers integer ids and quoted-string ids, the only forms MCP 1.0 + JSON-RPC 2.0 specs allow for `id`
- Pre-PR-1b plan acknowledged this 2-step empirical approach explicitly: ship simple null-id first, observe Claude Code behavior, escalate to read-stdin only if needed. The escalation triggered today, as predicted.

Refs #31
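The id-extraction strategy described in this commit can be sketched as follows. This is a hedged illustration, not the shipped helper: the jq filter, the two regexes, and the decision to treat escaped or non-scalar ids as `null` are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: the real read_initialize_id() may differ.

read_initialize_id() {
  local line="" id=""

  # Wait up to 2 seconds for the first request line; keep whatever arrived.
  IFS= read -r -t 2 line || true
  if [ -z "$line" ]; then
    printf '%s\n' "null"
    return 0
  fi

  # Preferred path: let jq parse the JSON; accept only number or string ids.
  if command -v jq >/dev/null 2>&1; then
    id="$(printf '%s' "$line" \
      | jq -c '.id | if type == "number" or type == "string" then . else null end' 2>/dev/null)" || id=""
  fi

  # Fallback: bash regex for integer ids and simple quoted-string ids.
  # Limitation: it matches the first "id" key found anywhere in the line.
  if [ -z "$id" ]; then
    local re_num='"id"[[:space:]]*:[[:space:]]*(-?[0-9]+)'
    local re_str='"id"[[:space:]]*:[[:space:]]*("[^"\\]*")'
    if [[ "$line" =~ $re_num ]]; then
      id="${BASH_REMATCH[1]}"
    elif [[ "$line" =~ $re_str ]]; then
      id="${BASH_REMATCH[1]}"
    else
      id="null"
    fi
  fi

  # Numbers come out bare, strings keep their JSON quotes, otherwise null.
  printf '%s\n' "$id"
}
```

Because the function prints the id already in JSON form, a caller can drop the result straight into the envelope, for example `request_id="$(read_initialize_id)"` followed by `emit_mcp_error_response "$owner_pid" "$request_id"`.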
Empirical verification (2026-05-22): PR-1b id matching is CORRECT

Methodology:

Key evidence: Claude Code's MCP transport fully parsed our envelope:

Compare to v1.3.1 baseline (before PR-90):
PR-90 + PR-1b status:
Acceptance criterion (b) from issue Clarification: "surface human-readable error + recovery command".
Conclusion: PR-90 + PR-1b satisfies the spirit of acceptance (b): the message reaches Claude Code, the diagnostic + recovery hint are captured, and any tool consuming …
…cal findings (#31)

Updates the v1.3.2 entry to capture:
- Full envelope schema (code / message / data fields)
- PR-1b read_initialize_id() rationale: empirical 2-session test in Claude Code v2.1.148 showed id:null responses don't surface; matching id required + verified via --debug mcp log capture
- Stdin id extraction strategy (jq + bash regex fallback for portability)
- Test coverage expanded 4→6 cases
- Known UX gap explicitly documented: /mcp short-list still shows truncated -32000, but the full error.message reaches Claude Code's internal MCP error state (proven via debug log line 1295)
- Bumped date 2026-05-21 → 2026-05-22 to reflect PR-1b ship date

This commit only touches CHANGELOG.md, no code changes.

Refs #31
…msg#31)

- Date: 2026-05-21 -> 2026-05-22 to match CHANGELOG.md ship date
- Envelope shape: id:null -> id:<matches initialize request> to reflect PR-1b actual behavior. The original null example only applies to the empty-stdin fallback (direct-shell debug), now documented as such.

Same class of README inline-changelog drift Codex flagged for the plugin.json/marketplace.json sync in a8e396f, missed for PR-1b's behavior change. Pure docs, no code/tests touched.

Refs PsychQuant/che-msg#31
Refs PsychQuant/che-msg#31
Summary
Fixes the `-32000` Server error users see when a second Claude Code session tries to spawn `telegram-all` while a stale session still holds the TDLib SQLite lock. The wrapper's atomic-claim logic was already correct (it refused the second instance to prevent DB corruption), but it only wrote a human-readable explanation to stderr, which Claude Code's MCP transport never surfaced to the user.

This PR adds an `emit_mcp_error_response()` helper that, before exit, prints a spec-valid JSON-RPC 2.0 error envelope to stdout:

    {
      "jsonrpc": "2.0",
      "id": null,
      "error": {
        "code": -32000,
        "message": "Another instance of CheTelegramAllMCP is already running (lock held by PID 11252). Use the existing Claude Code window, or kill the previous wrapper first.",
        "data": {
          "lockHolderPid": 11252,
          "recoveryCommand": "pkill CheTelegramAllMCP && rm -rf ~/.cache/che-telegram-all-mcp.lock",
          "docsUrl": "https://github.com/PsychQuant/psychquant-claude-plugins/blob/main/plugins/che-telegram-mcp/README.md#multi-session-limitation"
        }
      }
    }

The original stderr message is retained so direct-shell debugging remains unchanged.
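For readers unfamiliar with the atomic claim the summary refers to, a mkdir-based version with stale-lock recovery might look roughly like the sketch below. It is an assumption-laden illustration, not the wrapper's code: the function name `claim_lock`, the `pid` file inside the lock directory, and the retry-once policy are all hypothetical, and the flock branch is not shown.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: the pid-file name and layout are assumptions.

claim_lock() {
  local lock_dir="$1" owner_pid=""

  # mkdir either creates the directory or fails, so only one caller wins.
  if mkdir "$lock_dir" 2>/dev/null; then
    echo "$$" > "$lock_dir/pid"
    return 0
  fi

  owner_pid="$(cat "$lock_dir/pid" 2>/dev/null || true)"

  # kill -0 sends no signal; it only checks that the process can be signalled.
  if [ -n "$owner_pid" ] && kill -0 "$owner_pid" 2>/dev/null; then
    echo "$owner_pid"    # report the live holder so the caller can refuse
    return 1
  fi

  # Holder is gone: treat the lock as stale and try to claim it once more.
  rm -rf "$lock_dir"
  mkdir "$lock_dir" 2>/dev/null || return 1
  echo "$$" > "$lock_dir/pid"
  return 0
}
```

On refusal the function prints the holder's PID, which is the value an error envelope like the one above would carry in `lockHolderPid`. The window between `rm -rf` and the second `mkdir` is still racy; the second `mkdir` failing simply means another process claimed the lock first.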
Diagnosis context
Root cause + Strategy A+D selection are documented at PsychQuant/che-msg#31 (idd-diagnose + idd-plan comments). TL;DR:

- The flock/mkdir-based atomic claim correctly prevents the second wrapper from spawning the binary
- The wrapper writes `Another instance is already running` to stderr; Claude Code's MCP transport doesn't surface that, so the user only sees `-32000`

Cross-repo coordination
This is PR-1 of 2 for that issue. PR-2 will add a `--check-stale` subcommand to the `CheTelegramAllMCP` binary in `PsychQuant/che-msg` (ergonomic CLI diagnostic). The two PRs are independent and can merge in any order: this PR's wrapper reads the lock file directly to compose the error message; it does not depend on the new binary subcommand.

Files changed
- `plugins/che-telegram-mcp/bin/che-telegram-all-mcp-wrapper.sh`: `emit_mcp_error_response()` helper + call from both lock-refused branches (flock + mkdir)
- `plugins/che-telegram-mcp/bin/test-wrapper-mcp-error.sh`: new bash test script, 4 cases
- `plugins/che-telegram-mcp/README.md`: `## Multi-session limitation` section
- `plugins/che-telegram-mcp/CHANGELOG.md`: v1.3.2 entry (Fixed + Documentation)
- `plugins/che-telegram-mcp/.claude-plugin/plugin.json`: 1.3.1 → 1.3.2
- `.claude-plugin/marketplace.json`: synced entry

TDD evidence
The test uses the same `make_fake_wrapper` pattern as the existing `test-wrapper-pid.sh`: it extracts the lock-claim block, replaces the binary with `/bin/sleep`, and runs with controlled lock state.

Open verification (handled in idd-verify step)
`id: null` empirical signal: JSON-RPC 2.0 allows `id: null` in a response when the request's id could not be determined, but Claude Code's MCP client may or may not surface null-id error responses to the user. Manual two-session reproduction will confirm. If Claude Code drops the null-id response, a follow-up PR will add `read stdin first line → jq extract id → respond with matching id` logic.

Checklist
- `/idd-verify #31`: manual two-session integration test critical
- `/idd-close #31` after both PR-1 + PR-2 (or only PR-1 if PR-2 opt-out) ship

Generated by `/idd-implement`. Do NOT add a GitHub close trailer (Closes/Fixes/Resolves): IDD discipline requires manual `/idd-close` after merge to enforce the checklist gate + closing summary across both PR-1 + PR-2.