fix(transport): proactive reconnect — detect peer-closed socket before sending#11
Conversation
ECR17/Nexi terminals close the TCP socket between transactions. The drop
was detected only reactively: ensureConnected() saw isConnected()==true for
a half-open socket (the Kotlin reader thread had not observed EOF yet), so a
command was sent on a dead socket; the read then failed mid-exchange and the
money-safety RetryPolicy (correctly) refused to replay financial commands,
surfacing a FALSE "transport disconnected during exchange". Safe/idempotent
commands were replayed and succeeded, producing the observed "works once,
fails next" alternation.
Fix: make Kotlin isConnected() a synchronous non-destructive liveness probe
(socket.sendUrgentData) that detects a peer-closed/half-open socket BEFORE
the send, so the existing ensureConnected() reconnects proactively and every
command starts on a verified-live socket. The probe writes one TCP OOB byte
that throws on a closed peer but does not touch the data stream (cannot
corrupt an STX/ETX frame, does not race the reader thread).
Money-safety is unchanged: RetryPolicy (financial never blindly retried) and
sendLastResult ('G') recovery are untouched; only the FALSE drop from a stale
pre-send socket is removed. A genuine mid-exchange drop still surfaces.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e557443123
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR addresses intermittent transaction failures caused by ECR17/Nexi terminals closing the TCP socket between transactions by adding a proactive (pre-send) connection liveness check, so commands start on a verified-live socket rather than discovering the drop mid-exchange.
Changes:
- Android: enhance
isConnected()with a synchronous liveness probe (sendUrgentData) and add a drop-marking helper. - C++/iOS: clarify the intended “pre-send liveness check” behavior via comments.
- Docs: capture the lesson learned and the rationale for the proactive probe.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| package/android/src/main/java/com/margelo/nitro/ecr17/HybridEcr17Transport.kt | Adds proactive socket liveness probing and a dropped-connection marker to avoid sending on half-open sockets. |
| package/cpp/Ecr17Client/HybridEcr17Client.cpp | Documents that ensureConnected() now relies on a proactive liveness probe before sending. |
| package/ios/HybridEcr17Transport.swift | Documents iOS parity expectations/limitations for the pre-send liveness check. |
| docs/LESSON.md | Records the real-device failure mode, root cause, and the proactive detection approach. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Address reviewer feedback (Codex P2 + Copilot) on the liveness probe:
- Codex P2: socket.sendUrgentData writes a real TCP urgent byte; on a
terminal with SO_OOBINLINE that 0xFF lands inline before the next STX
frame and could corrupt a financial command. Replace it with a
non-destructive, WRITE-FREE probe: a 1-byte peek on a PushbackInputStream
(short soTimeout) — read()==-1 => peer closed; SocketTimeoutException =>
idle but alive; any byte read is unread() so a protocol byte is never
consumed or placed on the peer stream. The reader thread and the probe
share the input stream under one ioLock (reader uses a short read timeout
so it yields the lock between reads), so they never read concurrently.
- Copilot: a drop could fire onDisconnect twice (reader finally + probe).
Add a single-shot AtomicBoolean so a drop notifies exactly once across
both paths; both paths also close the socket so isConnected()'s isClosed
check is an immediate, reliable signal.
- Copilot: narrow exception handling — SocketTimeoutException means "alive",
only genuine I/O failures mark the socket dropped.
Money-safety unchanged: RetryPolicy and sendLastResult ('G') untouched;
only the FALSE drop from a stale pre-send socket is removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Copilot review: catch only IOException (covers SocketException/ SocketTimeoutException) in the reader loop and the liveness probe, so Errors (e.g. OOM) and unexpected RuntimeExceptions propagate and stay visible in development instead of being silently treated as a socket drop. The reader's finally still fires the single-shot drop notification on any unwind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A prior commit (44f178e) wrongly changed the ETX-fold switch in Lcr.cpp from `case LrcMode::NOEXT` to `case LrcMode::STD`, inverting the LrcMode semantics: - std = payload only (must NOT fold ETX) - noext = payload + ETX (must fold ETX) - stx = folds STX and ETX With the bug, default `std` mode folded ETX, so every frame's LRC was wrong (the terminal NAKs everything) and `test_lrc` failed on main. Revert that one line to `case LrcMode::NOEXT`. Verified locally against the test vectors (STD=0x3E, STX=0x3F, NOEXT=0x3D, STX_NOEXT=0x3C for payload 'A'). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s don't block The liveness probe read under READ_TIMEOUT_MS (100ms), so a healthy but idle socket — the normal between-transactions state — made every pre-send check wait the full reader timeout. Set a 1ms soTimeout for the probe (safe under ioLock, restored before releasing), so an alive idle socket returns near-instantly while a peer-closed socket still reports EOF immediately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem (observed on a real device)
Financial commands (verifyCard, pay…) INTERMITTENTLY fail with
Error: ECR17: transport disconnected during exchange, while safe commands(status, totals) succeed — apparent "works once, fails next". Logs show each
failing command:
sent→connecting→connected→error: transport disconnected.Root cause
ECR17/Nexi terminals close the TCP socket between transactions. The drop was
detected only reactively:
HybridEcr17Client::ensureConnected()sawtransport_->isConnected()returntruefor a half-open socket (the Kotlinreader thread had not yet observed the EOF — a race), so the command was sent on
a dead socket. The read then failed mid-exchange;
runTransaction's catchreconnected and applied the money-safety
RetryPolicy: safe/idempotent commandswere replayed (→ ok), but financial commands were (correctly) NOT replayed →
a FALSE "transport disconnected" surfaced. The reactive reconnect left a fresh
socket, so the user's next attempt landed on a good one → the alternation.
The money-safety behavior was correct; the bug was discovering the drop after
the send instead of before.
Fix — proactive (pre-send) detection
Make Kotlin
isConnected()a synchronous, non-destructive liveness probe:It peeks one byte on a
PushbackInputStreamwith a tiny (1ms) read timeout and immediately pushes it back: a returned-1(EOF) means the peer closed, a read timeout means the (idle) socket is alive. It never writes to the peer (unlikesendUrgentData, which can land an inline0xFFbefore the next STX frame on terminals withSO_OOBINLINEand corrupt a financial command) and never consumes a protocol byte. It runs underioLockso it cannot race the reader thread'sread(). On a detected close the socket is marked dead +onDisconnectfires, so the existingensureConnected()(called at the start of every command) reconnects before the send — every command starts on a verified-live socket.HybridEcr17Transport.kt—isConnected()liveness probe +markDropped().HybridEcr17Client.cpp— clarifying comment on the now-proactiveensureConnected().HybridEcr17Transport.swift— comment noting.readyparity + async caveat (best-effort, no iOS CI).docs/LESSON.md— the finding.Money-safety (unchanged)
RetryPolicy.hpp(financial commands never blindly retried) — untouched.sendLastResult('G') recovery — untouched.mid-exchange drop still surfaces and is recovered via 'G'.
onDisconnect(sets the sessiondisconnected_flag), but everyexchange()/sendAckOnly()callsresetForNewTransaction()first, so it cannotpoison the next real exchange.
Verification
Android build CI (dispatched on this branch).
🤖 Generated with Claude Code
LRC regression fix (included)
This PR also reverts a pre-existing LRC regression in
package/cpp/Lcr/Lcr.cpp. TheETX-folding switch had been changed from
case LrcMode::NOEXT:tocase LrcMode::STD:,which is wrong per the
LrcModesemantics:std— fold payload only (must not fold ETX)noext— fold payload + ETX (must fold ETX)stx/stx_noext— also fold STXWith the bug, the default
stdmode folded ETX, so every application frame carried awrong LRC (the terminal NAKs everything) and
test_lrcfailed. Restored tocase LrcMode::STX: case LrcMode::NOEXT:— cpp-tests green, runtime framing correct.