Skip to content

fix(transport): proactive reconnect — detect peer-closed socket before sending#11

Merged
lopadova merged 6 commits into
mainfrom
fix/proactive-reconnect
May 29, 2026
Merged

fix(transport): proactive reconnect — detect peer-closed socket before sending#11
lopadova merged 6 commits into
mainfrom
fix/proactive-reconnect

Conversation

@lopadova
Copy link
Copy Markdown
Contributor

@lopadova lopadova commented May 28, 2026

Problem (observed on a real device)

Financial commands (verifyCard, pay…) INTERMITTENTLY fail with
Error: ECR17: transport disconnected during exchange, while safe commands
(status, totals) succeed — apparent "works once, fails next". Logs show each
failing command: sentconnectingconnectederror: transport disconnected.

Root cause

ECR17/Nexi terminals close the TCP socket between transactions. The drop was
detected only reactively: HybridEcr17Client::ensureConnected() saw
transport_->isConnected() return true for a half-open socket (the Kotlin
reader thread had not yet observed the EOF — a race), so the command was sent on
a dead socket
. The read then failed mid-exchange; runTransaction's catch
reconnected and applied the money-safety RetryPolicy: safe/idempotent commands
were replayed (→ ok), but financial commands were (correctly) NOT replayed →
a FALSE "transport disconnected" surfaced. The reactive reconnect left a fresh
socket, so the user's next attempt landed on a good one → the alternation.

The money-safety behavior was correct; the bug was discovering the drop after
the send instead of before.

Fix — proactive (pre-send) detection

Make Kotlin isConnected() a synchronous, non-destructive liveness probe:
It peeks one byte on a PushbackInputStream with a tiny (1ms) read timeout and immediately pushes it back: a returned -1 (EOF) means the peer closed, a read timeout means the (idle) socket is alive. It never writes to the peer (unlike sendUrgentData, which can land an inline 0xFF before the next STX frame on terminals with SO_OOBINLINE and corrupt a financial command) and never consumes a protocol byte. It runs under ioLock so it cannot race the reader thread's read(). On a detected close the socket is marked dead + onDisconnect fires, so the existing ensureConnected() (called at the start of every command) reconnects before the send — every command starts on a verified-live socket.

  • HybridEcr17Transport.ktisConnected() liveness probe + markDropped().
  • HybridEcr17Client.cpp — clarifying comment on the now-proactive ensureConnected().
  • HybridEcr17Transport.swift — comment noting .ready parity + async caveat (best-effort, no iOS CI).
  • docs/LESSON.md — the finding.

Money-safety (unchanged)

  • RetryPolicy.hpp (financial commands never blindly retried) — untouched.
  • sendLastResult ('G') recovery — untouched.
  • Only the FALSE drop from a stale pre-send socket is removed; a genuine
    mid-exchange drop still surfaces and is recovered via 'G'.
  • The probe fires onDisconnect (sets the session disconnected_ flag), but every
    exchange()/sendAckOnly() calls resetForNewTransaction() first, so it cannot
    poison the next real exchange.

Verification

  • TS typecheck green locally. Native (C++/Kotlin) verified only by the manual
    Android build CI (dispatched on this branch).

🤖 Generated with Claude Code

LRC regression fix (included)

This PR also reverts a pre-existing LRC regression in package/cpp/Lcr/Lcr.cpp. The
ETX-folding switch had been changed from case LrcMode::NOEXT: to case LrcMode::STD:,
which is wrong per the LrcMode semantics:

  • std — fold payload only (must not fold ETX)
  • noext — fold payload + ETX (must fold ETX)
  • stx / stx_noext — also fold STX

With the bug, the default std mode folded ETX, so every application frame carried a
wrong LRC (the terminal NAKs everything) and test_lrc failed. Restored to
case LrcMode::STX: case LrcMode::NOEXT: — cpp-tests green, runtime framing correct.

ECR17/Nexi terminals close the TCP socket between transactions. The drop
was detected only reactively: ensureConnected() saw isConnected()==true for
a half-open socket (the Kotlin reader thread had not observed EOF yet), so a
command was sent on a dead socket; the read then failed mid-exchange and the
money-safety RetryPolicy (correctly) refused to replay financial commands,
surfacing a FALSE "transport disconnected during exchange". Safe/idempotent
commands were replayed and succeeded, producing the observed "works once,
fails next" alternation.

Fix: make Kotlin isConnected() a synchronous non-destructive liveness probe
(socket.sendUrgentData) that detects a peer-closed/half-open socket BEFORE
the send, so the existing ensureConnected() reconnects proactively and every
command starts on a verified-live socket. The probe writes one TCP OOB byte
that throws on a closed peer but does not touch the data stream (cannot
corrupt an STX/ETX frame, does not race the reader thread).

Money-safety is unchanged: RetryPolicy (financial never blindly retried) and
sendLastResult ('G') recovery are untouched; only the FALSE drop from a stale
pre-send socket is removed. A genuine mid-exchange drop still surfaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e557443123

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread package/android/src/main/java/com/margelo/nitro/ecr17/HybridEcr17Transport.kt Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses intermittent transaction failures caused by ECR17/Nexi terminals closing the TCP socket between transactions by adding a proactive (pre-send) connection liveness check, so commands start on a verified-live socket rather than discovering the drop mid-exchange.

Changes:

  • Android: enhance isConnected() with a synchronous liveness probe (sendUrgentData) and add a drop-marking helper.
  • C++/iOS: clarify the intended “pre-send liveness check” behavior via comments.
  • Docs: capture the lesson learned and the rationale for the proactive probe.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
package/android/src/main/java/com/margelo/nitro/ecr17/HybridEcr17Transport.kt Adds proactive socket liveness probing and a dropped-connection marker to avoid sending on half-open sockets.
package/cpp/Ecr17Client/HybridEcr17Client.cpp Documents that ensureConnected() now relies on a proactive liveness probe before sending.
package/ios/HybridEcr17Transport.swift Documents iOS parity expectations/limitations for the pre-send liveness check.
docs/LESSON.md Records the real-device failure mode, root cause, and the proactive detection approach.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread package/android/src/main/java/com/margelo/nitro/ecr17/HybridEcr17Transport.kt Outdated
Address reviewer feedback (Codex P2 + Copilot) on the liveness probe:

- Codex P2: socket.sendUrgentData writes a real TCP urgent byte; on a
  terminal with SO_OOBINLINE that 0xFF lands inline before the next STX
  frame and could corrupt a financial command. Replace it with a
  non-destructive, WRITE-FREE probe: a 1-byte peek on a PushbackInputStream
  (short soTimeout) — read()==-1 => peer closed; SocketTimeoutException =>
  idle but alive; any byte read is unread() so a protocol byte is never
  consumed or placed on the peer stream. The reader thread and the probe
  share the input stream under one ioLock (reader uses a short read timeout
  so it yields the lock between reads), so they never read concurrently.

- Copilot: a drop could fire onDisconnect twice (reader finally + probe).
  Add a single-shot AtomicBoolean so a drop notifies exactly once across
  both paths; both paths also close the socket so isConnected()'s isClosed
  check is an immediate, reliable signal.

- Copilot: narrow exception handling — SocketTimeoutException means "alive",
  only genuine I/O failures mark the socket dropped.

Money-safety unchanged: RetryPolicy and sendLastResult ('G') untouched;
only the FALSE drop from a stale pre-send socket is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Per Copilot review: catch only IOException (covers SocketException/
SocketTimeoutException) in the reader loop and the liveness probe, so Errors
(e.g. OOM) and unexpected RuntimeExceptions propagate and stay visible in
development instead of being silently treated as a socket drop. The reader's
finally still fires the single-shot drop notification on any unwind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

A prior commit (44f178e) wrongly changed the ETX-fold switch in Lcr.cpp from
`case LrcMode::NOEXT` to `case LrcMode::STD`, inverting the LrcMode semantics:
- std   = payload only (must NOT fold ETX)
- noext = payload + ETX (must fold ETX)
- stx   = folds STX and ETX

With the bug, default `std` mode folded ETX, so every frame's LRC was wrong
(the terminal NAKs everything) and `test_lrc` failed on main. Revert that one
line to `case LrcMode::NOEXT`. Verified locally against the test vectors
(STD=0x3E, STX=0x3F, NOEXT=0x3D, STX_NOEXT=0x3C for payload 'A').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread package/cpp/Lcr/Lcr.cpp
…s don't block

The liveness probe read under READ_TIMEOUT_MS (100ms), so a healthy but idle
socket — the normal between-transactions state — made every pre-send check wait
the full reader timeout. Set a 1ms soTimeout for the probe (safe under ioLock,
restored before releasing), so an alive idle socket returns near-instantly while
a peer-closed socket still reports EOF immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment thread PROGRESS.md
@lopadova lopadova merged commit 46f57d9 into main May 29, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants