Skip to content

fix: make every byte<->str boundary UTF-8-safe on Windows (#91, #97)#104

Merged
fdaviddpt merged 3 commits into
mainfrom
fix/windows-encoding-boundaries-97-91
Jun 18, 2026
Merged

fix: make every byte<->str boundary UTF-8-safe on Windows (#91, #97)#104
fdaviddpt merged 3 commits into
mainfrom
fix/windows-encoding-boundaries-97-91

Conversation

@fdaviddpt

Copy link
Copy Markdown
Contributor

Context

On Windows the autosave crashed on every run (#97 — lone-surrogate UnicodeEncodeError) and corrupted memory into mojibake (#91/ decoded as cp1252). CI was green because every test mocks the two process boundaries (StringIO stdin, MagicMock subprocess), so the real decode never executed — on any OS. A green Windows matrix proved the OS ran the code, not that bytes survived the pipe.

Full boundary audit

Every place a non-UTF-8 byte can enter, fixed:

Boundary Source Fix
parse-haiku stdin pipe claude JSON via bash sys.stdin.reconfigure(encoding="utf-8") (guarded)
call_haiku subprocess claude stdout subprocess.run(..., encoding="utf-8", errors="replace")
memory-file reads (now.md, staging, recent, archive, last_entry) user-editable errors="replace"
session-JSONL reads (extract.py) Claude Code transcript errors="replace"
all text writes (shell.py) stray surrogate errors="replace"
staging-paths filename encode OS listdir (surrogateescape) .encode("utf-8", "surrogatepass")
prompt template read our repo (trusted) left strict (surface dev bugs)

Tests — the part that was missing

tests/test_encoding_boundaries.py exercises the real boundaries under a forced non-UTF-8 locale (PYTHONUTF8=0 PYTHONCOERCECLOCALE=0 LC_ALL=C) — so they reproduce on the Linux/macOS CI legs too. That forcing is exactly what the green Windows matrix lacked. Covers: UTF-8/CJK pipe round-trip, subprocess decode, lone-surrogate writes, and corrupt-file reads. Each fix was TDD'd (red proven by toggling the guard off). 441 passed.

Closes #97, #91.

🤖 Generated by Max

Florian DAVID added 3 commits June 18, 2026 20:48
Windows autosave was crashing (#97, lone-surrogate UnicodeEncodeError) and
corrupting memory into mojibake (#91, cp1252 mis-decode). CI was green because
every test mocks the two process boundaries (StringIO stdin, MagicMock
subprocess) — the decode never ran, on any OS.

Audited ALL byte<->str boundaries and made each UTF-8-explicit / crash-proof:
- INPUT decode (was locale-dependent):
  * shell.py parse-haiku: sys.stdin.reconfigure(encoding='utf-8') (guarded for
    StringIO tests) — redirected pipes use cp1252 on Windows, not UTF-8.
  * haiku.py: subprocess.run(..., encoding='utf-8', errors='replace') — text=True
    otherwise decodes claude's UTF-8 stdout as cp1252.
- READ of user-editable / external files (strict -> crash on a hand-edited byte):
  * shell.py memory-file reads (now.md, staging, recent, archive, last_entry) and
    extract.py session-JSONL reads: errors='replace'.
- WRITE (strict -> crash on a stray surrogate): all shell.py text writes get
  errors='replace'.
- FILENAME encode for the staging-paths bytes file: surrogatepass, so OS-listed
  names with undecodable bytes round-trip to the shell.
Template reads (prompts.py) stay strict — trusted repo files, surface dev bugs.

Tests (test_encoding_boundaries.py): real boundaries under a FORCED non-UTF-8
locale (PYTHONUTF8=0 PYTHONCOERCECLOCALE=0 LC_ALL=C) so they reproduce on the
Linux/macOS CI legs too — the forcing is precisely what the green Windows
matrix lacked. Pipe round-trip (arrow/CJK), subprocess decode, lone-surrogate
writes, and corrupt-file reads. 441 passed.

Closes #97
Closes #91

Co-Authored-By: Max <noreply>
The fake `claude` is a shebang+chmod script; Windows CreateProcess only runs
.exe, so the real-binary decode test fails with FileNotFoundError on the
windows-latest leg (CI #104). Skip it on win32 (the real decode under a forced
locale still runs on the POSIX legs) and add a cross-platform kwargs assertion
that call_haiku passes encoding='utf-8'/errors='replace' to subprocess.run —
so Windows still verifies the #91 decode fix is wired.

Co-Authored-By: Max <noreply>
…ow-up

Adversarial self-review caught that the blanket errors='replace' over-applied to
last-save.json, a machine-written structured file. A U+FFFD patched into it could
yield a wrong resume line (silent skip/re-process) instead of failing cleanly.

Revert both last-save.json sites to strict and broaden the read's except to
ValueError (covers JSONDecodeError + UnicodeDecodeError) → a corrupt file returns
0 (re-extract from start), never crashes and never a U+FFFD-mutated line. This is
better than the original too: strict + the old narrow except would have CRASHED on
a corrupt file (UnicodeDecodeError uncaught). User-prose reads/writes keep
errors='replace' — never crash a save on a hand-edited byte.

Test: corrupt last-save.json -> 0 (red proven by narrowing the except back).

Co-Authored-By: Max <noreply>
@fdaviddpt fdaviddpt merged commit 02791ae into main Jun 18, 2026
12 checks passed
@fdaviddpt fdaviddpt deleted the fix/windows-encoding-boundaries-97-91 branch June 18, 2026 19:10
fdaviddpt pushed a commit that referenced this pull request Jun 18, 2026
…ch-up

PATH ordering can't fix the WSL-bash problem: Windows CreateProcess searches
System32 BEFORE PATH, so subprocess.run(['bash']) always hits System32\bash.exe
(the WSL launcher). Add _find_bash() (plain 'bash' on POSIX, explicit
Git-for-Windows bash.exe on Windows, mirroring test_hooks_json._find_git_bash)
and invoke it everywhere; module skipif when Git Bash is absent. Revert the
ineffective GITHUB_PATH workflow shim. Combined with the earlier MSYS path
normalization, the three modules now run on Windows (TestDispatchOwnershipChecks
stays skipped — NTFS has no POSIX mode bits).

Also bring CHANGELOG [Unreleased] current — it was untouched across this
session's 7 merged PRs (#90/#93/#99/#101/#102/#103/#104). Added Fixed/Security
entries with issue/PR links + contributor credit, plus the #79 and encoding
test-module notes.

Co-Authored-By: Max <noreply>
fdaviddpt added a commit that referenced this pull request Jun 19, 2026
* ci: run shell-subprocess tests on Windows via Git Bash on PATH (#79)

The Windows runner's PATH bash resolves to the WSL launcher (no distro
installed) → every subprocess.run(['bash', ...]) returned the UTF-16 'no
distributions' error, so #78 skipped these modules on win32 — shipping no
Windows coverage for code paths that DO run there (Claude Code invokes hooks
via Git Bash).

- tests.yml: prepend C:\Program Files\Git\bin to PATH on the Windows row so
  bash resolves to Git Bash.
- Drop the module-level skipif(win32) from test_log_sh, test_migration,
  test_security_fixes.
- Keep TestDispatchOwnershipChecks skipped on win32: POSIX ownership +
  world-writable (0o777) semantics genuinely don't map to NTFS (Git Bash
  fakes mode bits) — the dispatch() guard is a no-op there.

If a now-unskipped test surfaces a real Windows bug (CRLF, path form), it gets
a targeted fix/follow-up, not a re-skip.

Co-Authored-By: Max <noreply>

* ci: prepend Git Bash in-step + MSYS-normalize paths for Windows shell tests (#79)

Two real Windows layers surfaced once the skips were removed:
- bash still resolved to the WSL launcher (System32\bash.exe) — appending to
  GITHUB_PATH didn't override it. Prepend C:\Program Files\Git\bin to $env:PATH
  in the same pwsh step that runs pytest, so subprocess.run(['bash']) children
  inherit Git Bash.
- the tests interpolate Windows backslash/drive paths into bash scripts
  (source C:\..., export DIR=C:\...), which bash mangles (\a/\c escapes,
  spaces). Add _bash_path() (C:\x -> /c/x, backslashes -> /) and apply+quote it
  at every path injected into a bash script or path-valued env var. No-op on
  POSIX. RCE-payload paths inside safe_eval heredocs left literal (intentional).

Co-Authored-By: Max <noreply>

* ci: invoke bash by explicit Git path on Windows (#79) + CHANGELOG catch-up

PATH ordering can't fix the WSL-bash problem: Windows CreateProcess searches
System32 BEFORE PATH, so subprocess.run(['bash']) always hits System32\bash.exe
(the WSL launcher). Add _find_bash() (plain 'bash' on POSIX, explicit
Git-for-Windows bash.exe on Windows, mirroring test_hooks_json._find_git_bash)
and invoke it everywhere; module skipif when Git Bash is absent. Revert the
ineffective GITHUB_PATH workflow shim. Combined with the earlier MSYS path
normalization, the three modules now run on Windows (TestDispatchOwnershipChecks
stays skipped — NTFS has no POSIX mode bits).

Also bring CHANGELOG [Unreleased] current — it was untouched across this
session's 7 merged PRs (#90/#93/#99/#101/#102/#103/#104). Added Fixed/Security
entries with issue/PR links + contributor credit, plus the #79 and encoding
test-module notes.

Co-Authored-By: Max <noreply>

* test: use forward-slash drive paths (C:/x) for bash+Windows-Python crossing (#79)

The MSYS form (/c/x) works for Git Bash but the test scripts also invoke the
Windows python3 (jq fallback's $PYTHON, migration's REMEMBER_DIR -> Path()),
and Windows Python can't open('/c/x'). Forward-slash drive form (C:/x) is
accepted by BOTH Git Bash and Windows Python. Switch _bash_path to emit C:/x,
and route the migration data_dir config through it so REMEMBER_DIR round-trips
to Python Path(). No-op on POSIX.

Co-Authored-By: Max <noreply>

* test: instrument migration test to capture Windows runner state (#79, temp)

* fix: recognize Windows drive paths (C:/, C:\) as absolute in lib-memory-dir.sh (#79)

External-mode data_dir resolution only treated /... and ~... as absolute, so a
Windows drive path (C:/Users/.../mem/{slug}) fell through to the relative branch
and got prepended to PROJECT_DIR — REMEMBER_DIR became proj/C:/.../{{slug}} (path
doubling + slug never substituted, since substitution lives in the absolute
branch). Real Windows users with a C:/-form data_dir hit this. Add
[A-Za-z]:/* and [A-Za-z]:\\* to the absolute case. Diagnosed from a Windows-CI
state dump; instrumentation reverted.

Co-Authored-By: Max <noreply>

* docs(changelog): record the lib-memory-dir.sh Windows data_dir fix (#79)

---------

Co-authored-by: Florian DAVID <fdavid@digital-village.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Windows: lone-surrogate UnicodeEncodeError crashes every autosave (shell.py text writes need errors="replace")

1 participant