fix: make every byte<->str boundary UTF-8-safe on Windows (#91, #97)#104
Merged
Conversation
added 3 commits
June 18, 2026 20:48
Windows autosave was crashing (#97, lone-surrogate UnicodeEncodeError) and corrupting memory into mojibake (#91, cp1252 mis-decode). CI was green because every test mocks the two process boundaries (StringIO stdin, MagicMock subprocess) — the decode never ran, on any OS. Audited ALL byte<->str boundaries and made each UTF-8-explicit / crash-proof: - INPUT decode (was locale-dependent): * shell.py parse-haiku: sys.stdin.reconfigure(encoding='utf-8') (guarded for StringIO tests) — redirected pipes use cp1252 on Windows, not UTF-8. * haiku.py: subprocess.run(..., encoding='utf-8', errors='replace') — text=True otherwise decodes claude's UTF-8 stdout as cp1252. - READ of user-editable / external files (strict -> crash on a hand-edited byte): * shell.py memory-file reads (now.md, staging, recent, archive, last_entry) and extract.py session-JSONL reads: errors='replace'. - WRITE (strict -> crash on a stray surrogate): all shell.py text writes get errors='replace'. - FILENAME encode for the staging-paths bytes file: surrogatepass, so OS-listed names with undecodable bytes round-trip to the shell. Template reads (prompts.py) stay strict — trusted repo files, surface dev bugs. Tests (test_encoding_boundaries.py): real boundaries under a FORCED non-UTF-8 locale (PYTHONUTF8=0 PYTHONCOERCECLOCALE=0 LC_ALL=C) so they reproduce on the Linux/macOS CI legs too — the forcing is precisely what the green Windows matrix lacked. Pipe round-trip (arrow/CJK), subprocess decode, lone-surrogate writes, and corrupt-file reads. 441 passed. Closes #97 Closes #91 Co-Authored-By: Max <noreply>
The fake `claude` is a shebang+chmod script; Windows CreateProcess only runs .exe, so the real-binary decode test fails with FileNotFoundError on the windows-latest leg (CI #104). Skip it on win32 (the real decode under a forced locale still runs on the POSIX legs) and add a cross-platform kwargs assertion that call_haiku passes encoding='utf-8'/errors='replace' to subprocess.run — so Windows still verifies the #91 decode fix is wired. Co-Authored-By: Max <noreply>
…ow-up Adversarial self-review caught that the blanket errors='replace' over-applied to last-save.json, a machine-written structured file. A U+FFFD patched into it could yield a wrong resume line (silent skip/re-process) instead of failing cleanly. Revert both last-save.json sites to strict and broaden the read's except to ValueError (covers JSONDecodeError + UnicodeDecodeError) → a corrupt file returns 0 (re-extract from start), never crashes and never a U+FFFD-mutated line. This is better than the original too: strict + the old narrow except would have CRASHED on a corrupt file (UnicodeDecodeError uncaught). User-prose reads/writes keep errors='replace' — never crash a save on a hand-edited byte. Test: corrupt last-save.json -> 0 (red proven by narrowing the except back). Co-Authored-By: Max <noreply>
fdaviddpt
pushed a commit
that referenced
this pull request
Jun 18, 2026
…ch-up PATH ordering can't fix the WSL-bash problem: Windows CreateProcess searches System32 BEFORE PATH, so subprocess.run(['bash']) always hits System32\bash.exe (the WSL launcher). Add _find_bash() (plain 'bash' on POSIX, explicit Git-for-Windows bash.exe on Windows, mirroring test_hooks_json._find_git_bash) and invoke it everywhere; module skipif when Git Bash is absent. Revert the ineffective GITHUB_PATH workflow shim. Combined with the earlier MSYS path normalization, the three modules now run on Windows (TestDispatchOwnershipChecks stays skipped — NTFS has no POSIX mode bits). Also bring CHANGELOG [Unreleased] current — it was untouched across this session's 7 merged PRs (#90/#93/#99/#101/#102/#103/#104). Added Fixed/Security entries with issue/PR links + contributor credit, plus the #79 and encoding test-module notes. Co-Authored-By: Max <noreply>
fdaviddpt
added a commit
that referenced
this pull request
Jun 19, 2026
* ci: run shell-subprocess tests on Windows via Git Bash on PATH (#79) The Windows runner's PATH bash resolves to the WSL launcher (no distro installed) → every subprocess.run(['bash', ...]) returned the UTF-16 'no distributions' error, so #78 skipped these modules on win32 — shipping no Windows coverage for code paths that DO run there (Claude Code invokes hooks via Git Bash). - tests.yml: prepend C:\Program Files\Git\bin to PATH on the Windows row so bash resolves to Git Bash. - Drop the module-level skipif(win32) from test_log_sh, test_migration, test_security_fixes. - Keep TestDispatchOwnershipChecks skipped on win32: POSIX ownership + world-writable (0o777) semantics genuinely don't map to NTFS (Git Bash fakes mode bits) — the dispatch() guard is a no-op there. If a now-unskipped test surfaces a real Windows bug (CRLF, path form), it gets a targeted fix/follow-up, not a re-skip. Co-Authored-By: Max <noreply> * ci: prepend Git Bash in-step + MSYS-normalize paths for Windows shell tests (#79) Two real Windows layers surfaced once the skips were removed: - bash still resolved to the WSL launcher (System32\bash.exe) — appending to GITHUB_PATH didn't override it. Prepend C:\Program Files\Git\bin to $env:PATH in the same pwsh step that runs pytest, so subprocess.run(['bash']) children inherit Git Bash. - the tests interpolate Windows backslash/drive paths into bash scripts (source C:\..., export DIR=C:\...), which bash mangles (\a/\c escapes, spaces). Add _bash_path() (C:\x -> /c/x, backslashes -> /) and apply+quote it at every path injected into a bash script or path-valued env var. No-op on POSIX. RCE-payload paths inside safe_eval heredocs left literal (intentional). Co-Authored-By: Max <noreply> * ci: invoke bash by explicit Git path on Windows (#79) + CHANGELOG catch-up PATH ordering can't fix the WSL-bash problem: Windows CreateProcess searches System32 BEFORE PATH, so subprocess.run(['bash']) always hits System32\bash.exe (the WSL launcher). Add _find_bash() (plain 'bash' on POSIX, explicit Git-for-Windows bash.exe on Windows, mirroring test_hooks_json._find_git_bash) and invoke it everywhere; module skipif when Git Bash is absent. Revert the ineffective GITHUB_PATH workflow shim. Combined with the earlier MSYS path normalization, the three modules now run on Windows (TestDispatchOwnershipChecks stays skipped — NTFS has no POSIX mode bits). Also bring CHANGELOG [Unreleased] current — it was untouched across this session's 7 merged PRs (#90/#93/#99/#101/#102/#103/#104). Added Fixed/Security entries with issue/PR links + contributor credit, plus the #79 and encoding test-module notes. Co-Authored-By: Max <noreply> * test: use forward-slash drive paths (C:/x) for bash+Windows-Python crossing (#79) The MSYS form (/c/x) works for Git Bash but the test scripts also invoke the Windows python3 (jq fallback's $PYTHON, migration's REMEMBER_DIR -> Path()), and Windows Python can't open('/c/x'). Forward-slash drive form (C:/x) is accepted by BOTH Git Bash and Windows Python. Switch _bash_path to emit C:/x, and route the migration data_dir config through it so REMEMBER_DIR round-trips to Python Path(). No-op on POSIX. Co-Authored-By: Max <noreply> * test: instrument migration test to capture Windows runner state (#79, temp) * fix: recognize Windows drive paths (C:/, C:\) as absolute in lib-memory-dir.sh (#79) External-mode data_dir resolution only treated /... and ~... as absolute, so a Windows drive path (C:/Users/.../mem/{slug}) fell through to the relative branch and got prepended to PROJECT_DIR — REMEMBER_DIR became proj/C:/.../{{slug}} (path doubling + slug never substituted, since substitution lives in the absolute branch). Real Windows users with a C:/-form data_dir hit this. Add [A-Za-z]:/* and [A-Za-z]:\\* to the absolute case. Diagnosed from a Windows-CI state dump; instrumentation reverted. Co-Authored-By: Max <noreply> * docs(changelog): record the lib-memory-dir.sh Windows data_dir fix (#79) --------- Co-authored-by: Florian DAVID <fdavid@digital-village.fr>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
On Windows the autosave crashed on every run (#97 — lone-surrogate
UnicodeEncodeError) and corrupted memory into mojibake (#91 —→/—decoded as cp1252). CI was green because every test mocks the two process boundaries (StringIOstdin,MagicMocksubprocess), so the real decode never executed — on any OS. A green Windows matrix proved the OS ran the code, not that bytes survived the pipe.Full boundary audit
Every place a non-UTF-8 byte can enter, fixed:
parse-haikustdin pipesys.stdin.reconfigure(encoding="utf-8")(guarded)call_haikusubprocesssubprocess.run(..., encoding="utf-8", errors="replace")errors="replace"extract.py)errors="replace"shell.py)errors="replace"listdir(surrogateescape).encode("utf-8", "surrogatepass")Tests — the part that was missing
tests/test_encoding_boundaries.pyexercises the real boundaries under a forced non-UTF-8 locale (PYTHONUTF8=0 PYTHONCOERCECLOCALE=0 LC_ALL=C) — so they reproduce on the Linux/macOS CI legs too. That forcing is exactly what the green Windows matrix lacked. Covers: UTF-8/CJK pipe round-trip, subprocess decode, lone-surrogate writes, and corrupt-file reads. Each fix was TDD'd (red proven by toggling the guard off). 441 passed.Closes #97, #91.
🤖 Generated by Max