fix(container): harden Apple async container ops (network delete + service IP)#367

Open
abezzub-dr wants to merge 5 commits into majorcontext:main from abezzub-dr:fix/apple-network-leak-retry

abezzub-dr commented Jun 2, 2026

Summary

Three fixes for the Apple containers runtime. Two are async teardown/startup races; one is an ID-parsing bug that container CLI 0.12.x exposed. All were found while debugging a real `moat claude --runtime apple` run.

1. Orphaned networks exhaust the IP pool (teardown)

Apple removes containers asynchronously, so the `container network delete` in cleanupResources often ran before the run's containers had detached, failing with errors like these (from ~/.moat/debug/):

cannot delete subnet moat-run_… because the IP allocator cannot be disabled with active containers
network moat-… has a pending operation

RemoveNetwork made a single attempt and ForceRemoveNetwork re-issued the same command, so the network leaked. Accumulated moat-run_* networks eventually exhaust Apple's /24 pool (192.168.64–127) and block new runs. Fix: RemoveNetwork retries with exponential backoff on these transient errors until the async detach completes, bounded by networkDeleteMaxAttempts; non-transient failures still return immediately.

2. Service runs abort with "no network address found for container" (the blocker)

On container CLI 0.12.x, service runs failed immediately. Root cause: StartService read the container ID from CombinedOutput() of `container run --detach`, but newer container versions write startup progress ([1/6] Fetching image, …) to stderr. The captured "ID" became the whole progress blob plus the name, so the follow-up inspect matched no container and the run aborted, even though the sidecar was running fine with an address (verified: a sidecar that moat had rejected was running with 192.168.98.2). Fix: read the ID from stdout only (stderr is captured separately for error messages) and parse it as the last non-empty line.
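The "last non-empty line" rule can be sketched as a small pure function. This is an illustrative version, not necessarily the parser in this PR; `parseRunID` and the sample ID `moat-sidecar-example` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// parseRunID returns the last non-empty line of the command's stdout,
// trimmed of surrounding whitespace, or "" if every line is blank.
func parseRunID(stdout string) string {
	lines := strings.Split(stdout, "\n")
	for i := len(lines) - 1; i >= 0; i-- {
		if line := strings.TrimSpace(lines[i]); line != "" {
			return line
		}
	}
	return ""
}

func main() {
	// Even if progress text leaks into the stream, only the final line is used.
	out := "[1/6] Fetching image\nmoat-sidecar-example\n\n"
	fmt.Println(parseRunID(out)) // prints "moat-sidecar-example"
}
```

Taking the last line rather than the whole output keeps the parse resilient if a future CLI version prints extra lines before the ID.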

3. Service container IP polling (defense-in-depth)

Apple assigns a container's address asynchronously. getContainerIP now polls until the address appears instead of failing on the first empty inspect. (In practice run --detach blocks until the address is assigned, so this is a safety net rather than the primary fix for #2.)

Tests

  • go test ./internal/container/ — pass. New unit tests: retryable-error classifier, retry-until-detached, fail-fast, not-found-as-success, give-up-after-max (network); pure run-ID parser (strips progress lines); pure IPv4 parser, poll-until-assigned, timeout, transient-inspect-error (service).
  • go vet + golangci-lint clean on changed files.

Scope / follow-ups (not in this PR)

  • A genuinely wedged network (a "pending operation" persisting for hours, not seconds) is a deeper Apple container daemon-state bug that retrying cannot clear — container system stop && start or a reboot is required.
  • The orphan-network reaper (cleanOrphanNetworks) only sweeps the default runtime and only reaps networks whose run directory is already gone — making it sweep all initialized runtimes is a separate change.
  • Already-leaked networks still need moat clean (with this fix built in) or a manual container network delete sweep to reclaim.

🤖 Generated with Claude Code

…detach

Apple's container CLI removes containers asynchronously, so the
`container network delete` issued during run teardown often fired before
the run's containers had detached, failing with "active containers" or
"network has a pending operation". RemoveNetwork made a single attempt and
ForceRemoveNetwork just re-issued the identical command, so both gave up
and the network leaked — accumulating orphaned networks that eventually
exhaust Apple's IP pool.

RemoveNetwork now retries with exponential backoff on these transient
errors until the detach completes (bounded by networkDeleteMaxAttempts),
while non-transient failures still return immediately. The delete command
and backoff base are injectable so the retry logic is unit-tested without
the real CLI.

abezzub-dr marked this pull request as draft June 2, 2026 15:03

Apple's container CLI assigns a container's network address asynchronously,
so the `container inspect` issued immediately after `run --detach` could
return before the IPv4 address was attached. getContainerIP made a single
attempt and failed the whole run with "no network address found for
container", even though the address appears moments later.

getContainerIP now polls (bounded by containerIPMaxAttempts) until the
address is assigned. The address parsing is extracted into a pure,
unit-tested helper, and the inspect command and poll interval are injectable
so the retry is tested without the real CLI.

abezzub-dr changed the title from "fix(container): retry Apple network deletion through async container detach" to "fix(container): harden Apple async container ops (network delete + service IP)" Jun 3, 2026

The real cause of "no network address found for container" on container
CLI 0.12.x: StartService read the container ID from CombinedOutput() of
`container run --detach`, but newer container CLI versions write startup
progress ("[1/6] Fetching image", ...) to stderr. The captured ID was
therefore the whole progress blob plus the name, so the follow-up inspect
matched no container and the run aborted — even though the sidecar was
running fine with an address.

Read the ID from stdout only (stderr is captured separately for error
messages) and parse it as the last non-empty line for resilience. This is
the actual fix for the startup failure; the earlier address-polling change
remains as defense-in-depth.

abezzub-dr marked this pull request as ready for review June 3, 2026 12:55