fix(container): harden Apple async container ops (network delete + service IP) by abezzub-dr · Pull Request #367 · majorcontext/moat

abezzub-dr · 2026-06-02T15:01:28Z

Summary

Three fixes for the Apple containers runtime. Two are async teardown/startup races; one is an ID-parsing bug that container CLI 0.12.x exposed. All were found while debugging a real `moat claude --runtime apple` run.

1. Orphaned networks exhaust the IP pool (teardown)

Apple removes containers asynchronously, so the `container network delete` in `cleanupResources` often ran before the run's containers had detached, failing with errors such as (from ~/.moat/debug/):

cannot delete subnet moat-run_… because the IP allocator cannot be disabled with active containers
network moat-… has a pending operation

`RemoveNetwork` made a single attempt and `ForceRemoveNetwork` re-issued the same command, so the network leaked. Accumulated moat-run_* networks eventually exhaust Apple's /24 pool (192.168.64–127) and block new runs. Fix: `RemoveNetwork` retries with exponential backoff until the async detach completes.
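The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `removeWithRetry`, `isRetryable`, and `cliError` are hypothetical names, and only the two error substrings are taken from the debug output quoted in this description. The not-found-as-success case listed under Tests is left out because its error text is not shown here.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// cliError is a hypothetical stand-in for an error returned by the CLI wrapper.
type cliError string

func (e cliError) Error() string { return string(e) }

// isRetryable reports whether err looks like one of the transient teardown
// errors quoted in the PR description. Matching on substrings is an assumption.
func isRetryable(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "active containers") ||
		strings.Contains(msg, "pending operation")
}

// removeWithRetry calls del until it succeeds, returns a non-retryable error,
// or attempts are exhausted. The delay doubles after each failed attempt.
func removeWithRetry(del func() error, base time.Duration, attempts int) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = del(); err == nil {
			return nil
		}
		if !isRetryable(err) {
			return err // fail fast on anything that is not transient
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Fake delete command: fails twice with a transient error, then succeeds.
	del := func() error {
		calls++
		if calls < 3 {
			return cliError("network moat-example has a pending operation")
		}
		return nil
	}
	err := removeWithRetry(del, 10*time.Millisecond, 5)
	fmt.Println(calls, err)
}
```

Passing the delete function and base delay as parameters mirrors the injectability mentioned in the commit message, which is what lets the loop be exercised without the real CLI.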

2. Service runs abort with "no network address found for container" (the blocker)

On container CLI 0.12.x, service runs failed immediately. Root cause: `StartService` read the container ID from `CombinedOutput()` of `container run --detach`, but newer container versions write startup progress ([1/6] Fetching image, …) to stderr. The captured "ID" became the whole progress blob plus the name, so the follow-up inspect matched no container and the run aborted, even though the sidecar was running fine with an address (verified: a sidecar that moat rejected was running with 192.168.98.2). Fix: read the ID from stdout only (stderr is captured separately for errors) and parse it as the last non-empty line.
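The "last non-empty line" parse can be illustrated with a small pure helper. This is a sketch under assumptions: `lastNonEmptyLine` is a hypothetical name, and the sample output (including the ID `example-id`) is made up for the demonstration.

```go
package main

import (
	"fmt"
	"strings"
)

// lastNonEmptyLine returns the last line of out that is non-empty after
// trimming whitespace, or "" if there is no such line.
func lastNonEmptyLine(out string) string {
	lines := strings.Split(out, "\n")
	for i := len(lines) - 1; i >= 0; i-- {
		if line := strings.TrimSpace(lines[i]); line != "" {
			return line
		}
	}
	return ""
}

func main() {
	// Even if a progress line leaks into the captured output, the ID is
	// still recovered as long as it is printed last.
	out := "[1/6] Fetching image\nexample-id\n\n"
	fmt.Println(lastNonEmptyLine(out))
}
```

Taking the last line rather than the whole trimmed output is what makes the parse tolerant of extra lines appearing before the ID.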

3. Service container IP polling (defense-in-depth)

Apple assigns a container's address asynchronously. `getContainerIP` now polls until the address appears instead of failing on the first empty inspect. (In practice `run --detach` blocks until the address is assigned, so this is a safety net rather than the primary fix for item 2.)
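A bounded poll of this kind might look like the sketch below. `pollAddress` and its parameters are hypothetical names, and treating a failed inspect as retryable is an assumption based on the "transient-inspect-error" test listed under Tests.

```go
package main

import (
	"fmt"
	"time"
)

// pollAddress calls lookup up to attempts times, sleeping interval between
// tries, and returns the first non-empty address. Lookup errors are treated
// as transient and retried (an assumption for this sketch).
func pollAddress(lookup func() (string, error), interval time.Duration, attempts int) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		addr, err := lookup()
		if err == nil && addr != "" {
			return addr, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	if lastErr != nil {
		return "", fmt.Errorf("no network address after %d attempts: %w", attempts, lastErr)
	}
	return "", fmt.Errorf("no network address after %d attempts", attempts)
}

func main() {
	calls := 0
	// Fake inspect: the address is empty on the first two calls.
	lookup := func() (string, error) {
		calls++
		if calls < 3 {
			return "", nil
		}
		return "192.168.98.2", nil
	}
	addr, err := pollAddress(lookup, 10*time.Millisecond, 5)
	fmt.Println(addr, err, calls)
}
```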

Tests

go test ./internal/container/ — pass. New unit tests: retryable-error classifier, retry-until-detached, fail-fast, not-found-as-success, give-up-after-max (network); pure run-ID parser (strips progress lines); pure IPv4 parser, poll-until-assigned, timeout, transient-inspect-error (service).
go vet + golangci-lint clean on changed files.

Scope / follow-ups (not in this PR)

A genuinely wedged network (a "pending operation" persisting for hours, not seconds) is a deeper Apple container daemon-state bug that retrying cannot clear — container system stop && start or a reboot is required.
The orphan-network reaper (cleanOrphanNetworks) only sweeps the default runtime and only reaps networks whose run directory is already gone — making it sweep all initialized runtimes is a separate change.
Already-leaked networks still need moat clean (with this fix built in) or a manual container network delete sweep to reclaim.

🤖 Generated with Claude Code

…detach Apple's container CLI removes containers asynchronously, so the `container network delete` issued during run teardown often fired before the run's containers had detached, failing with "active containers" or "network has a pending operation". RemoveNetwork made a single attempt and ForceRemoveNetwork just re-issued the identical command, so both gave up and the network leaked — accumulating orphaned networks that eventually exhaust Apple's IP pool. RemoveNetwork now retries with exponential backoff on these transient errors until the detach completes (bounded by networkDeleteMaxAttempts), while non-transient failures still return immediately. The delete command and backoff base are injectable so the retry logic is unit-tested without the real CLI.

Apple's container CLI assigns a container's network address asynchronously, so the `container inspect` issued immediately after `run --detach` could return before the IPv4 address was attached. getContainerIP made a single attempt and failed the whole run with "no network address found for container", even though the address appears moments later. getContainerIP now polls (bounded by containerIPMaxAttempts) until the address is assigned. The address parsing is extracted into a pure, unit-tested helper, and the inspect command and poll interval are injectable so the retry is tested without the real CLI.

The real cause of "no network address found for container" on container CLI 0.12.x: StartService read the container ID from CombinedOutput() of `container run --detach`, but newer container CLI versions write startup progress ("[1/6] Fetching image", ...) to stderr. The captured ID was therefore the whole progress blob plus the name, so the follow-up inspect matched no container and the run aborted — even though the sidecar was running fine with an address. Read the ID from stdout only (stderr is captured separately for error messages) and parse it as the last non-empty line for resilience. This is the actual fix for the startup failure; the earlier address-polling change remains as defense-in-depth.
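To illustrate the difference between combined and separate capture, here is a minimal sketch using Go's `os/exec`. It does not call the real `container` CLI: `runShell` is a hypothetical helper, and the `sh -c` script is a stand-in that prints a progress line to stderr and an ID to stdout.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// runShell runs script with sh and returns stdout and stderr separately.
// With CombinedOutput() the two streams would be merged into one buffer.
func runShell(script string) (stdout, stderr string, err error) {
	var outBuf, errBuf bytes.Buffer
	cmd := exec.Command("sh", "-c", script)
	cmd.Stdout = &outBuf
	cmd.Stderr = &errBuf
	err = cmd.Run()
	return outBuf.String(), errBuf.String(), err
}

func main() {
	// Stand-in for a detached run: progress on stderr, ID on stdout.
	stdout, stderr, err := runShell(`echo "[1/6] Fetching image" >&2; echo example-id`)
	if err != nil {
		// stderr stays available for building a useful error message.
		fmt.Println("run failed:", err, strings.TrimSpace(stderr))
		return
	}
	fmt.Println("id:", strings.TrimSpace(stdout))
}
```

Keeping stderr in its own buffer means progress output can never contaminate the ID, while the text is still on hand when the command fails.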

abezzub-dr added 2 commits June 2, 2026 16:57

docs(changelog): add Apple network-leak retry fix

25a308b

abezzub-dr marked this pull request as draft June 2, 2026 15:03

abezzub-dr added 2 commits June 3, 2026 14:09

docs(changelog): add Apple service container IP poll fix

ead4d3f

abezzub-dr changed the title ~~fix(container): retry Apple network deletion through async container detach~~ fix(container): harden Apple async container ops (network delete + service IP) Jun 3, 2026

abezzub-dr marked this pull request as ready for review June 3, 2026 12:55

abezzub-dr wants to merge 5 commits into majorcontext:main from abezzub-dr:fix/apple-network-leak-retry