fix(container): harden Apple async container ops (network delete + service IP)#367
Open
abezzub-dr wants to merge 5 commits into
…detach

Apple's container CLI removes containers asynchronously, so the `container network delete` issued during run teardown often fired before the run's containers had detached, failing with "active containers" or "network has a pending operation". RemoveNetwork made a single attempt and ForceRemoveNetwork just re-issued the identical command, so both gave up and the network leaked, accumulating orphaned networks that eventually exhaust Apple's IP pool.

RemoveNetwork now retries with exponential backoff on these transient errors until the detach completes (bounded by networkDeleteMaxAttempts), while non-transient failures still return immediately. The delete command and backoff base are injectable so the retry logic is unit-tested without the real CLI.
Apple's container CLI assigns a container's network address asynchronously, so the `container inspect` issued immediately after `run --detach` could return before the IPv4 address was attached. getContainerIP made a single attempt and failed the whole run with "no network address found for container", even though the address appears moments later.

getContainerIP now polls (bounded by containerIPMaxAttempts) until the address is assigned. The address parsing is extracted into a pure, unit-tested helper, and the inspect command and poll interval are injectable so the retry is tested without the real CLI.
The real cause of "no network address found for container" on container
CLI 0.12.x: StartService read the container ID from CombinedOutput() of
`container run --detach`, but newer container CLI versions write startup
progress ("[1/6] Fetching image", ...) to stderr. The captured ID was
therefore the whole progress blob plus the name, so the follow-up inspect
matched no container and the run aborted — even though the sidecar was
running fine with an address.
Read the ID from stdout only (stderr is captured separately for error
messages) and parse it as the last non-empty line for resilience. This is
the actual fix for the startup failure; the earlier address-polling change
remains as defense-in-depth.
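The stdout-only read can be illustrated with Go's `os/exec`. The sketch below is not the PR's implementation: `parseRunID` and `startDetached` are hypothetical names, and the sample output in `main` reuses only the one progress line quoted above plus a made-up container name.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// parseRunID returns the last non-empty line of stdout, trimmed. Taking the
// last line tolerates any extra lines that may precede the ID.
func parseRunID(stdout string) string {
	lines := strings.Split(stdout, "\n")
	for i := len(lines) - 1; i >= 0; i-- {
		if line := strings.TrimSpace(lines[i]); line != "" {
			return line
		}
	}
	return ""
}

// startDetached runs cmd with stdout and stderr captured separately, so
// progress written to stderr can never leak into the parsed ID. stderr is
// used only to enrich the error message.
func startDetached(cmd *exec.Cmd) (string, error) {
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("run failed: %w: %s", err, strings.TrimSpace(stderr.String()))
	}
	id := parseRunID(stdout.String())
	if id == "" {
		return "", fmt.Errorf("run produced no container ID on stdout")
	}
	return id, nil
}

func main() {
	// Sample text only; "my-sidecar" is a made-up name for illustration.
	out := "[1/6] Fetching image\nmy-sidecar\n\n"
	fmt.Println(parseRunID(out)) // prints "my-sidecar"
}
```

By contrast, `CombinedOutput()` interleaves both streams into one buffer, which is how progress text ended up inside the captured ID.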
Summary

Three fixes for the Apple containers runtime. Two are async-teardown/startup races; one is an ID-parsing bug that `container` CLI 0.12.x exposed. All were found debugging a real `moat claude --runtime apple` run.

1. Orphaned networks exhaust the IP pool (teardown)

Apple removes containers asynchronously, so the `container network delete` in `cleanupResources` often ran before the run's containers had detached, failing with "active containers" or "network has a pending operation" (from `~/.moat/debug/`). `RemoveNetwork` made a single attempt and `ForceRemoveNetwork` re-issued the same command, so the network leaked; accumulated `moat-run_*` networks eventually exhaust Apple's `/24` pool (192.168.64–127) and block new runs.

Fix: `RemoveNetwork` retries with exponential backoff until the async detach completes.

2. Service runs abort with "no network address found for container" (the blocker)

On `container` CLI 0.12.x, service runs failed immediately. Root cause: `StartService` read the container ID from `CombinedOutput()` of `container run --detach`, but newer `container` versions write startup progress (`[1/6] Fetching image`, …) to stderr. The captured "ID" became the whole progress blob plus the name, so the follow-up `inspect` matched no container and the run aborted, even though the sidecar was running fine with an address (verified: a sidecar moat rejected was `running` with `192.168.98.2`).

Fix: read the ID from stdout only (stderr captured separately for errors), parsed as the last non-empty line.

3. Service container IP polling (defense-in-depth)

Apple assigns a container's address asynchronously. `getContainerIP` now polls until the address appears instead of failing on the first empty inspect. (In practice `run --detach` blocks until the address is assigned, so this is a safety net rather than the primary fix for #2.)

Tests

`go test ./internal/container/`: pass. New unit tests:

- network: retryable-error classifier, retry-until-detached, fail-fast, not-found-as-success, give-up-after-max
- pure run-ID parser (strips progress lines)
- service: pure IPv4 parser, poll-until-assigned, timeout, transient-inspect-error

`go vet` + `golangci-lint` clean on changed files.

Scope / follow-ups (not in this PR)

- … `container` daemon-state bug that retrying cannot clear; `container system stop && start` or a reboot is required.
- … (`cleanOrphanNetworks`) only sweeps the default runtime and only reaps networks whose run directory is already gone; making it sweep all initialized runtimes is a separate change.
- … `moat clean` (with this fix built in) or a manual `container network delete` sweep to reclaim.

🤖 Generated with Claude Code