Harden ACP/LLM transport and sandbox experiment execution#290

Open
DermotOBrien-EC wants to merge 5 commits into aiming-lab:main from DermotOBrien-EC:reliability-fixes

@DermotOBrien-EC

Contributor

Five focused reliability fixes for long autonomous runs, concentrated in the LLM/ACP transport and the sandbox experiment path.

  • Configurable ACP session-init timeout (default 120s). Cold acpx agent start was observed at ~31s, overrunning the old hardcoded 30s session create/ensure budget and flaking a run's first stage. Adds AcpConfig.session_init_timeout_sec and falls back from ensure to new on TimeoutExpired.
  • Default LLM client timeout aligned to 600s. A direct LLMConfig construction path silently inherited the old 300s default; this matches it to the from_rc_config fallback.
  • Surface acpx stdout in ACP failure diagnostics. acpx often writes the real error to stdout while stderr is empty; the failure message now includes both (tail-capped) and preserves reconnect / command-too-long markers so truncation cannot disable the retry and stdin-fallback paths.
  • Parameterise the sandbox dataset cache root (default /opt/datasets, so existing configs render identical prompts). Also tightens the prompt contract: forbid silent synthetic-data fallback, require FileNotFoundError + non-zero exit on missing cached data, and emit a DATASET_USED: <name> stamp. Note: in this PR the no-synthetic rule and the DATASET_USED stamp are prompt-level guidance only — runtime enforcement of dataset provenance is intentionally deferred to a follow-up PR.
  • Capture structured sandbox metrics. Prefer the experiment's own results.json over stdout-parsed metrics, clear the sandbox before each run, and mtime-anchor the discovery glob so a prior failed run can't leak stale results.
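
The capture order in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the executor code from this PR: the helper name `collect_metrics`, its parameters, and the assumption that results.json holds a JSON object are all hypothetical.

```python
import json
from pathlib import Path


def collect_metrics(sandbox_dir: Path, run_started_at: float, stdout_metrics: dict) -> dict:
    """Prefer a fresh results.json over stdout-parsed metrics (hypothetical helper)."""
    candidates = [
        p
        for p in sandbox_dir.rglob("results.json")
        # mtime anchor: skip files last written before this run started,
        # so output from a prior failed run is never picked up.
        if p.stat().st_mtime >= run_started_at
    ]
    if not candidates:
        return stdout_metrics

    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    try:
        data = json.loads(newest.read_text())
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed file: fall back to the stdout-parsed metrics.
        return stdout_metrics
    return data if isinstance(data, dict) else stdout_metrics
```

Clearing the sandbox before each run and anchoring on mtime overlap deliberately in this sketch: either one alone would miss a case, such as a file that survives a failed cleanup.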

Tests: +678 lines across test_rc_llm.py, test_rc_prompts.py, test_rc_config.py, and test_rc_executor.py.

Verify: pytest tests/test_rc_llm.py tests/test_rc_prompts.py tests/test_rc_config.py tests/test_rc_executor.py

Add experiment.sandbox.dataset_cache_root (default /opt/datasets) and
thread it through the network_disabled_guidance prompt block so
generated experiment code is instructed to load torchvision datasets
from the configured path with download=False. The default value
matches the prior hardcoded constant, so existing configs that omit
the field render identical prompts.

Tighten missing-data semantics: the prompt now forbids silent
synthetic-data fallback and requires FileNotFoundError + non-zero
exit if a pre-cached dataset file is missing. This is an
intentional behaviour change for every sandbox network_policy="none"
codegen call, motivated by a focused-replay defect where missing
MNIST raw files were papered over with synthetic tensors.

Add an explicit DATASET_USED: <name> stdout-stamp requirement so
downstream metric capture has a dataset-provenance signal
independent of whatever JSON result schema CodeAgent invents.
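
For illustration, generated experiment code that follows this contract might look something like the sketch below. The helper `require_cached_dataset`, its parameters, and the `<cache_root>/<name>/<file>` layout are assumptions made for the example; the prompt itself only specifies the behaviour (no synthetic fallback, FileNotFoundError, non-zero exit, and the stamp).

```python
from pathlib import Path


def require_cached_dataset(cache_root: str, name: str, required_files: list[str]) -> Path:
    """Fail loudly if any pre-cached file is missing; never substitute synthetic data."""
    root = Path(cache_root) / name  # assumed layout: <cache_root>/<name>/<file>
    missing = [f for f in required_files if not (root / f).is_file()]
    if missing:
        raise FileNotFoundError(f"{name}: missing cached files under {root}: {missing}")
    # Provenance stamp for downstream metric capture.
    print(f"DATASET_USED: {name}")
    return root
```

If the experiment script leaves the FileNotFoundError uncaught, the interpreter exits with a non-zero status, which satisfies the exit-code half of the contract without extra handling.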

Focused replay (CODE_GENERATION..EXPERIMENT_RUN against MNIST raw
files pre-staged at /tmp/arc_sandbox_trial/datasets) confirms all
three checks: the stamp appears in stdout, run-1.json:metrics
carries 105 per-condition namespaced keys, and the canonical
runs/results.json contains the structured harness output rather
than a stdout-parsed fallback.

Include both tail-capped stdout and stderr in the exit-N ACP failure
message; acpx often writes the real error to stdout while stderr is
empty, so the prior message discarded it. Preserve reconnect and
command-too-long markers verbatim from the raw streams so tail
truncation cannot disable the retry and stdin-fallback paths in
_send_prompt.
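
One way to picture the message construction is the sketch below. The marker strings, the `tail_chars` cap, and the function name are placeholders chosen for the example; the real markers checked by _send_prompt are not shown in this PR description.

```python
# Hypothetical marker strings; the actual values live in the transport code.
MARKERS = ("reconnect", "command too long")


def build_failure_message(returncode: int, stdout: str, stderr: str, tail_chars: int = 2000) -> str:
    """Combine tail-capped stdout and stderr, keeping any marker that truncation dropped."""
    parts = [f"ACP command failed (exit {returncode})"]
    for label, text in (("stdout", stdout), ("stderr", stderr)):
        if text:
            parts.append(f"{label} (tail): {text[-tail_chars:]}")

    raw = f"{stdout}\n{stderr}".lower()
    capped = "\n".join(parts).lower()
    for marker in MARKERS:
        # Scan the raw streams, not the tails, so a marker near the start
        # of a long stream still reaches the retry logic.
        if marker in raw and marker not in capped:
            parts.append(f"marker: {marker}")
    return "\n".join(parts)
```

Scanning the untruncated streams is the important part: a check that only looks at the capped tail would silently lose the signal whenever the marker appears early in long output.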

Align the LLMConfig dataclass default with the from_rc_config() fallback
of 600s. A direct LLMConfig construction path omitted timeout_sec and
silently inherited the old 300s default.
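
A rough sketch of how the two defaults can be kept from drifting apart again is to share one constant. `LLMConfigSketch`, `from_mapping`, and the `model` field are stand-ins invented for this example and are not the project's actual LLMConfig or from_rc_config() signature.

```python
from dataclasses import dataclass

DEFAULT_TIMEOUT_SEC = 600  # single source of truth for both construction paths


@dataclass
class LLMConfigSketch:
    model: str
    timeout_sec: int = DEFAULT_TIMEOUT_SEC  # used when constructed directly

    @classmethod
    def from_mapping(cls, raw: dict) -> "LLMConfigSketch":
        # Loader path: same fallback as the dataclass default.
        return cls(
            model=raw["model"],
            timeout_sec=int(raw.get("timeout_sec", DEFAULT_TIMEOUT_SEC)),
        )
```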

Cold agent start-up through acpx can exceed the old hardcoded 30s
session create/ensure budget (claude observed at ~31s), flaking the
first stage of a run. Add session_init_timeout_sec (default 120) on
AcpConfig/ACPConfig, thread it through the loader and from_rc_config,
and fall back from 'ensure' to 'new' on TimeoutExpired instead of
aborting session init with an unhandled exception.
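
The fallback described above could be sketched along these lines. The `build_cmd` callback and the injectable `runner` are hypothetical conveniences for the example; how the real code assembles the acpx command line is not shown in this PR description.

```python
import subprocess
from typing import Callable, Sequence


def init_session(
    build_cmd: Callable[[str], Sequence[str]],
    timeout_sec: float = 120.0,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Try the 'ensure' mode first; on timeout, retry once with 'new'."""
    try:
        runner(build_cmd("ensure"), capture_output=True, text=True,
               timeout=timeout_sec, check=True)
        return "ensure"
    except subprocess.TimeoutExpired:
        # A slow cold start no longer aborts session init; a second
        # timeout here still propagates to the caller.
        runner(build_cmd("new"), capture_output=True, text=True,
               timeout=timeout_sec, check=True)
        return "new"
```

In this sketch only TimeoutExpired triggers the fallback. Other failures, such as a non-zero exit raised as CalledProcessError, propagate unchanged.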

@DermotOBrien-EC marked this pull request as ready for review June 4, 2026 19:23