Update eval harness with richer scenarios and diagnostics#1

Merged
zac merged 11 commits into main from eval-cli on Apr 30, 2026

@zac commented Apr 30, 2026

Summary

  • Expanded the built-in eval set with additional filesystem and catalog scenarios covering API shapes, recovery, and capability minimization.
  • Added multi-step execute support plus clearer runtime/tool guidance so models use top-level returns and avoid unreturned async IIFEs.
  • Tightened filesystem catalog metadata and added diagnostics/tests for missing return values and async IIFE misuse.
  • Re-ran and updated both core and failures r5 baselines after clean live runs.
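To illustrate the async IIFE guidance in the summary above, here is a minimal TypeScript sketch of why an unreturned async IIFE loses its value while a top-level return does not, followed by a rough heuristic for flagging both problems. Everything in it is an assumption made for illustration: `readFile`, `hasUnreturnedAsyncIife`, `diagnose`, and the message strings are hypothetical and do not reflect this PR's actual runtime, tool API, or diagnostics.

```typescript
// Hypothetical stand-in for a tool call; the harness's real tool API is not shown in this PR.
async function readFile(path: string): Promise<string> {
  return `contents of ${path}`;
}

// Misuse: the IIFE's promise is neither awaited nor returned,
// so the enclosing body resolves to undefined.
async function unreturnedIife() {
  (async () => {
    return await readFile("notes.txt");
  })();
}

// Preferred: a top-level return hands the value back to the caller.
async function topLevelReturn() {
  return await readFile("notes.txt");
}

// Heuristic check (an assumption, not the PR's diagnostic): flag an async IIFE
// that begins a statement, i.e. one not preceded by `return` or `await`.
function hasUnreturnedAsyncIife(source: string): boolean {
  return /(^|[;\n{])\s*\(\s*async\s*(\([^)]*\)\s*=>|function\b)/.test(source);
}

// Hypothetical diagnostic messages for a finished run.
function diagnose(source: string, result: unknown): string[] {
  const notes: string[] = [];
  if (result === undefined) notes.push("missing return value");
  if (hasUnreturnedAsyncIife(source)) notes.push("unreturned async IIFE");
  return notes;
}
```

Awaiting `unreturnedIife()` yields `undefined`, while awaiting `topLevelReturn()` yields the string produced by `readFile`. A regex check like this is deliberately crude; a real diagnostic would more likely inspect a parsed syntax tree.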

Testing

  • swift test
  • swift run --package-path Tools/CodeModeEval codemode-eval run
  • Live codemode-eval llm runs for core --repeat 5 and failures --repeat 5
  • Baseline comparison and report generation passed for both suites

@zac merged commit 66c4fdf into main on Apr 30, 2026
2 checks passed
@zac deleted the eval-cli branch on April 30, 2026 at 04:58