
feat: add hermetic LongMemEval harness foundation #1024

Open
wbugitlab1 wants to merge 3 commits into main from issue/313-longmemeval-harness

@wbugitlab1 (Owner)

Summary

  • Add a hermetic benchmark/longmemeval/ foundation for issue #313 (feat(bench): LongMemEval-S harness with statistical rigor + CI gate), covering fixture data validation, six system definitions, manifest hashing/redaction, statistical utilities, markdown table rendering, and a local check entrypoint.
  • Add deterministic LongMemEval harness tests and bench:longmemeval:check.
  • Document that provider-backed reader/judge runs, real dataset download/submodule policy, historical QA baselines, and real CI benchmark gates remain approval-required future work.
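
For reviewers who want a concrete picture of the manifest hashing/redaction item, here is a minimal sketch of one way it could work. It is not the code in this PR. The `Manifest` shape, the `SECRET_KEY` pattern, the `[REDACTED]` placeholder, and the choice of SHA-256 over sorted-key JSON are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest shape; the real fields in this PR are not shown here.
type Manifest = Record<string, unknown>;

// Assumed rule: a leaf value is redacted when its key looks secret-bearing.
const SECRET_KEY = /(key|token|secret|password)/i;

// Returns a deep copy with object keys sorted and secret-looking leaves masked.
function redact(value: unknown, key = ""): unknown {
  if (Array.isArray(value)) {
    // Array elements inherit the key of the property that holds the array.
    return value.map((v) => redact(v, key));
  }
  if (value !== null && typeof value === "object") {
    const src = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    // Sorting keys makes the serialized form independent of insertion order.
    for (const k of Object.keys(src).sort()) {
      out[k] = redact(src[k], k);
    }
    return out;
  }
  return SECRET_KEY.test(key) ? "[REDACTED]" : value;
}

// Hashes the redacted, key-sorted JSON form of the manifest.
function hashManifest(manifest: Manifest): string {
  const canonical = JSON.stringify(redact(manifest));
  return createHash("sha256").update(canonical).digest("hex");
}
```

In this sketch the hash is computed after redaction, so two runs that differ only in a credential produce the same manifest hash, and the hash can be published without leaking anything about the secret. Whether the harness makes the same choice is not stated in this description.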

Refs #313

Verification

  • corepack pnpm exec vitest run test/longmemeval-harness.test.ts test/eval-adapters.test.ts test/quality-gates.test.ts passed: 3 files / 43 tests.
  • corepack pnpm run bench:longmemeval:check passed: { ok: true, fixtureRows: 3, systems: 6, smokeIds: 50 }.
  • corepack pnpm run lint passed.
  • corepack pnpm test passed after base merge: 212 files / 2920 tests. One earlier full-suite post-merge run hit a transient 2000ms timeout in test/codex-sdk-provider.test.ts; that file then passed when run in isolation, and the full rerun passed.
  • semgrep scan --config p/default --error --metrics=off . passed: 0 findings.
  • gitleaks protect --staged --redact passed: no leaks. It initially caught two synthetic redaction-test literals, which were changed to runtime-composed fakes before the passing run.
  • OSV was not run because this change does not alter dependencies, lockfiles, container images, vendored code, or third-party package surfaces.
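
To illustrate the "runtime-composed fakes" mentioned in the gitleaks item: the idea, as a rough sketch, is that a redaction test builds its secret-shaped input from fragments at runtime, so the source file never contains a complete secret-looking literal. The helper names, the `sk_` prefix, and the redaction pattern below are hypothetical and are not taken from this PR.

```typescript
// Hypothetical helper: joins fragments into a secret-shaped string at runtime,
// so no complete secret-looking literal appears in the source file.
function fakeSecret(prefix: string, body: string): string {
  return [prefix, body].join("_");
}

// Hypothetical redactor under test; the real harness rules may differ.
function redactSecrets(text: string): string {
  return text.replace(/sk_[A-Za-z0-9]{16,}/g, "[REDACTED]");
}

// The fake exists only in memory while the test runs.
const fake = fakeSecret("sk", "a".repeat(24));
const redactedLine = redactSecrets(`token=${fake}`);
```

The test can then assert that `redactedLine` no longer contains `fake`, which exercises the redactor without committing a string that a secret scanner would flag.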
