feat: add Inspect AI integration package (learning-commons-inspect-scorers) #102

Draft
adnanrhussain wants to merge 3 commits into ahussain/sdk-llm-protocol from ahussain/inspect-integration

@adnanrhussain

Summary

Adds integrations/inspect-python (learning-commons-inspect-scorers) — the Inspect AI integration for the LC evaluators SDK.

Stacked on #100 (LLMGeneratorProtocol). Merge #100 first, then retarget this to main.

This is split out from the combined integrations PR (#101) so the Inspect path — the only one with a real consumer (edu-panda-skill-harness) — can be validated and merged independently. The speculative observability integrations (Arize, Langfuse, Braintrust) remain on #101 to be revisited per-vendor once each is validated against a real account.

What's here

  • InspectModelAdapter — wraps Inspect's get_model() to satisfy LLMGeneratorProtocol, so the GLA evaluator runs through Inspect's own model system (no separate API keys).
  • gla_scorer() — an Inspect @scorer for grade-level appropriateness. Reads target_grade from sample metadata, scores completion or artifacts, returns CORRECT/INCORRECT with Score.unscored() for skip/error paths.
  • Entry-point registration so the scorer is discoverable via inspect score --scorer learning_commons_inspect_scorers/gla_scorer.
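The scorer's decision flow described above can be sketched in plain Python, without importing Inspect. This is a minimal illustration under stated assumptions, not the package's implementation: `parse_target_grade`, `pick_text`, `score_sample`, the 0–12 grade range, the one-grade tolerance band, the completion-before-artifacts order, and the `estimate_grade` callable (a stand-in for the GLA evaluator) are all hypothetical. Skip/error paths are modelled here as a plain `None` placeholder.

```python
from __future__ import annotations

from typing import Callable, Optional

# Mirrors the string values of Inspect's CORRECT / INCORRECT constants.
CORRECT, INCORRECT = "C", "I"


def parse_target_grade(metadata: dict) -> Optional[int]:
    """Return target_grade as an int, or None if missing/invalid.

    The accepted range 0-12 is an assumption made for this sketch.
    """
    try:
        grade = int(metadata.get("target_grade"))
    except (TypeError, ValueError):
        return None
    return grade if 0 <= grade <= 12 else None


def pick_text(completion: str, artifacts: list[str]) -> Optional[str]:
    """Prefer the completion; fall back to non-empty artifacts (assumed order)."""
    text = completion.strip()
    if not text:
        text = "\n".join(a.strip() for a in artifacts if a.strip())
    return text or None


def score_sample(
    metadata: dict,
    completion: str,
    artifacts: list[str],
    estimate_grade: Callable[[str], int],  # hypothetical stand-in for the GLA evaluator
    tolerance: int = 1,  # hypothetical band width
) -> Optional[str]:
    target = parse_target_grade(metadata)
    if target is None:
        return None  # skip: missing or invalid target_grade
    text = pick_text(completion, artifacts)
    if text is None:
        return None  # skip: nothing to score
    try:
        estimated = estimate_grade(text)
    except Exception:
        return None  # skip: evaluator call or parse failure
    return CORRECT if abs(estimated - target) <= tolerance else INCORRECT
```

For example, `score_sample({"target_grade": 5}, "Some text", [], lambda _: 5)` yields `"C"`, while a sample with no `target_grade` in its metadata yields `None`.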

Review fixes already applied

  • inspect-ai>=0.3.214: Score.unscored() (used in every skip/error path) was added in 0.3.214; the previous >=0.3.2 bound would raise AttributeError at runtime on older installs.
  • Import sorting / formatting cleaned (no CI was running ruff on integration packages — see CI note below).
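Why the version floor matters can be illustrated with a simplified comparison. This is only a sketch: `parse` and `satisfies` are hypothetical helpers that compare dotted integers, not the PEP 440 logic a real installer uses, and `0.3.100` is an arbitrary example version string.

```python
def parse(version: str) -> tuple[int, ...]:
    """Split a plain dotted version such as '0.3.214' into integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed: str, floor: str) -> bool:
    """True if `installed` meets a '>=floor' requirement (simplified)."""
    return parse(installed) >= parse(floor)


installed = "0.3.100"  # arbitrary example of an older install

# The old floor accepts this install even though it predates 0.3.214.
print(satisfies(installed, "0.3.2"))

# The tightened floor rejects it at install time instead of failing at runtime.
print(satisfies(installed, "0.3.214"))
```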

Test plan

  • 33 tests pass (unit: score routing, band logic, artifact reading; integration: eval() with mockllm/model)
  • Follow-up: wire integration-package CI (ruff + pytest) — triggered only when the Python SDK version is bumped or integrations/inspect-python/** changes.

🤖 Generated with Claude Code

…orers)

Adds integrations/inspect-python with InspectModelAdapter (wraps Inspect's
get_model() to satisfy LLMGeneratorProtocol) and gla_scorer() — an Inspect
scorer for grade-level appropriateness, wired to the GLA evaluator via the
injected-provider protocol from the SDK.

- inspect-ai>=0.3.214 (Score.unscored() lower bound)
- registered for independent release via release-please

Split out from the combined integrations PR so the Inspect path, which has a
real consumer (edu-panda-skill-harness), can be validated and merged on its own.

test-inspect-python.yml: runs ruff + pytest across Python 3.10–3.13. Triggered
only when integrations/inspect-python/** or sdks/python/** changes (an SDK
change, including a version bump, can break the integration). Installs the
in-repo SDK from source first because the integration needs LLMGeneratorProtocol
(SDK 0.3.0), which is not yet on PyPI.

publish-inspect-python.yml: builds + publishes to PyPI on
integrations-inspect-python-v* release tags, mirroring publish-python-sdk.yml.
Header documents the two pre-publish steps: tighten the SDK floor to >=0.3.0 and
remove release-as after 0.1.0 ships.

Score.unscored() records a NaN value. Custom report renderers (e.g. the
edu-panda-skill-harness eval report) that normalize scores via isinstance(v,
float) treat that NaN as a real 0–1 score and average it into the mean,
poisoning the whole scorer column to NaN. Returning None omits the sample from
this scorer's results entirely — handled cleanly by every Inspect metric and by
naive renderers — and matches the skip convention used by the harness's other
scorers (and rubric_judge). None is a fully-supported Scorer return per the
Scorer protocol (-> Score | None).

Applies to all three skip/error paths: missing/invalid target_grade, no text,
and transient API/parse errors. Tests updated to assert None.
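The NaN-poisoning effect described in this commit message can be reproduced with plain Python. This is a minimal sketch: `naive_mean` is a hypothetical stand-in for a renderer that filters with isinstance(v, float), not the harness's actual code.

```python
import math
from typing import Optional, Sequence


def naive_mean(values: Sequence[Optional[float]]) -> float:
    """Average everything that is a float, as a naive renderer might."""
    scores = [v for v in values if isinstance(v, float)]
    return sum(scores) / len(scores)


# Two scored samples plus one skipped sample, encoded two different ways.
with_nan = [1.0, 0.0, float("nan")]  # skip recorded as NaN
with_none = [1.0, 0.0, None]         # skip recorded as None

# NaN is a float, so it passes the isinstance filter and poisons the sum.
print(math.isnan(naive_mean(with_nan)))  # True

# None is filtered out, so only the two real scores are averaged.
print(naive_mean(with_none))  # 0.5
```

A renderer could also guard with math.isnan, but returning None means consumers need no NaN-specific handling.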