feat: add Inspect AI integration package (learning-commons-inspect-scorers) #102
Draft
adnanrhussain wants to merge 3 commits into
…orers)

Adds integrations/inspect-python with InspectModelAdapter (wraps Inspect's get_model() to satisfy LLMGeneratorProtocol) and gla_scorer(), an Inspect scorer for grade-level appropriateness, wired to the GLA evaluator via the injected-provider protocol from the SDK.

- inspect-ai>=0.3.214 (Score.unscored() lower bound)
- registered for independent release via release-please

Split out from the combined integrations PR so the Inspect path, which has a real consumer (edu-panda-skill-harness), can be validated and merged on its own.
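The adapter idea described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `TextGenerator` is a hypothetical stand-in for the SDK's LLMGeneratorProtocol (whose real method names and signatures are not shown here), and `FakeModel`/`FakeOutput` are hypothetical stubs standing in for the object returned by Inspect's get_model(), assuming only that it exposes an async `generate()` whose result has a `.completion` string.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Protocol


class TextGenerator(Protocol):
    """Hypothetical stand-in for the SDK's LLMGeneratorProtocol."""

    async def generate(self, prompt: str) -> str: ...


class ModelAdapterSketch:
    """Wraps an Inspect-style model so it satisfies TextGenerator."""

    def __init__(self, model: Any) -> None:
        # In a real run this would be the object returned by get_model().
        self._model = model

    async def generate(self, prompt: str) -> str:
        # Delegate to the wrapped model and unwrap the text completion.
        output = await self._model.generate(prompt)
        return output.completion


@dataclass
class FakeOutput:
    """Hypothetical stub mimicking a model output with a .completion field."""

    completion: str


class FakeModel:
    """Hypothetical stub used only so the sketch runs without Inspect."""

    async def generate(self, input: str) -> FakeOutput:
        return FakeOutput(completion=f"echo: {input}")


adapter: TextGenerator = ModelAdapterSketch(FakeModel())
result = asyncio.run(adapter.generate("Rate this passage."))
print(result)
```

Because the evaluator only sees the protocol, the model choice and credentials stay with whatever the wrapped model was configured with.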
test-inspect-python.yml: runs ruff + pytest across Python 3.10–3.13. Triggered only when integrations/inspect-python/** or sdks/python/** changes (an SDK change, including a version bump, can break the integration). Installs the in-repo SDK from source first because the integration needs LLMGeneratorProtocol (SDK 0.3.0), which is not yet on PyPI.

publish-inspect-python.yml: builds + publishes to PyPI on integrations-inspect-python-v* release tags, mirroring publish-python-sdk.yml. Header documents the two pre-publish steps: tighten the SDK floor to >=0.3.0 and remove release-as after 0.1.0 ships.
Score.unscored() records a NaN value. Custom report renderers (e.g. the edu-panda-skill-harness eval report) that normalize scores via isinstance(v, float) treat that NaN as a real 0–1 score and average it into the mean, poisoning the whole scorer column to NaN.

Returning None omits the sample from this scorer's results entirely, which every Inspect metric and naive renderers handle cleanly, and matches the skip convention used by the harness's other scorers (and rubric_judge). None is a fully supported Scorer return per the Scorer protocol (-> Score | None).

Applies to all three skip/error paths: missing/invalid target_grade, no text, and transient API/parse errors. Tests updated to assert None.
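The NaN problem can be reproduced in a few lines of plain Python. The sketch below is illustrative only: `naive_mean` is a hypothetical renderer helper (not code from the harness) that filters with isinstance(v, float) as described above, and a `None` entry stands in for a sample that the scorer omitted.

```python
import math


def naive_mean(values):
    """Average every float, the way a naive report renderer might."""
    floats = [v for v in values if isinstance(v, float)]
    return sum(floats) / len(floats) if floats else None


# One skipped sample recorded as NaN: it passes the isinstance check.
with_nan = [1.0, 0.0, math.nan, 1.0]
# The same skipped sample omitted (represented here as None).
with_none = [1.0, 0.0, None, 1.0]

nan_mean = naive_mean(with_nan)
none_mean = naive_mean(with_none)

print(math.isnan(nan_mean))  # the NaN propagates into the mean
print(none_mean)             # only the three real scores are averaged
```

NaN is a float, so a type check alone cannot tell it apart from a real score; a renderer would need an explicit math.isnan() guard to cope with it.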
Summary
Adds integrations/inspect-python (learning-commons-inspect-scorers), the Inspect AI integration for the LC evaluators SDK.

This is split out from the combined integrations PR (#101) so the Inspect path, the only one with a real consumer (edu-panda-skill-harness), can be validated and merged independently. The speculative observability integrations (Arize, Langfuse, Braintrust) remain on #101 to be revisited per-vendor once each is validated against a real account.

What's here
- InspectModelAdapter: wraps Inspect's get_model() to satisfy LLMGeneratorProtocol, so the GLA evaluator runs through Inspect's own model system (no separate API keys).
- gla_scorer(): an Inspect @scorer for grade-level appropriateness. Reads target_grade from sample metadata, scores completion or artifacts, returns CORRECT/INCORRECT with Score.unscored() for skip/error paths.
- inspect score --scorer learning_commons_inspect_scorers/gla_scorer.

Review fixes already applied
- inspect-ai>=0.3.214: Score.unscored() (used in every skip/error path) was added in 0.3.214; the previous >=0.3.2 bound would raise AttributeError at runtime on older installs.

Test plan
- eval() with mockllm/model)
- integrations/inspect-python/** changes.

🤖 Generated with Claude Code