feat: add Arize, Langfuse, and Braintrust integration packages by adnanrhussain · Pull Request #101 · learning-commons-org/evaluators

adnanrhussain · 2026-06-11T22:46:50Z

Summary

Adds three speculative observability integration packages under integrations/:

- arize-python → learning-commons-arize-scorers: PhoenixTracingAdapter (OTel decorator)
- langfuse-python → learning-commons-langfuse-scorers: LangfuseTracingAdapter (Langfuse v2)
- braintrust-python → learning-commons-braintrust-scorers: BraintrustAnthropicAdapter + BraintrustProxyAdapter

Stacked on #100 (LLMGeneratorProtocol).

Status: parked pending per-vendor validation

The Inspect integration was split out into #100's child inspect PR because it has a real consumer and can be validated now. These three have no current consumer and carry validation risk that doc-reading can't resolve (real vendor accounts needed). Plan is to revisit and likely split this PR further by vendor.

High-confidence fix already applied: Langfuse generation.end() now uses usage_details (the deprecated usage kwarg is silently dropped in recent 2.x).
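
To illustrate the Langfuse fix, here is a minimal sketch of ending a generation with token counts passed as `usage_details`. `FakeGeneration` is a hypothetical stand-in that only records the keyword arguments it receives, so the example runs without a Langfuse account. The helper name `end_generation` and the `input`/`output`/`total` key names are assumptions, not code from this PR.

```python
from typing import Any, Dict


class FakeGeneration:
    """Hypothetical stand-in for a Langfuse generation; records end() kwargs."""

    def __init__(self) -> None:
        self.ended_with: Dict[str, Any] = {}

    def end(self, **kwargs: Any) -> None:
        self.ended_with = kwargs


def end_generation(generation: Any, output: str, input_tokens: int, output_tokens: int) -> None:
    """End a generation, reporting token counts via usage_details, not usage."""
    generation.end(
        output=output,
        usage_details={
            "input": input_tokens,
            "output": output_tokens,
            "total": input_tokens + output_tokens,
        },
    )


gen = FakeGeneration()
end_generation(gen, output="Grade 5", input_tokens=120, output_tokens=8)
print(gen.ended_with["usage_details"])
```

A unit test in this style can assert that the deprecated `usage` key is never sent, which is the regression the fix guards against.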

Deferred to per-vendor revisit (need real-account validation):

- Arize: span name llm.generate → chat; missing llm.system attribute
- Braintrust: invalid default model ID, proxy base_url trailing slash, init() → init_logger(), braintrust>=0.0.100 bound too low
- Publish workflows + release-as cleanup for all three

🤖 Generated with Claude Code

Copilot

Pull request overview

Adds four new Python integration packages under integrations/ that provide LLMGeneratorProtocol-compatible adapters and/or scorer wrappers for external observability/eval platforms (Inspect AI, Arize/Phoenix via OTel, Langfuse, Braintrust), plus release-please configuration to version them independently.

Changes:

- Introduce new adapter packages: learning-commons-inspect-scorers, learning-commons-arize-scorers, learning-commons-langfuse-scorers, learning-commons-braintrust-scorers.
- Add unit/integration tests for each adapter/scorer package.
- Register the new integration packages in release-please config + manifest for independent releases.

Reviewed changes

Copilot reviewed 29 out of 33 changed files in this pull request and generated 8 comments.


| File | Description |
|---|---|
| release-please-config.json | Adds release-please package entries for the four new integration packages. |
| .release-please-manifest.json | Adds initial versions (`0.1.0`) for the four new integration packages. |
| integrations/langfuse-python/tests/test_adapter.py | Tests LangfuseTracingAdapter generation lifecycle + error handling + flush. |
| integrations/langfuse-python/src/learning_commons_langfuse_scorers/py.typed | Marks package as typed. |
| integrations/langfuse-python/src/learning_commons_langfuse_scorers/adapter.py | Implements LangfuseTracingAdapter decorator over `generate()`. |
| integrations/langfuse-python/src/learning_commons_langfuse_scorers/\_\_init\_\_.py | Exposes LangfuseTracingAdapter in public package API. |
| integrations/langfuse-python/pyproject.toml | Package metadata and dependencies for Langfuse integration. |
| integrations/langfuse-python/CHANGELOG.md | Initializes changelog for the package. |
| integrations/langfuse-python/.gitignore | Build/test cache ignores for the package. |
| integrations/inspect-python/tests/test_gla_scorer.py | Tests Inspect scorer wrapper behavior and eval() wiring. |
| integrations/inspect-python/src/learning_commons_inspect_scorers/py.typed | Marks package as typed. |
| integrations/inspect-python/src/learning_commons_inspect_scorers/gla.py | Implements `gla_scorer()` Inspect scorer wrapper around LC GLA evaluator. |
| integrations/inspect-python/src/learning_commons_inspect_scorers/adapter.py | Implements InspectModelAdapter that adapts Inspect `get_model()` to `generate()`. |
| integrations/inspect-python/src/learning_commons_inspect_scorers/_registry.py | Registers scorers via Inspect entry point import side-effect. |
| integrations/inspect-python/src/learning_commons_inspect_scorers/\_\_init\_\_.py | Exposes InspectModelAdapter and `gla_scorer()` publicly. |
| integrations/inspect-python/README.md | Adds package documentation and usage examples. |
| integrations/inspect-python/pyproject.toml | Package metadata, deps, and Inspect entry point registration. |
| integrations/inspect-python/CHANGELOG.md | Initializes changelog for the package. |
| integrations/inspect-python/.gitignore | Build/test cache ignores for the package. |
| integrations/braintrust-python/tests/test_adapter.py | Tests BraintrustAnthropicAdapter and BraintrustProxyAdapter behavior. |
| integrations/braintrust-python/src/learning_commons_braintrust_scorers/py.typed | Marks package as typed. |
| integrations/braintrust-python/src/learning_commons_braintrust_scorers/adapter.py | Implements Braintrust Anthropic + proxy adapters with shared base. |
| integrations/braintrust-python/src/learning_commons_braintrust_scorers/\_\_init\_\_.py | Exposes Braintrust adapters publicly. |
| integrations/braintrust-python/pyproject.toml | Package metadata and dependencies for Braintrust integration. |
| integrations/braintrust-python/CHANGELOG.md | Initializes changelog with initial release entry. |
| integrations/braintrust-python/.gitignore | Build/test cache ignores for the package. |
| integrations/arize-python/tests/test_adapter.py | Tests OTel span emission behavior for PhoenixTracingAdapter. |
| integrations/arize-python/src/learning_commons_arize_scorers/py.typed | Marks package as typed. |
| integrations/arize-python/src/learning_commons_arize_scorers/adapter.py | Implements PhoenixTracingAdapter decorator emitting OpenInference/GenAI attrs. |
| integrations/arize-python/src/learning_commons_arize_scorers/\_\_init\_\_.py | Exposes PhoenixTracingAdapter publicly. |
| integrations/arize-python/pyproject.toml | Package metadata and dependencies for Arize/Phoenix integration. |
| integrations/arize-python/CHANGELOG.md | Initializes changelog for the package. |
| integrations/arize-python/.gitignore | Build/test cache ignores for the package. |


+```python
+from inspect_ai import Task, task
+from inspect_ai.dataset import csv_dataset, FieldSpec
+from inspect_ai.solver import generate
+from learning_commons_inspect_scorers import gla_scorer
+from learning_commons_evaluators.config import create_config_no_telemetry
+from learning_commons_evaluators.schemas.config import GoogleLLMProviderConfig
+
+config = create_config_no_telemetry(
+    google_llm_provider_config=GoogleLLMProviderConfig(api_key="your-key"),
+)
+
+@task
+def my_eval():
+    return Task(
+        dataset=csv_dataset("samples.csv"),  # requires target_grade column
+        solver=[generate()],
+        scorer=gla_scorer(config=config),
+    )
+```


+```python
+scorer=gla_scorer(config=config, text_source="artifacts")
+```


+| Parameter | Default | Description |
+|---|---|---|
+| `config` | env vars | `EvaluatorConfig`. If `None`, reads `GOOGLE_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY` from the environment. |
+| `text_source` | `"completion"` | `"completion"` scores `state.output.completion`; `"artifacts"` joins `state.metadata["artifacts"]` file contents. |
+| `target_grade_key` | `"target_grade"` | Metadata key holding the expected grade band. |
+| `allow_adjacent` | `True` | If `True`, the one grade band above or below the target also passes. |


+name = "learning-commons-langfuse-scorers"
+version = "0.1.0"
+description = "Langfuse tracing adapter for Learning Commons evaluators"
+readme = "README.md"


+name = "learning-commons-arize-scorers"
+version = "0.1.0"
+description = "Arize/Phoenix OTel tracing adapter for Learning Commons evaluators"
+readme = "README.md"


+name = "learning-commons-braintrust-scorers"
+version = "0.1.0"
+description = "Braintrust adapter for Learning Commons evaluators"
+readme = "README.md"


+        with (
+            patch("braintrust.auto_instrument", mock_braintrust.auto_instrument),
+            patch.dict("sys.modules", {"braintrust": mock_braintrust}),
+            patch("anthropic.AsyncAnthropic", return_value=mock_client),
+        ):


…kages

Adds four packages under integrations/ that each implement LLMGeneratorProtocol (introduced in the SDK PR) for their respective platform:

integrations/inspect-python → learning-commons-inspect-scorers
- InspectModelAdapter wraps Inspect's get_model() so the GLA scorer uses Inspect's model system rather than LangChain directly
- gla_scorer() Inspect scorer for grade-level appropriateness evaluation

integrations/arize-python → learning-commons-arize-scorers
- PhoenixTracingAdapter: OTel decorator that emits OpenInference llm.* and gen_ai.* spans; capture_message_content=False by default (K-12 privacy)

integrations/langfuse-python → learning-commons-langfuse-scorers
- LangfuseTracingAdapter: decorator that records Langfuse v2 generations; pinned to langfuse<3.0.0 pending migration to OTel-based v3 API

integrations/braintrust-python → learning-commons-braintrust-scorers
- BraintrustAnthropicAdapter: uses auto_instrument() for transparent tracing
- BraintrustProxyAdapter: routes calls through Braintrust AI Proxy with no Braintrust SDK dependency

All adapters implement LLMGeneratorProtocol structurally (typing.Protocol) and are composable as decorators:

PhoenixTracingAdapter(LangfuseTracingAdapter(InspectModelAdapter("...")))

Also updates release-please-config.json and manifest to track all four packages.
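
The decorator composition described in this commit message can be sketched with a structural `typing.Protocol`. Everything below is hypothetical: the synchronous `generate(prompt) -> str` signature, the `LLMGenerator`, `EchoModel`, and `TracingAdapter` names, and the dict-based "spans" are assumptions for illustration, not the SDK's actual protocol or any vendor's API.

```python
from typing import Dict, List, Protocol


class LLMGenerator(Protocol):
    """Hypothetical protocol shape: anything with a matching generate()."""

    def generate(self, prompt: str) -> str: ...


class EchoModel:
    """Innermost generator; stands in for a real model adapter."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


class TracingAdapter:
    """Decorator that records one span-like dict per call and delegates to inner."""

    def __init__(self, inner: LLMGenerator, name: str, spans: List[Dict[str, object]],
                 capture_message_content: bool = False) -> None:
        self._inner = inner
        self._name = name
        self._spans = spans
        self._capture = capture_message_content

    def generate(self, prompt: str) -> str:
        result = self._inner.generate(prompt)
        span: Dict[str, object] = {"adapter": self._name, "prompt_chars": len(prompt)}
        if self._capture:
            # Message text is recorded only when the caller opts in.
            span["prompt"] = prompt
            span["completion"] = result
        self._spans.append(span)
        return result


spans: List[Dict[str, object]] = []
# No inheritance is needed: each class satisfies the protocol by shape alone.
stack = TracingAdapter(TracingAdapter(EchoModel(), "inner", spans), "outer", spans)
answer = stack.generate("hello")
print(answer, [s["adapter"] for s in spans])
```

Because each wrapper exposes the same `generate()` it consumes, wrappers can be nested in any order, and keeping content capture off by default means prompt text never reaches a span unless explicitly enabled.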

- Remove unused OutputValidationError import from gla.py (F401 lint)
- Fix README examples: gla_scorer() takes grader_model not config
- Add README.md for arize-python, langfuse-python, braintrust-python
- Fix braintrust test: remove patch('braintrust.auto_instrument') which triggered ModuleNotFoundError before sys.modules mock was applied
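
The Braintrust test fix comes down to how `unittest.mock.patch` resolves its target: entering `patch("pkg.attr")` imports `pkg` first, so it fails when the package is not installed and the `sys.modules` mock has not been applied yet. The sketch below reproduces that with a hypothetical module name, `fake_vendor_sdk`, standing in for the vendor SDK; it is an illustration, not the PR's test code.

```python
import importlib
from unittest.mock import MagicMock, patch

# Hypothetical module name, chosen so it is certainly not installed.
MODULE = "fake_vendor_sdk"
mock_sdk = MagicMock()

# Entering patch("<module>.<attr>") imports <module> first, so on its own it fails.
try:
    with patch(f"{MODULE}.auto_instrument"):
        patched_without_module = True
except ImportError:  # ModuleNotFoundError is a subclass of ImportError
    patched_without_module = False

# Registering the mock in sys.modules is enough: attribute access on a
# MagicMock already yields a mock, so no separate attribute patch is needed.
with patch.dict("sys.modules", {MODULE: mock_sdk}):
    sdk = importlib.import_module(MODULE)
    sdk.auto_instrument()

mock_sdk.auto_instrument.assert_called_once()
print(patched_without_module)
```

Context managers in a single `with` statement are entered in order, so listing the attribute patch before `patch.dict` hits the import error before the mock module is ever registered.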

…use/braintrust

The Inspect integration moves to ahussain/inspect-integration (its own PR) since it has a real consumer and can be validated/merged independently. This branch now carries only the speculative observability integrations (arize, langfuse, braintrust) to be revisited, likely split further by vendor, once each is validated against a real account.

Also applies the high-confidence Langfuse fix: generation.end() now uses the current usage_details kwarg instead of the deprecated usage kwarg (silently dropped in recent 2.x, losing token counts in the UI).

Deferred to the per-vendor revisit (need real-account validation, not doc-reading): arize span-name, braintrust invalid model ID / proxy URL / init_logger / dep bound.

adnanrhussain marked this pull request as draft June 11, 2026 22:57

adnanrhussain requested a review from Copilot June 11, 2026 22:58

Copilot started reviewing on behalf of adnanrhussain June 11, 2026 22:58

Copilot AI reviewed Jun 11, 2026


adnanrhussain force-pushed the ahussain/sdk-llm-protocol branch from d4f80a9 to ca64de2 Compare June 12, 2026 04:42

adnanrhussain force-pushed the ahussain/eval-integrations-packages branch from d58e77a to 4cc5e41 Compare June 12, 2026 05:04

adnanrhussain added 3 commits June 11, 2026 22:12

style: remove redundant inline comments, keep non-obvious WHY comments

e4ba334

adnanrhussain force-pushed the ahussain/eval-integrations-packages branch from cac8e13 to e4ba334 Compare June 12, 2026 05:13

adnanrhussain mentioned this pull request Jun 12, 2026

feat: add Inspect AI integration package (learning-commons-inspect-scorers) #102

Draft

2 tasks

adnanrhussain changed the title ~~feat: add Inspect AI, Arize, Langfuse, and Braintrust integration packages~~ feat: add Arize, Langfuse, and Braintrust integration packages Jun 12, 2026

feat: add Arize, Langfuse, and Braintrust integration packages #101
adnanrhussain wants to merge 4 commits into ahussain/sdk-llm-protocol from ahussain/eval-integrations-packages

2 participants
