Adding Evals for date extraction by rajeee · Pull Request #450 · NatLabRockies/COMPASS

rajeee · 2026-05-27T17:29:29Z

Adding evals suite. Starting with date_extraction but the idea is to support other pieces of pipeline including whole tech.

codecov-commenter · 2026-05-27T17:32:05Z

Codecov Report

❌ Patch coverage is 45.45455% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.14%. Comparing base (eab9df2) to head (74b66c6).
⚠️ Report is 12 commits behind head on main.

Files with missing lines	Patch %	Lines
compass/scripts/process.py	14.28%	6 Missing ⚠️

❌ Your patch status has failed because the patch coverage (45.45%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #450      +/-   ##
==========================================
+ Coverage   56.11%   56.14%   +0.02%     
==========================================
  Files          63       63              
  Lines        6080     6086       +6     
  Branches      591      591              
==========================================
+ Hits         3412     3417       +5     
- Misses       2598     2599       +1     
  Partials       70       70

Flag	Coverage Δ
unittests	`56.14% <45.45%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

Adds an evals/ test suite intended to run live, billable LLM-based quality evaluations (starting with date extraction) alongside the existing unit/integration test structure, with baseline artifacts committed for regression gating.

Changes:

Introduces an eval harness (evals/utilities/*) to compute per-feature metrics + write breakdown/metrics artifacts, plus a first suite for extract_date.
Updates test configuration to deselect evals by default and adds a Pixi task to run eval suites explicitly.
Extends cost utilities/tests to compute aggregate prompt/response token totals alongside cost.

Reviewed changes

Copilot reviewed 24 out of 90 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/python/unit/utilities/test_utilities_costs.py	Adds unit tests for new token/cost aggregation helpers.
compass/utilities/costs.py	Adds helpers to sum tokens and return `{cost, prompt_tokens, response_tokens}` from usage totals.
compass/utilities/init.py	Re-exports the new cost/token aggregation helpers.
pyproject.toml	Adds `statsmodels` to test extras; configures pytest to deselect `evals` by default; registers the `evals` marker and includes `evals/` in `testpaths`.
pixi.toml	Adds a `pixi run evals <name>` task; adds `git-lfs` and `statsmodels` to test env; adds `openpyxl` to dev env.
pixi.lock	Locks new/updated dependencies (incl. `statsmodels`, `git-lfs`, `openpyxl` and transitive deps).
evals/utilities/base.py	Defines shared `Result` schema, SUCCESS/FAILURE constants, and document loading helpers for eval suites.
evals/utilities/metrics.py	Adds metric computation (accuracy/precision/recall/F1 + Wilson CIs) for eval results.
evals/utilities/reports.py	Adds report writer (CSV/JSON), baseline comparison helpers, and terminal summary output.
evals/utilities/init.py	Introduces the eval utilities package.
evals/conftest.py	Registers the `--held-out` flag for eval suites.
evals/test_run_date_extraction_evals.py	Implements the date extraction eval suite (dev vs held-out dataset selection, result capture, regression gate).
evals/results/dev/date_extraction_evals.json	Adds committed dev baseline metrics snapshot for gating.
evals/results/dev/date_extraction_evals_breakdown.csv	Adds committed dev baseline per-case breakdown for regression detection.
evals/results/held_out/date_extraction_evals.json	Adds committed held-out baseline metrics snapshot (no breakdown committed).
evals/README.md	Documents how to run evals, the dev vs held-out policy, and the eval suite structure.
evals/data/dev/solar/manifest.json5	Adds dev dataset manifest for date extraction evals.
evals/data/dev/solar/Town_of_Pendleton_Niagara_County_New_York.txt	Adds dev dataset document via Git LFS pointer.
evals/data/dev/solar/Town_of_Livonia_Livingston_County_New_York.txt	Adds dev dataset document via Git LFS pointer.
evals/data/dev/solar/Okanogan_County_Washington.txt	Adds dev dataset document via Git LFS pointer.
evals/data/dev/solar/Medina_County_Ohio.txt	Adds dev dataset document via Git LFS pointer.
evals/data/held-out/solar/manifest.json5	Adds held-out dataset manifest for date extraction evals.
evals/data/held-out/solar/Town_of_Walton_Delaware_County_New_York.txt	Adds held-out dataset document via Git LFS pointer.
evals/data/held-out/solar/Town_of_Niagara_Niagara_County_New_York.txt	Adds held-out dataset document via Git LFS pointer.
evals/data/held-out/solar/City_of_Agawam_Town_Hampden_County_Massachusetts.txt	Adds held-out dataset document via Git LFS pointer.
.gitattributes	Configures Git LFS tracking for `evals/data/*/.pdf` and `evals/data/*/.txt`.

…docstring

…format fixes

…docstrings

…source of truth

ppinchuk

Very cool! I am excited to discuss this with you and Gui tomorrow.

A few requests here and there that I hope are not too arduous. Feel free to let me kmow if anything seems out of scope

ppinchuk · 2026-05-27T23:25:09Z

+    "total_prompt_tokens": 32185,
+    "total_response_tokens": 4358,
+    "total_time_taken_s": 142.48,
+    "total_cost_usd": 0.0729


Only 7 cents to run the entire thing? How is it so cheap?

I see that the token count is super low so maybe this number is believable, but it might be worth double checking. Also, we should add the LLM used to this dictionary. In fact, maybe more metadata about the run could be useful here. I'm thinking date of run, COMPASS version, etc.

Probably because most of the parsing is happening via url parsing - I am exploring adding a "extraction_method" column so we can surface how the data was extracted.

Regarding adding more metadata - I agree - that would be good to do. I have some questions about the level of details - especially for methods that might use multiple models. Let's discuss.

ppinchuk · 2026-05-27T23:36:03Z

+def load_doc(fp, *, source=None):
+    """Load a local file as an elm Document, dispatching on suffix"""
+    fp = Path(fp)
+    attrs = {"source": source} if source else {}
+    if fp.suffix.casefold() == ".pdf":
+        pages = read_pdf(fp.read_bytes(), verbose=False)
+        return PDFDocument(pages, attrs=attrs)
+    text = fp.read_text(encoding="utf-8", errors="ignore")
+    return HTMLDocument([text], attrs=attrs)


IDK if it's within the scope of this PR but there are already existing utilities in this repo to read local files (specifically this:

COMPASS/compass/web/file_loader.py

Line 325 in bcd0be9

COMPASSLocalFileLoader = AsyncLocalFileLoader

)

We should try to re-use as much as possible in order to avoid re-writing code and having to maintain it in multiple places

Good suggestion. I switched to using production loader that uncovered hidden issue - the load_doc was returning empty doc if the PDF needs OCR.

ppinchuk · 2026-05-27T23:58:59Z

+def pytest_generate_tests(metafunc):
+    """Parametrize ``case`` with the dataset chosen by ``--held-out``
+
+    Each case gets its resolved document path stamped on as ``case["fp"]``
+    so the test body doesn't need to know which dataset it came from.
+    """
+    if "case" not in metafunc.fixturenames:
+        return
+    dataset_dir = (
+        _HELD_OUT_DATASET_DIR
+        if metafunc.config.getoption("--held-out")
+        else _DEV_DATASET_DIR
+    )
+    cases = load_config(dataset_dir / "manifest.json5")
+    for c in cases:
+        c["fp"] = dataset_dir / c["file"]
+    metafunc.parametrize("case", cases, ids=[c["file"] for c in cases])


It's cool to see how this can be done bespoke, but I wonder if you considered using pytest-cases, which is seemingly designed to do exactly this kind of thing in a more readable way? We already have it as a dependency in our test suite, so it's just a matter of importing it and wiring it up.

Didn't know about pytest-cases - looks like an awesome library. For this particular instance though, don't think it will help. The main issue here is that we need to load a file to create the test cases - which means the file loading needs to happen before test cases are generated (either natively by pytest, or through pytest-cases). pytest_generate_tests is one of the available standard hook for doing so. I did simplify this to be strictly about case generation and put the case modification logic to a separate fixture that runs during the test.

ppinchuk · 2026-05-28T00:00:33Z

+
+@pytest.fixture(scope="module")
+def _model_config():
+    model = "compassop-gpt-5.4"


This should probably be read from an environment var so that users don't have to re-hard code it? I think we even have some designated environment variables for LLM model names in our tests, so it may be worth re-using those for consistency

Hmm, this got me thinking. It seems like evals run should probably take in model config argument allowing defining different models for different tasks? With your work on splitting document extraction and document parsing as separate entry point - maybe the whole eval suite should probably just use the compass cli to do the extraction - making use of the full config.json5. Let's talk more.

castelao · 2026-05-28T13:51:12Z

What about managing those PDFs with pooch? Let's discuss that in the meeting later today.

… build_local_file_loader_kwargs helper

rajeee added 4 commits May 27, 2026 12:20

Add token + cost rollup helpers in compass.utilities.costs

c48dc02

Set up pytest deps and Git LFS infra for evals

e7d02f1

Add date-extraction eval suite under evals/ (PDFs via LFS)

53fc579

Wire up evals in pytest: testpaths, evals marker, deselect by default

c24a3b7

rajeee mentioned this pull request May 27, 2026

Improve date extraction #432

Closed

Add unit tests for token/cost rollup helpers in compass.utilities.costs

c44bd34

rajeee changed the title ~~Date parsing clean~~ Adding Evals for date parsing May 27, 2026

rajeee changed the title ~~Adding Evals for date parsing~~ Adding Evals for date extraction May 27, 2026

rajeee marked this pull request as ready for review May 27, 2026 17:54

rajeee requested review from castelao and ppinchuk as code owners May 27, 2026 17:54

Copilot AI review requested due to automatic review settings May 27, 2026 17:54

Copilot started reviewing on behalf of rajeee May 27, 2026 17:55 View session

rajeee marked this pull request as draft May 27, 2026 17:55

rajeee added 3 commits May 27, 2026 12:56

pixi evals task: stream per-case INFO logs live (--log-cli-level=INFO)

2e7ee27

pixi evals task: pin to ptest env (no more env prompt)

9f84725

Document pixi -- separator for passing flags to pytest (e.g. --held-out)

313033e

Copilot AI reviewed May 27, 2026

View reviewed changes

Comment thread compass/utilities/costs.py

Comment thread evals/utilities/base.py

Comment thread pixi.toml

rajeee added 3 commits May 27, 2026 14:33

Refresh date_extraction eval baselines

e6e86c6

Address Copilot review: drop stray openpyxl pixi dep, trim redundant …

4b4485b

…docstring

Drop redundant usage comment for evals pixi task (lives in README)

880e33e

rajeee marked this pull request as ready for review May 27, 2026 19:57

rajeee added 3 commits May 27, 2026 15:15

Lint-compliance for evals/: ruff per-file-ignores, package __init__, …

ec597fe

…format fixes

Split _read_breakdown_csv from _get_regressed_jurisdictions; tighten …

fcdddcc

…docstrings

Tighten evals report contract: narrative order, shorter keys, single …

1d6aab3

…source of truth

ppinchuk requested changes May 28, 2026

View reviewed changes

rajeee added 2 commits May 28, 2026 08:53

Split case parametrization via indirect fixture; trim docstring

7b58ec0

Reuse production file loader (with OCR) for eval doc loading; extract…

96682db

… build_local_file_loader_kwargs helper

rajeee added 3 commits May 28, 2026 11:06

Add per-case eval logging via COMPASS LogListener + LocationFileLog

8c6f981

Import statsmodels at top of metrics; drop evals from default testpaths

b146ab6

Use pandas for breakdown CSV read/write (Paul review)

74b66c6

Conversation

rajeee commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ppinchuk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ppinchuk May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

castelao commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

rajeee commented May 27, 2026 •

edited

Loading

codecov-commenter commented May 27, 2026 •

edited

Loading

ppinchuk May 27, 2026 •

edited

Loading