Adding Evals for date extraction#450
Conversation
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (45.45%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #450 +/- ##
==========================================
+ Coverage 56.11% 56.14% +0.02%
==========================================
Files 63 63
Lines 6080 6086 +6
Branches 591 591
==========================================
+ Hits 3412 3417 +5
- Misses 2598 2599 +1
Partials 70 70
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds an evals/ test suite intended to run live, billable LLM-based quality evaluations (starting with date extraction) alongside the existing unit/integration test structure, with baseline artifacts committed for regression gating.
Changes:
- Introduces an eval harness (
evals/utilities/*) to compute per-feature metrics + write breakdown/metrics artifacts, plus a first suite forextract_date. - Updates test configuration to deselect evals by default and adds a Pixi task to run eval suites explicitly.
- Extends cost utilities/tests to compute aggregate prompt/response token totals alongside cost.
Reviewed changes
Copilot reviewed 24 out of 90 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/python/unit/utilities/test_utilities_costs.py | Adds unit tests for new token/cost aggregation helpers. |
| compass/utilities/costs.py | Adds helpers to sum tokens and return {cost, prompt_tokens, response_tokens} from usage totals. |
| compass/utilities/init.py | Re-exports the new cost/token aggregation helpers. |
| pyproject.toml | Adds statsmodels to test extras; configures pytest to deselect evals by default; registers the evals marker and includes evals/ in testpaths. |
| pixi.toml | Adds a pixi run evals <name> task; adds git-lfs and statsmodels to test env; adds openpyxl to dev env. |
| pixi.lock | Locks new/updated dependencies (incl. statsmodels, git-lfs, openpyxl and transitive deps). |
| evals/utilities/base.py | Defines shared Result schema, SUCCESS/FAILURE constants, and document loading helpers for eval suites. |
| evals/utilities/metrics.py | Adds metric computation (accuracy/precision/recall/F1 + Wilson CIs) for eval results. |
| evals/utilities/reports.py | Adds report writer (CSV/JSON), baseline comparison helpers, and terminal summary output. |
| evals/utilities/init.py | Introduces the eval utilities package. |
| evals/conftest.py | Registers the --held-out flag for eval suites. |
| evals/test_run_date_extraction_evals.py | Implements the date extraction eval suite (dev vs held-out dataset selection, result capture, regression gate). |
| evals/results/dev/date_extraction_evals.json | Adds committed dev baseline metrics snapshot for gating. |
| evals/results/dev/date_extraction_evals_breakdown.csv | Adds committed dev baseline per-case breakdown for regression detection. |
| evals/results/held_out/date_extraction_evals.json | Adds committed held-out baseline metrics snapshot (no breakdown committed). |
| evals/README.md | Documents how to run evals, the dev vs held-out policy, and the eval suite structure. |
| evals/data/dev/solar/manifest.json5 | Adds dev dataset manifest for date extraction evals. |
| evals/data/dev/solar/Town_of_Pendleton_Niagara_County_New_York.txt | Adds dev dataset document via Git LFS pointer. |
| evals/data/dev/solar/Town_of_Livonia_Livingston_County_New_York.txt | Adds dev dataset document via Git LFS pointer. |
| evals/data/dev/solar/Okanogan_County_Washington.txt | Adds dev dataset document via Git LFS pointer. |
| evals/data/dev/solar/Medina_County_Ohio.txt | Adds dev dataset document via Git LFS pointer. |
| evals/data/held-out/solar/manifest.json5 | Adds held-out dataset manifest for date extraction evals. |
| evals/data/held-out/solar/Town_of_Walton_Delaware_County_New_York.txt | Adds held-out dataset document via Git LFS pointer. |
| evals/data/held-out/solar/Town_of_Niagara_Niagara_County_New_York.txt | Adds held-out dataset document via Git LFS pointer. |
| evals/data/held-out/solar/City_of_Agawam_Town_Hampden_County_Massachusetts.txt | Adds held-out dataset document via Git LFS pointer. |
| .gitattributes | Configures Git LFS tracking for evals/data/**/*.pdf and evals/data/**/*.txt. |
ppinchuk
left a comment
There was a problem hiding this comment.
Very cool! I am excited to discuss this with you and Gui tomorrow.
A few requests here and there that I hope are not too arduous. Feel free to let me kmow if anything seems out of scope
| "total_prompt_tokens": 32185, | ||
| "total_response_tokens": 4358, | ||
| "total_time_taken_s": 142.48, | ||
| "total_cost_usd": 0.0729 |
There was a problem hiding this comment.
Only 7 cents to run the entire thing? How is it so cheap?
I see that the token count is super low so maybe this number is believable, but it might be worth double checking. Also, we should add the LLM used to this dictionary. In fact, maybe more metadata about the run could be useful here. I'm thinking date of run, COMPASS version, etc.
There was a problem hiding this comment.
Probably because most of the parsing is happening via url parsing - I am exploring adding a "extraction_method" column so we can surface how the data was extracted.
There was a problem hiding this comment.
Regarding adding more metadata - I agree - that would be good to do. I have some questions about the level of details - especially for methods that might use multiple models. Let's discuss.
| def load_doc(fp, *, source=None): | ||
| """Load a local file as an elm Document, dispatching on suffix""" | ||
| fp = Path(fp) | ||
| attrs = {"source": source} if source else {} | ||
| if fp.suffix.casefold() == ".pdf": | ||
| pages = read_pdf(fp.read_bytes(), verbose=False) | ||
| return PDFDocument(pages, attrs=attrs) | ||
| text = fp.read_text(encoding="utf-8", errors="ignore") | ||
| return HTMLDocument([text], attrs=attrs) |
There was a problem hiding this comment.
IDK if it's within the scope of this PR but there are already existing utilities in this repo to read local files (specifically this:
COMPASS/compass/web/file_loader.py
Line 325 in bcd0be9
We should try to re-use as much as possible in order to avoid re-writing code and having to maintain it in multiple places
There was a problem hiding this comment.
Good suggestion. I switched to using production loader that uncovered hidden issue - the load_doc was returning empty doc if the PDF needs OCR.
| def pytest_generate_tests(metafunc): | ||
| """Parametrize ``case`` with the dataset chosen by ``--held-out`` | ||
|
|
||
| Each case gets its resolved document path stamped on as ``case["fp"]`` | ||
| so the test body doesn't need to know which dataset it came from. | ||
| """ | ||
| if "case" not in metafunc.fixturenames: | ||
| return | ||
| dataset_dir = ( | ||
| _HELD_OUT_DATASET_DIR | ||
| if metafunc.config.getoption("--held-out") | ||
| else _DEV_DATASET_DIR | ||
| ) | ||
| cases = load_config(dataset_dir / "manifest.json5") | ||
| for c in cases: | ||
| c["fp"] = dataset_dir / c["file"] | ||
| metafunc.parametrize("case", cases, ids=[c["file"] for c in cases]) |
There was a problem hiding this comment.
It's cool to see how this can be done bespoke, but I wonder if you considered using pytest-cases, which is seemingly designed to do exactly this kind of thing in a more readable way? We already have it as a dependency in our test suite, so it's just a matter of importing it and wiring it up.
There was a problem hiding this comment.
Didn't know about pytest-cases - looks like an awesome library. For this particular instance though, don't think it will help. The main issue here is that we need to load a file to create the test cases - which means the file loading needs to happen before test cases are generated (either natively by pytest, or through pytest-cases). pytest_generate_tests is one of the available standard hook for doing so. I did simplify this to be strictly about case generation and put the case modification logic to a separate fixture that runs during the test.
|
|
||
| @pytest.fixture(scope="module") | ||
| def _model_config(): | ||
| model = "compassop-gpt-5.4" |
There was a problem hiding this comment.
This should probably be read from an environment var so that users don't have to re-hard code it? I think we even have some designated environment variables for LLM model names in our tests, so it may be worth re-using those for consistency
There was a problem hiding this comment.
Hmm, this got me thinking. It seems like evals run should probably take in model config argument allowing defining different models for different tasks? With your work on splitting document extraction and document parsing as separate entry point - maybe the whole eval suite should probably just use the compass cli to do the extraction - making use of the full config.json5. Let's talk more.
|
What about managing those PDFs with pooch? Let's discuss that in the meeting later today. |
… build_local_file_loader_kwargs helper
Adding evals suite. Starting with date_extraction but the idea is to support other pieces of pipeline including whole tech.