Adopt canonical output contract; drop frontmatter + fix keying + dead --task #4

Merged
r-uben merged 9 commits into main from fix/deepseek-canon on Jun 6, 2026

Conversation

@r-uben (Owner) commented Jun 5, 2026

Summary

Adopts the canonical OCR output contract via the shared ocr-output-contract package (v0.1.0), making this engine's output byte-structure-identical to the rest of the fleet: <input-parent>/ocr/<rel>/<stem>/<stem>.md with ## Page N, no frontmatter, dual-level metadata.json (per-doc + root index, input-relative keyed), and a uniform nonzero-exit-on-failure policy.

Test plan

  • 56 tests pass (incl. conformance via ocr_output_contract.conformance.assert_conforms), backend/API mocked (no GPU/keys needed in CI).
  • Independently re-run green.

Part of the fleet-wide output-contract rollout (see ../ocr/docs/plans/00-output-contract). Do not merge before review.

r-uben added 9 commits June 5, 2026 13:00
…tter, fix keying + exit codes + dead --task

Route all output writing through the shared ocr-output-contract package (pinned
git+https @v0.1.0), mirroring qwen-ocr. deepseek keeps its Ollama/vLLM backend +
per-page OCR + figure extraction; the contract owns where bytes go and the
metadata shape, so output is byte-structure-identical to sibling engines.

Canon adopted:
- resolve_output_root: default <input-parent>/ocr/, -o overrides, never required.
- relative_key / doc_dir_for / markdown_path_for: input-relative keying, mirroring
  the input subtree (fixes basename collisions in nested batches).
- assemble_pages: one <stem>/<stem>.md per document under '## Page N' headers
  (no per-page folders, no multi-page data loss).
- DocMetadata + write_doc_metadata + RootIndex: dual-level, relpath-keyed sidecars;
  failures recorded with status=failed; single-file runs write both levels.
- RunOutcome: nonzero exit on any document/page failure (uniform single-file/batch).
- Empty/whitespace page response is a per-page FAILURE, not a 0-byte success.
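The layout rules above can be sketched in a few lines. This is only an illustration of the described behaviour, not the ocr-output-contract API: the `sketch_*` function names, their signatures, and the choice to omit a failed page from the assembled body are all assumptions.

```python
from __future__ import annotations

from pathlib import Path


def sketch_markdown_path(output_root: Path, scan_root: Path, doc: Path) -> Path:
    """Map an input file to <output_root>/<rel>/<stem>/<stem>.md.

    <rel> is the document's directory relative to the scan root, so two
    files with the same basename in different subfolders cannot collide.
    """
    rel_dir = doc.relative_to(scan_root).parent
    return output_root / rel_dir / doc.stem / f"{doc.stem}.md"


def sketch_assemble_pages(pages: list[str]) -> tuple[str, list[int]]:
    """Join per-page text under '## Page N' headers.

    Returns the Markdown body plus the 1-based numbers of pages whose
    response was empty or whitespace-only; those count as failures
    (assumption: a failed page is left out of the body).
    """
    parts: list[str] = []
    failed: list[int] = []
    for number, text in enumerate(pages, start=1):
        if not text.strip():
            failed.append(number)
            continue
        parts.append(f"## Page {number}\n\n{text.strip()}\n")
    return "\n".join(parts), failed
```

With a scan root of `papers/` and an output root of `papers/ocr/`, the inputs `papers/a/x.pdf` and `papers/b/x.pdf` land in `papers/ocr/a/x/x.md` and `papers/ocr/b/x/x.md` respectively, which is the basename-collision case the keying change addresses.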

HIGH bugs fixed:
- DROP YAML frontmatter from the .md body; provenance lives in the sidecar only.
- --task now reaches the backend prompt (threaded CLI -> process() -> backend);
  no longer accept-and-ignore. Removed the dead --no-metadata flag (frontmatter
  is gone and the sidecar is always written per canon).

Build: convert pyproject from poetry to PEP 621 + hatchling with
[tool.hatch.metadata] allow-direct-references for the git+https dependency.

Tests: tests/test_output_contract.py proves conformance via
ocr_output_contract.conformance (assert_conforms + ExpectedDoc): multi-page,
failure/partial, no-frontmatter, task-reaches-backend, nested-batch, image-dir.
Backend is mocked; no live model. Full suite: 56 passed.

Deferred follow-ups (NOT in this commit):
- token-truncation knob (max_output_tokens / num_predict lever).
- the [[a,b,c,d]] bounding-box cleaner in utils.clean_ocr_output.
- stale installed bin / pipx copies out of sync with source.
…ge-dir misclassification, truncation

Bump the contract pin to v0.1.1 and use its new discovery/fingerprint helpers.

- HIGH: resolve the output root FIRST, then discover via iter_input_files so the
  default <input>/ocr/ (nested in the scan tree) is pruned from discovery and the
  engine never re-ingests its own .md/figure outputs on re-run.
- HIGH: a batch directory with any direct image no longer collapses to a single
  image-document. _is_image_dir_document treats a dir as ONE document only when it
  contains ONLY images (no PDFs, no subdirs); otherwise it is a batch tree and all
  PDFs + images are processed recursively (papers-library use case).
- MEDIUM: implement truncation detection. Backends surface finish_reason via a new
  process_image_with_meta; the processor wires the contract's is_truncated so a
  length-truncated non-empty page is recorded status=partial/failed, not completed.
  max_tokens is now configurable (default 8192, was a hardcoded 2048 in vLLM);
  Ollama gains num_predict + done_reason.
- MEDIUM: quiet mode now emits the .md path for already-completed/resumed docs
  (compute markdown_path_for, verify on disk, pass output_path on the skip branch).
- Pass run_fingerprint(model/backend/task/prompt) to is_completed so a re-run under
  a different task/prompt reprocesses instead of silently reusing cached output.
- LOW: add a --raw escape hatch so clean_ocr_output is opt-out.
- Cleanups touched while in these files: drop dead config fields
  (output_dir/extract_images/include_metadata), honor the log level in
  setup_logging, route figure links through the contract's figure_* helpers.
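The truncation rule in this commit can be pictured as a small status function. This is a sketch under stated assumptions: `sketch_page_status` is a hypothetical name, the mapping of a truncated page to `partial` (rather than `failed`) is a guess at one of the two outcomes mentioned above, and the `"length"` value follows the convention that OpenAI-compatible servers use for `finish_reason` and that Ollama appears to use for `done_reason`.

```python
from __future__ import annotations


def sketch_page_status(text: str, finish_reason: str | None) -> str:
    """Classify one page response.

    - empty or whitespace-only text: failed (never a 0-byte success)
    - non-empty text cut off by the token limit: partial
    - anything else: completed
    """
    if not text.strip():
        return "failed"
    if finish_reason == "length":
        # The backend stopped because it hit max_tokens / num_predict.
        return "partial"
    return "completed"
```

The point of the design is that a non-empty page is no longer assumed to be complete: the stop reason reported by the backend decides that.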
…n, quiet-skip, dry-run tests

- MEDIUM: reconcile the public API. examples/ imported the removed OCRProcessor
  (ImportError); rewrite all three examples to the functional process() API and
  export process + discover_documents from the package.
- MEDIUM: dry-run now uses the same real discovery (discover_documents) and the
  same resolved output root as the real run, so the preview cannot diverge.
- README: drop removed flags (--extract-images/--no-metadata/-w) and the YAML
  frontmatter; document the contract output layout, --raw, --max-tokens, and the
  fingerprint-aware resume.
- Tests: add mixed-dir (PDFs + stray image), re-run output-root exclusion,
  truncation -> partial/failed, quiet-skip emits .md path on resume, and
  fingerprint-invalidates-on-task-change. Update fixtures for the new max_tokens
  and the batch-vs-image-dir discovery semantics.
…-checksum idempotency, richer fingerprint, LOW nits

Bump ocr-output-contract pin to @v0.1.2 and clear the round-2 review findings.

MEDIUM — image-dir vs batch-tree classification is now STABLE across runs.
_is_image_dir_document is output-root-aware: it excludes the resolved output
root (is_within_output_root) before deciding, so the default <input>/ocr/ subtree
created by run 1 no longer flips run 2 from "one image-dir document" to a per-image
batch tree (which orphaned run-1 output, emitted page_NNNN_png/ trees, and broke
resume). A first run and a re-run classify the same input identically.
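A minimal sketch of an output-root-aware classifier, to make the stability argument concrete. It is not the engine's `_is_image_dir_document`: the function names, the suffix set, and the tolerance for stray non-image files are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image suffixes; the engine's real list may differ.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp"}


def _is_within(path: Path, root: Path) -> bool:
    """True when path is root itself or lives somewhere below it."""
    try:
        path.resolve().relative_to(root.resolve())
    except ValueError:
        return False
    return True


def sketch_is_image_dir_document(directory: Path, output_root: Path) -> bool:
    """Treat a directory as ONE document only if it holds images and
    neither PDFs nor subdirectories, ignoring the output root."""
    saw_image = False
    for entry in directory.iterdir():
        if _is_within(entry, output_root):
            continue  # output from an earlier run must not change the answer
        if entry.is_dir():
            return False
        suffix = entry.suffix.lower()
        if suffix == ".pdf":
            return False
        if suffix in IMAGE_SUFFIXES:
            saw_image = True
    return saw_image
```

Because the `ocr/` subtree is skipped before any decision is made, the directory looks the same to the classifier on the first run and on every re-run.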

MEDIUM (SYS-02) — an unreadable input no longer aborts the whole batch. The
idempotency pre-check uses safe_checksum (via _safe_doc_checksum): a file that
became unreadable between discovery and processing yields None, is never skipped,
and falls through to the per-doc catch-all that records it status=failed while the
batch continues. The failure-metadata write is checksum-tolerant too.
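The checksum-tolerant pre-check might look roughly like this. The names `sketch_safe_checksum` and `sketch_should_skip` and the use of SHA-256 are assumptions, not the contract's actual `safe_checksum`; the point is only that an unreadable file yields `None` and `None` never matches a recorded checksum.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sketch_safe_checksum(path: Path) -> Optional[str]:
    """Return a hex digest of the file, or None if it cannot be read."""
    try:
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large PDFs are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()
    except OSError:
        return None


def sketch_should_skip(path: Path, recorded_checksum: Optional[str]) -> bool:
    """Skip only when the file is readable AND matches the recorded checksum."""
    current = sketch_safe_checksum(path)
    return current is not None and current == recorded_checksum
```

A file that disappears between discovery and processing is therefore never skipped; it proceeds to normal processing, where the per-document error handling can record it as failed without stopping the batch.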

Richer fingerprint — run_fingerprint now takes extra={raw, dpi, max_tokens,
analyze_figures} so a re-run with any output-affecting flag changed reprocesses
instead of reusing a stale cached result. task is dropped from the fingerprint when
a custom --prompt is set (backends ignore --task then), fixing the over-invalidation
LOW nit.
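One way to picture the fingerprint rule is a hash over a canonical JSON payload. This is a sketch only: `sketch_run_fingerprint`, its signature, and the JSON-plus-SHA-256 encoding are assumptions and may not match how the contract's `run_fingerprint` is built.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def sketch_run_fingerprint(
    model: str,
    backend: str,
    task: str,
    prompt: Optional[str] = None,
    extra: Optional[Dict[str, Any]] = None,
) -> str:
    """Hash every setting that can change the OCR output."""
    payload = {
        "model": model,
        "backend": backend,
        # A custom prompt overrides the task, so the task must not
        # invalidate cached output when a prompt is set.
        "task": None if prompt else task,
        "prompt": prompt,
        "extra": extra or {},
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Sorting the keys keeps the hash independent of dictionary ordering, so only a real change in `raw`, `dpi`, `max_tokens`, `analyze_figures`, or the other inputs triggers reprocessing.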

LOW — DEEPSEEK_OCR_MODEL_NAME / settings.model_name now wins when --model is omitted
(--model default is None, not a click default that shadowed the env var).
LOW — image-dir dry-run page count filters by image suffix, matching the real run
(no over-report on dirs with stray non-image files).
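The model-precedence fix amounts to resolving the name in a fixed order once the CLI default is `None`. A minimal sketch, assuming a hypothetical `sketch_resolve_model` helper and a placeholder fallback name; only the `DEEPSEEK_OCR_MODEL_NAME` variable comes from the commit message.

```python
import os
from typing import Mapping, Optional


def sketch_resolve_model(
    cli_model: Optional[str],
    environ: Optional[Mapping[str, str]] = None,
    fallback: str = "fallback-model",  # placeholder, not the real default
) -> str:
    """Precedence: explicit --model, then the environment, then a fallback."""
    environ = os.environ if environ is None else environ
    if cli_model:
        return cli_model
    return environ.get("DEEPSEEK_OCR_MODEL_NAME") or fallback
```

If the CLI option carried a non-`None` default, the first branch would always win and the environment variable could never take effect, which is the shadowing described above.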

Tests: re-run-classification-stability, SYS-02 batch-continues, raw-flag and
same-prompt/different-task fingerprint behavior, model precedence, dry-run page
count. 57 pass (was 50). ruff check/format clean; conformance via the v0.1.2
assert_conforms harness.
@r-uben merged commit ddfce7a into main Jun 6, 2026
2 checks passed
@r-uben deleted the fix/deepseek-canon branch June 6, 2026 11:35