Adopt canonical output contract; drop frontmatter + fix keying + dead --task #4
Merged
…tter, fix keying + exit codes + dead --task

Route all output writing through the shared ocr-output-contract package (pinned git+https @v0.1.0), mirroring qwen-ocr. deepseek keeps its Ollama/vLLM backend + per-page OCR + figure extraction; the contract owns where bytes go and the metadata shape, so output is byte-structure-identical to sibling engines.

Canon adopted:
- resolve_output_root: default <input-parent>/ocr/, -o overrides, never required.
- relative_key / doc_dir_for / markdown_path_for: input-relative keying, mirroring the input subtree (fixes basename collisions in nested batches).
- assemble_pages: one <stem>/<stem>.md per document under '## Page N' headers (no per-page folders, no multi-page data loss).
- DocMetadata + write_doc_metadata + RootIndex: dual-level, relpath-keyed sidecars; failures recorded with status=failed; single-file runs write both levels.
- RunOutcome: nonzero exit on any document/page failure (uniform single-file/batch).
- Empty/whitespace page response is a per-page FAILURE, not a 0-byte success.

HIGH bugs fixed:
- DROP YAML frontmatter from the .md body; provenance lives in the sidecar only.
- --task now reaches the backend prompt (threaded CLI -> process() -> backend); no longer accept-and-ignore.

Removed the dead --no-metadata flag (frontmatter is gone and the sidecar is always written per canon).

Build: convert pyproject from poetry to PEP 621 + hatchling with [tool.hatch.metadata] allow-direct-references for the git+https dependency.

Tests: tests/test_output_contract.py proves conformance via ocr_output_contract.conformance (assert_conforms + ExpectedDoc): multi-page, failure/partial, no-frontmatter, task-reaches-backend, nested-batch, image-dir. Backend is mocked; no live model. Full suite: 56 passed.

Deferred follow-ups (NOT in this commit):
- token-truncation knob (max_output_tokens / num_predict lever).
- the [[a,b,c,d]] bounding-box cleaner in utils.clean_ocr_output.
- stale installed bin / pipx copies out of sync with source.
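The page-assembly and exit-code rules in this commit can be illustrated with a small stdlib-only sketch. This is not the ocr-output-contract implementation: PageResult, AssembledDoc, assemble_pages_sketch and exit_code are hypothetical names, and omitting a failed page from the Markdown body is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageResult:
    number: int  # 1-based page number
    text: str    # raw OCR response for this page


@dataclass
class AssembledDoc:
    markdown: str
    failed_pages: List[int] = field(default_factory=list)


def assemble_pages_sketch(pages: List[PageResult]) -> AssembledDoc:
    """Join pages into one body under '## Page N' headers.

    An empty or whitespace-only response is recorded as a per-page
    failure instead of being written as an empty section.
    """
    sections = []
    failed = []
    for page in pages:
        body = page.text.strip()
        if not body:
            failed.append(page.number)  # assumption: failed pages are omitted
            continue
        sections.append(f"## Page {page.number}\n\n{body}\n")
    return AssembledDoc(markdown="\n".join(sections), failed_pages=failed)


def exit_code(docs: List[AssembledDoc]) -> int:
    """Nonzero when any document has at least one failed page."""
    return 1 if any(doc.failed_pages for doc in docs) else 0
```

With this shape a single-file run and a batch run share one exit policy, because both reduce to a list of assembled documents.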
…ge-dir misclassification, truncation

Bump the contract pin to v0.1.1 and use its new discovery/fingerprint helpers.
- HIGH: resolve the output root FIRST, then discover via iter_input_files so the default <input>/ocr/ (nested in the scan tree) is pruned from discovery and the engine never re-ingests its own .md/figure outputs on re-run.
- HIGH: a batch directory with any direct image no longer collapses to a single image-document. _is_image_dir_document treats a dir as ONE document only when it contains ONLY images (no PDFs, no subdirs); otherwise it is a batch tree and all PDFs + images are processed recursively (papers-library use case).
- MEDIUM: implement truncation detection. Backends surface finish_reason via a new process_image_with_meta; the processor wires the contract's is_truncated so a length-truncated non-empty page is recorded status=partial/failed, not completed. max_tokens is now configurable (default 8192, was a hardcoded 2048 in vLLM); Ollama gains num_predict + done_reason.
- MEDIUM: quiet mode now emits the .md path for already-completed/resumed docs (compute markdown_path_for, verify on disk, pass output_path on the skip branch).
- Pass run_fingerprint(model/backend/task/prompt) to is_completed so a re-run under a different task/prompt reprocesses instead of silently reusing cached output.
- LOW: add a --raw escape hatch so clean_ocr_output is opt-out.
- Cleanups touched while in these files: drop dead config fields (output_dir/extract_images/include_metadata), honor the log level in setup_logging, route figure links through the contract's figure_* helpers.
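The truncation rule can be sketched as a status decision per page. The names is_truncated_sketch and page_status are hypothetical, not the contract's is_truncated. The sketch assumes the backend reports "length" as its finish reason when the token limit is hit, and it maps a truncated non-empty page to "partial"; the commit itself only says partial/failed.

```python
from typing import Optional

# Assumption: backends report "length" when generation stopped at the token cap.
TRUNCATION_REASONS = {"length"}


def is_truncated_sketch(finish_reason: Optional[str]) -> bool:
    """True when the backend says generation stopped at the token limit."""
    return finish_reason is not None and finish_reason.lower() in TRUNCATION_REASONS


def page_status(text: str, finish_reason: Optional[str]) -> str:
    """Classify one page result as completed, partial, or failed."""
    if not text.strip():
        return "failed"      # empty or whitespace-only response
    if is_truncated_sketch(finish_reason):
        return "partial"     # has content, but cut off by the token limit
    return "completed"
```

A missing finish reason is treated as "not truncated" here, so a backend that does not report one behaves as before.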
…n, quiet-skip, dry-run tests

- MEDIUM: reconcile the public API. examples/ imported the removed OCRProcessor (ImportError); rewrite all three examples to the functional process() API and export process + discover_documents from the package.
- MEDIUM: dry-run now uses the same real discovery (discover_documents) and the same resolved output root as the real run, so the preview cannot diverge.
- README: drop removed flags (--extract-images/--no-metadata/-w) and the YAML frontmatter; document the contract output layout, --raw, --max-tokens, and the fingerprint-aware resume.
- Tests: add mixed-dir (PDFs + stray image), re-run output-root exclusion, truncation -> partial/failed, quiet-skip emits .md path on resume, and fingerprint-invalidates-on-task-change. Update fixtures for the new max_tokens and the batch-vs-image-dir discovery semantics.
…-checksum idempotency, richer fingerprint, LOW nits
Bump ocr-output-contract pin to @v0.1.2 and clear the round-2 review findings.
MEDIUM — image-dir vs batch-tree classification is now STABLE across runs.
_is_image_dir_document is output-root-aware: it excludes the resolved output
root (is_within_output_root) before deciding, so the default <input>/ocr/ subtree
created by run 1 no longer flips run 2 from "one image-dir document" to a per-image
batch tree (which orphaned run-1 output, emitted page_NNNN_png/ trees, and broke
resume). A first run and a re-run classify the same input identically.
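A rough sketch of the output-root-aware classification follows. It is not the engine's _is_image_dir_document: the function names and the IMAGE_SUFFIXES set are hypothetical, and ignoring stray non-image files is an assumption. It only shows why excluding the output root keeps the answer stable between runs.

```python
from pathlib import Path

# Assumption: an illustrative suffix set, not the engine's actual list.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp"}


def is_within(path: Path, root: Path) -> bool:
    """True when path is root itself or lies beneath it."""
    try:
        path.resolve().relative_to(root.resolve())
        return True
    except ValueError:
        return False


def is_image_dir_document(directory: Path, output_root: Path) -> bool:
    """Treat a directory as ONE document only if it holds images and
    no PDFs or subdirectories, ignoring the engine's own output root."""
    saw_image = False
    for entry in directory.iterdir():
        if is_within(entry, output_root):
            continue  # run-1 output must not change the run-2 decision
        if entry.is_dir() or entry.suffix.lower() == ".pdf":
            return False  # subdirs or PDFs make this a batch tree
        if entry.suffix.lower() in IMAGE_SUFFIXES:
            saw_image = True
    return saw_image
```

Without the is_within check, the ocr/ subdirectory created by the first run would count as a subdirectory and flip the second run to a batch tree.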
MEDIUM (SYS-02) — an unreadable input no longer aborts the whole batch. The
idempotency pre-check uses safe_checksum (via _safe_doc_checksum): a file that
became unreadable between discovery and processing yields None, is never skipped,
and falls through to the per-doc catch-all that records it status=failed while the
batch continues. The failure-metadata write is checksum-tolerant too.
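The checksum-tolerant pre-check might look like the sketch below. safe_checksum_sketch and should_skip are hypothetical stand-ins for safe_checksum and the idempotency check, and SHA-256 is an assumed hash. The point is that a None checksum can never satisfy the skip condition.

```python
import hashlib
from pathlib import Path
from typing import Optional


def safe_checksum_sketch(path: Path) -> Optional[str]:
    """Return a hex digest of the file, or None if it cannot be read."""
    try:
        digest = hashlib.sha256()  # assumption: the real hash may differ
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()
    except OSError:
        return None  # unreadable: let the per-document handler record the failure


def should_skip(path: Path, recorded_checksum: Optional[str]) -> bool:
    """Skip only when the file is readable and matches the recorded checksum."""
    current = safe_checksum_sketch(path)
    return current is not None and current == recorded_checksum
```

Because None never compares as a match, an unreadable file always falls through to processing, where the failure is recorded per document and the rest of the batch continues.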
Richer fingerprint — run_fingerprint now takes extra={raw, dpi, max_tokens,
analyze_figures} so a re-run with any output-affecting flag changed reprocesses
instead of reusing a stale cached result. task is dropped from the fingerprint when
a custom --prompt is set (backends ignore --task then), fixing the over-invalidation
LOW nit.
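One way to picture the fingerprint behavior is the sketch below. run_fingerprint_sketch is a hypothetical function, not the contract's run_fingerprint; the JSON-plus-SHA-256 encoding is an assumption chosen only to show how extra flags invalidate a cached result and how task is dropped when a custom prompt is set.

```python
import hashlib
import json
from typing import Any, Mapping, Optional


def run_fingerprint_sketch(
    model: str,
    backend: str,
    task: Optional[str],
    prompt: Optional[str],
    extra: Optional[Mapping[str, Any]] = None,
) -> str:
    """Hash every setting that can change the OCR output."""
    payload = {
        "model": model,
        "backend": backend,
        # With a custom prompt the task is ignored, so leave it out
        # to avoid invalidating results that cannot have changed.
        "task": None if prompt else task,
        "prompt": prompt,
        "extra": dict(extra or {}),
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys before hashing keeps the fingerprint independent of the order in which flags are passed.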
LOW — DEEPSEEK_OCR_MODEL_NAME / settings.model_name now wins when --model is omitted
(--model default is None, not a click default that shadowed the env var).
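The precedence fix can be sketched without any CLI framework. resolve_model_name and DEFAULT_MODEL are hypothetical; only the DEEPSEEK_OCR_MODEL_NAME variable name comes from the commit message. The sketch assumes the CLI passes None when --model is omitted.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "example-ocr-model"  # hypothetical fallback, not the real default


def resolve_model_name(
    cli_model: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Explicit --model wins, then the environment variable, then the default."""
    if env is None:
        env = os.environ
    if cli_model is not None:
        return cli_model
    return env.get("DEEPSEEK_OCR_MODEL_NAME") or DEFAULT_MODEL
```

If the CLI option carried a non-None default instead, the first branch would always fire and the environment variable would never be consulted, which is the shadowing described above.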
LOW — image-dir dry-run page count filters by image suffix, matching the real run
(no over-report on dirs with stray non-image files).
Tests: re-run-classification-stability, SYS-02 batch-continues, raw-flag and
same-prompt/different-task fingerprint behavior, model precedence, dry-run page
count. 57 pass (was 50). ruff check/format clean; conformance via the v0.1.2
assert_conforms harness.
Summary
Adopts the canonical OCR output contract via the shared ocr-output-contract package (v0.1.0), making this engine's output byte-structure-identical to the rest of the fleet: <input-parent>/ocr/<rel>/<stem>/<stem>.md with '## Page N' headers, no frontmatter, dual-level metadata.json (per-doc + root index, input-relative keyed), and a uniform nonzero-exit-on-failure policy.
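As a rough illustration of the layout above, the sketch below maps an input document to its Markdown path by mirroring the input subtree. markdown_path_sketch is a hypothetical helper, not the contract's markdown_path_for, and the example paths are invented.

```python
from pathlib import PurePosixPath


def markdown_path_sketch(
    input_root: PurePosixPath,
    doc_path: PurePosixPath,
    output_root: PurePosixPath,
) -> PurePosixPath:
    """Map <input_root>/<rel>/<stem>.<ext> to <output_root>/<rel>/<stem>/<stem>.md."""
    rel = doc_path.relative_to(input_root)  # input-relative key
    stem = rel.stem
    return output_root / rel.parent / stem / f"{stem}.md"


# Two documents with the same basename in different subfolders.
root = PurePosixPath("/data/papers")
out = PurePosixPath("/data/papers/ocr")
first = markdown_path_sketch(root, root / "2023" / "report.pdf", out)
second = markdown_path_sketch(root, root / "2024" / "report.pdf", out)
```

Because the key is the input-relative path and not the basename, the two report.pdf files land in separate output directories and cannot overwrite each other.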
Test plan
56 tests pass (incl. conformance via ocr_output_contract.conformance.assert_conforms), backend/API mocked (no GPU/keys needed in CI).
Part of the fleet-wide output-contract rollout (see ../ocr/docs/plans/00-output-contract). Do not merge before review.