Adopt canonical output contract; drop frontmatter + fix keying + dead --task by r-uben · Pull Request #4 · r-uben/deepseek-ocr-cli

r-uben · 2026-06-05T12:43:15Z

Summary

Adopts the canonical OCR output contract via the shared ocr-output-contract package (v0.1.0), making this engine's output byte-structure-identical to the rest of the fleet: <input-parent>/ocr/<rel>/<stem>/<stem>.md with ## Page N, no frontmatter, dual-level metadata.json (per-doc + root index, input-relative keyed), and a uniform nonzero-exit-on-failure policy.
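To make the layout above concrete, here is a minimal sketch of how an input-relative markdown path could be derived. It is not the `ocr-output-contract` implementation: the function name `sketch_markdown_path` and its parameters are hypothetical, and it assumes a directory input whose default output root is an `ocr/` folder inside the scanned directory.

```python
from pathlib import PurePosixPath
from typing import Optional


def sketch_markdown_path(
    input_root: PurePosixPath,
    doc: PurePosixPath,
    output_root: Optional[PurePosixPath] = None,
) -> PurePosixPath:
    """Hypothetical stand-in for the contract's path helpers (assumption)."""
    # Default output root is assumed to be <input_root>/ocr; an explicit root overrides it.
    root = output_root if output_root is not None else input_root / "ocr"
    # Key the document by its path relative to the input root, not by basename,
    # so two files named report.pdf in different subfolders cannot collide.
    rel = doc.relative_to(input_root)
    return root / rel.parent / doc.stem / f"{doc.stem}.md"
```

Under these assumptions, `2024/a/report.pdf` and `2024/b/report.pdf` map to distinct `report/report.md` files that mirror the input subtree.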

Adopt canonical output contract; drop frontmatter + fix keying + dead --task.

Test plan

56 tests pass (incl. conformance via ocr_output_contract.conformance.assert_conforms), backend/API mocked (no GPU/keys needed in CI).
Independently re-run green.

Part of the fleet-wide output-contract rollout (see ../ocr/docs/plans/00-output-contract). Do not merge before review.

…tter, fix keying + exit codes + dead --task

Route all output writing through the shared ocr-output-contract package (pinned git+https @v0.1.0), mirroring qwen-ocr. deepseek keeps its Ollama/vLLM backend + per-page OCR + figure extraction; the contract owns where bytes go and the metadata shape, so output is byte-structure-identical to sibling engines.

Canon adopted:
- resolve_output_root: default <input-parent>/ocr/, -o overrides, never required.
- relative_key / doc_dir_for / markdown_path_for: input-relative keying, mirroring the input subtree (fixes basename collisions in nested batches).
- assemble_pages: one <stem>/<stem>.md per document under '## Page N' headers (no per-page folders, no multi-page data loss).
- DocMetadata + write_doc_metadata + RootIndex: dual-level, relpath-keyed sidecars; failures recorded with status=failed; single-file runs write both levels.
- RunOutcome: nonzero exit on any document/page failure (uniform single-file/batch).
- Empty/whitespace page response is a per-page FAILURE, not a 0-byte success.

HIGH bugs fixed:
- DROP YAML frontmatter from the .md body; provenance lives in the sidecar only.
- --task now reaches the backend prompt (threaded CLI -> process() -> backend); no longer accept-and-ignore.

Removed the dead --no-metadata flag (frontmatter is gone and the sidecar is always written per canon).

Build: convert pyproject from poetry to PEP 621 + hatchling with [tool.hatch.metadata] allow-direct-references for the git+https dependency.

Tests: tests/test_output_contract.py proves conformance via ocr_output_contract.conformance (assert_conforms + ExpectedDoc): multi-page, failure/partial, no-frontmatter, task-reaches-backend, nested-batch, image-dir. Backend is mocked; no live model. Full suite: 56 passed.

Deferred follow-ups (NOT in this commit):
- token-truncation knob (max_output_tokens / num_predict lever).
- the [[a,b,c,d]] bounding-box cleaner in utils.clean_ocr_output.
- stale installed bin / pipx copies out of sync with source.
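The page-assembly and exit-code rules in the commit message above can be illustrated with a small sketch. The names `sketch_assemble_pages` and `sketch_exit_code` are hypothetical and do not reproduce the contract package's `assemble_pages` or `RunOutcome`; in particular, omitting a failed page from the body (rather than writing a placeholder) is an assumption.

```python
from typing import List, Tuple


def sketch_assemble_pages(pages: List[str]) -> Tuple[str, List[int]]:
    """Join per-page OCR text under '## Page N' headers (illustrative only)."""
    sections: List[str] = []
    failed: List[int] = []
    for number, text in enumerate(pages, start=1):
        if not text.strip():
            # An empty or whitespace-only response counts as a page failure,
            # not as a successful zero-byte page.
            failed.append(number)
            continue
        sections.append(f"## Page {number}\n\n{text.strip()}")
    # No YAML frontmatter: the body starts directly with the first page header.
    return "\n\n".join(sections) + "\n", failed


def sketch_exit_code(failed_pages: List[int]) -> int:
    """Nonzero exit when any page failed, for single-file and batch runs alike."""
    return 1 if failed_pages else 0
```

With this shape a three-page document whose second page comes back blank still produces one markdown file, but the run reports failure instead of silently succeeding.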

…ge-dir misclassification, truncation

Bump the contract pin to v0.1.1 and use its new discovery/fingerprint helpers.

- HIGH: resolve the output root FIRST, then discover via iter_input_files so the default <input>/ocr/ (nested in the scan tree) is pruned from discovery and the engine never re-ingests its own .md/figure outputs on re-run.
- HIGH: a batch directory with any direct image no longer collapses to a single image-document. _is_image_dir_document treats a dir as ONE document only when it contains ONLY images (no PDFs, no subdirs); otherwise it is a batch tree and all PDFs + images are processed recursively (papers-library use case).
- MEDIUM: implement truncation detection. Backends surface finish_reason via a new process_image_with_meta; the processor wires the contract's is_truncated so a length-truncated non-empty page is recorded status=partial/failed, not completed. max_tokens is now configurable (default 8192, was a hardcoded 2048 in vLLM); Ollama gains num_predict + done_reason.
- MEDIUM: quiet mode now emits the .md path for already-completed/resumed docs (compute markdown_path_for, verify on disk, pass output_path on the skip branch).
- Pass run_fingerprint(model/backend/task/prompt) to is_completed so a re-run under a different task/prompt reprocesses instead of silently reusing cached output.
- LOW: add a --raw escape hatch so clean_ocr_output is opt-out.
- Cleanups touched while in these files: drop dead config fields (output_dir/extract_images/include_metadata), honor the log level in setup_logging, route figure links through the contract's figure_* helpers.
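The image-dir versus batch-tree rule described above might look roughly like the following. This is a sketch, not the repository's `_is_image_dir_document`: the function name, the suffix set, and the choice to tolerate stray non-image, non-PDF files are all assumptions.

```python
from pathlib import Path

# Assumed set of image suffixes; the real engine's list is not shown in the PR.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp"}


def sketch_is_image_dir_document(directory: Path) -> bool:
    """Return True when the directory should be treated as ONE image document."""
    has_image = False
    for entry in directory.iterdir():
        if entry.is_dir():
            # Any subdirectory means this is a batch tree, not a single document.
            return False
        suffix = entry.suffix.lower()
        if suffix == ".pdf":
            # A PDF next to images also makes it a batch tree.
            return False
        if suffix in IMAGE_SUFFIXES:
            has_image = True
    # An empty directory (or one with no images at all) is not an image document.
    return has_image
```

A folder of scanned pages is therefore one document, while a papers library with PDFs and a stray screenshot is walked recursively as a batch.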

…n, quiet-skip, dry-run tests

- MEDIUM: reconcile the public API. examples/ imported the removed OCRProcessor (ImportError); rewrite all three examples to the functional process() API and export process + discover_documents from the package.
- MEDIUM: dry-run now uses the same real discovery (discover_documents) and the same resolved output root as the real run, so the preview cannot diverge.
- README: drop removed flags (--extract-images/--no-metadata/-w) and the YAML frontmatter; document the contract output layout, --raw, --max-tokens, and the fingerprint-aware resume.
- Tests: add mixed-dir (PDFs + stray image), re-run output-root exclusion, truncation -> partial/failed, quiet-skip emits .md path on resume, and fingerprint-invalidates-on-task-change. Update fixtures for the new max_tokens and the batch-vs-image-dir discovery semantics.

…-checksum idempotency, richer fingerprint, LOW nits

Bump ocr-output-contract pin to @v0.1.2 and clear the round-2 review findings.

MEDIUM: image-dir vs batch-tree classification is now STABLE across runs. _is_image_dir_document is output-root-aware: it excludes the resolved output root (is_within_output_root) before deciding, so the default <input>/ocr/ subtree created by run 1 no longer flips run 2 from "one image-dir document" to a per-image batch tree (which orphaned run-1 output, emitted page_NNNN_png/ trees, and broke resume). A first run and a re-run classify the same input identically.

MEDIUM (SYS-02): an unreadable input no longer aborts the whole batch. The idempotency pre-check uses safe_checksum (via _safe_doc_checksum): a file that became unreadable between discovery and processing yields None, is never skipped, and falls through to the per-doc catch-all that records it status=failed while the batch continues. The failure-metadata write is checksum-tolerant too.

Richer fingerprint: run_fingerprint now takes extra={raw, dpi, max_tokens, analyze_figures} so a re-run with any output-affecting flag changed reprocesses instead of reusing a stale cached result. task is dropped from the fingerprint when a custom --prompt is set (backends ignore --task then), fixing the over-invalidation LOW nit.

LOW: DEEPSEEK_OCR_MODEL_NAME / settings.model_name now wins when --model is omitted (--model default is None, not a click default that shadowed the env var).

LOW: image-dir dry-run page count filters by image suffix, matching the real run (no over-report on dirs with stray non-image files).

Tests: re-run-classification-stability, SYS-02 batch-continues, raw-flag and same-prompt/different-task fingerprint behavior, model precedence, dry-run page count. 57 pass (was 50). ruff check/format clean; conformance via the v0.1.2 assert_conforms harness.
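The fingerprint behavior described in this commit can be sketched as a stable hash over the output-affecting settings. This is an illustration under assumptions, not the contract package's `run_fingerprint`: the name `sketch_run_fingerprint`, the JSON-plus-SHA-256 encoding, and the parameter shapes are all hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def sketch_run_fingerprint(
    model: str,
    backend: str,
    task: str,
    prompt: Optional[str],
    extra: Dict[str, Any],
) -> str:
    """Hash the settings that affect OCR output (illustrative only)."""
    payload = {
        "model": model,
        "backend": backend,
        # With a custom prompt the task is ignored by the backend, so it is
        # left out of the fingerprint to avoid needless reprocessing.
        "task": None if prompt else task,
        "prompt": prompt,
        "extra": extra,
    }
    # sort_keys gives a deterministic encoding regardless of dict ordering.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Comparing the stored fingerprint with the current one is then enough to decide whether a completed document can be skipped or must be reprocessed.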

r-uben added 9 commits June 5, 2026 13:00

Remove orphaned metadata.py (superseded by ocr-output-contract)

72539e5

Apply ruff format (satisfy CI format check)

5f2563a

Add CI workflow (lint/format/test gate)

773e340

Fix CI: cache off pyproject, drop --cov (no pytest-cov)

3995ec0

Adopt ocr-output-contract v0.1.3; valid failure-checksum sentinel

db8fb40

r-uben merged commit ddfce7a into main Jun 6, 2026
2 checks passed

r-uben deleted the fix/deepseek-canon branch June 6, 2026 11:35
