diff --git a/.agents/skills/protocol-migration/SKILL.md b/.agents/skills/protocol-migration/SKILL.md
index 55ad0a5..7b806de 100644
--- a/.agents/skills/protocol-migration/SKILL.md
+++ b/.agents/skills/protocol-migration/SKILL.md
@@ -1,75 +1,113 @@
 ---
 name: protocol-migration
-description: Convert legacy/source.txt into README.md using the repository template, preserving scientific meaning and marking uncertainty with CHECK:.
+description: Convert legacy/source.md into README.md using the repository template, using source.txt only as fallback and marking uncertainty with CHECK:.
 ---
 
 Use this skill when migrating a legacy protocol into this repository template.
 
 Goal:
-Convert `legacy/source.txt` into `README.md`, using the existing `README.md` as the target template and structure.
+Convert `legacy/source.md` into `README.md`, using the existing `README.md` as the target template and structure.
 
-Primary sources:
-- `legacy/source.txt` is the main source
-- also consult the PDF in `legacy/` for tables, layout-dependent content, and anything unclear
+## Primary rule
+Do not change protocol meaning.
+Use `legacy/source.md` as the primary source when rewriting `README.md`.
+Use `legacy/source.txt` only as a fallback when `legacy/source.md` looks malformed, incomplete, or unclear.
+Use the PDF file in `legacy/` as the final reference source of truth for tables, figures, layout-dependent content, and anything still ambiguous.
+If `legacy/source.md` and `legacy/source.txt` disagree, prefer `legacy/source.md` for general structure and prose, but use the original PDF as the final tie-breaker.
 
-Core rules:
-- Do not change protocol meaning
-- Do not invent missing information
-- Do not delete source content
-- Do not silently summarize, compress, or merge steps
-- Preserve exact reagent names, quantities, timings, temperatures, and conditions unless only formatting is being normalized
-- Preserve step order unless the source clearly indicates otherwise
-- If anything is uncertain, mark it with `CHECK:` instead of guessing
+## Migration behavior
+When converting legacy protocol content into the repository template:
 
-If content does not fit cleanly:
-- place it under `# Migration notes` or `## Unplaced content`
+- Preserve all protocol content.
+- Preserve all procedural content, warnings, notes, reagent names, quantities, timings, temperatures, and conditions, preserving their location.
+- Do not change scientific meaning.
+- Do not invent missing information.
+- Do not invent missing values or steps.
+- Do not delete any content from the source.
+- Do not silently summarize, compress, or merge steps.
+- Keep exact reagent names, quantities, temperatures, timings, and conditions unless only formatting is being normalized.
+- Preserve exact reagent and equipment names unless only formatting is changing.
+- Preserve the step order from the source unless the source clearly indicates otherwise.
+- Do not delete repeated warnings or notes.
+- If any text does not fit cleanly into the template, place it under `# Migration notes` or `## Unplaced content`.
+- Mark uncertainty with `CHECK:` instead of guessing.
 
-Allowed formatting normalization only when meaning is unchanged:
-- add a space between numbers and units
-- standardize temperature formatting to `37 °C`
-- standardize volumes to `µL`, `mL`, `L`
-- standardize concentrations to `mM`, `µM`, `nM`, `% (w/v)`
-- standardize time units to `seconds`, `minutes`, `hours`
-- standardize pH formatting to `pH 7.4`
-- normalize bullets, headings, and markdown tables to match the template
-- use tables for reaction mixes and other tabular content
-- use HTML subscripts for chemical formulas where needed
-- normalize note-like text to blockquote style, e.g. `> **Note**`
+## Allowed formatting normalization
+You may normalize formatting only when the meaning is unchanged and unambiguous:
 
-Do not:
-- infer omitted values
-- replace vague wording like `overnight` or `room temperature` with precise values
-- reorder steps unless clearly justified by the source
-- remove repeated warnings or notes
-- replace one reagent with another
-- omit unmapped text
+- Add a space between numbers and units.
+- Standardize temperature formatting to `37 °C`.
+- Standardize volume units to `µL`, `mL`, `L`, using the micro sign `µ` consistently.
+- Standardize concentration units to `mM`, `µM`, `nM`, `% (w/v)`, etc., using the micro sign `µ` consistently.
+- Standardize time units to full words: `seconds`, `minutes`, `hours`.
+- Standardize chemical names to match the source but with consistent formatting (e.g. `Tris-HCl` instead of `Tris HCl`).
+- Standardize pH formatting to `pH 7.4`.
+- Standardize chemical formulas with HTML subscripts, for example H2O to H<sub>2</sub>O. Similarly for other chemical formulas (e.g. MgCl2 to MgCl<sub>2</sub>).
+- Do not use Unicode subscript characters such as `₂`.
+- Standardize `RNAseq` or `RNA-Seq` to `RNA-seq`. Same for `ChIP-seq`, `ATAC-seq`, etc.
+- Normalize bullet formatting and markdown table formatting.
+- Normalize heading structure to match the repository template.
+- For reaction mixes and anything tabular, place them inside a table as in template.
+- Normalize markdown headings, bullets, and tables.
+- "Note" or "NOTE" or "Optional" or "Recommended" or "Warning" are normalized to start with `>` (example `> **Note**`) and are placed immediately after the step they refer to, or at the end of the protocol if they clearly refer to the whole protocol.
+- Remove empty columns from tables.
+- Synchronize `Contents` with actual headings in the protocol.
 
-Output requirements:
+## Disallowed changes
+- Do not infer omitted concentrations, times, temperatures, or volumes.
+- Do not infer values for missing quantities.
+- Do not try to calculate or infer values that are not explicitly stated.
+- Do not convert `overnight`, `RT`, `briefly`, `room temperature` or similar vague language into precise values.
+- Do not replace vague language with precise values.
+- Do not reorder steps unless the source clearly numbers them in that order.
+- Do not remove duplicate-looking content unless it is truly identical and both copies are preserved in review notes.
+- Do not rewrite scientific wording for style if that risks changing meaning.
+- Do not fill in table cells with values that are missing from the source.
+- Do not replace one reagent name with another.
+- Do not remove repeated warnings or notes.
+- Do not omit unmapped text.
+
+## Output requirements
 - edit `README.md`
-- use the template headings
-- keep the template badge at the top
-- remove the template instruction note
+- use the template headings exactly
+- use the template headings in `README.md`
+- keep all source content
+- add `CHECK:` markers for uncertainty
+- use `CHECK:` only for genuine unresolved uncertainty. If no uncertainty remains, do not mention `CHECK:` at all
+- add `# Migration notes`
+- after drafting, add a short summary in `# Migration notes` covering:
+  - formatting normalizations performed
+  - ambiguities and uncertainty flagged
+  - content placed in `## Unplaced content`
 - add `# Migration notes` including:
   - imported protocol metadata from `source-metadata.yml` if present
+  - imported protocol metadata from `source-metadata.yml` using only the non-blank lines
   - template metadata from `template-metadata.yml`
   - ambiguous mappings
-  - formatting normalizations performed
+  - normalized formatting changes
   - content copied verbatim but not confidently placed
+- keep the template badge at the top
+- keep ![Created with ulelab Protocol Template](https://img.shields.io/badge/created%20with-ulelab%20Protocol%20Template-blue) at the top of the file
+- remove the template instruction note
+- delete the "Template repository: Click `Use this template` to create a new protocol repo..." note
 
+## Verification
 After drafting, verify the migration against the source:
-- compare the migrated `README.md` against `legacy/source.txt`
-- - compare the migrated `README.md` against the PDF in `legacy/`
+- compare the migrated `README.md` against `legacy/source.md`
+- compare any malformed, incomplete, or ambiguous passages against `legacy/source.txt`
+- compare the migrated `README.md` against the PDF in `legacy/` for tables, figures, layout-dependent content, and any remaining ambiguity
 - check that all protocol steps, notes, warnings, reagent names, quantities, temperatures, timings, and conditions are still present
 - check that no source content has been silently omitted, merged, or reordered without justification
 - check any tables, layout-dependent content, or ambiguous sections against the PDF in `legacy/`
 - leave `CHECK:` anywhere the mapping is uncertain rather than guessing
 
 Verification checklist:
-- `README.md` still matches the scientific content of `legacy/source.txt`
+- `README.md` still matches the scientific content of `legacy/source.md`
+- any malformed, incomplete, or ambiguous passages were cross-checked against `legacy/source.txt`
 - no protocol steps or warnings were omitted
 - no values were invented or made more precise than in the source
 - tables and layout-dependent content were checked against the PDF in `legacy/`
 - any uncertain mappings are marked with `CHECK:`
 - any meaningful normalization choices are noted in `# Migration notes`
 
-Prefer preserving meaning over making the output prettier.
\ No newline at end of file
+Prefer preserving meaning over making the output prettier.
diff --git a/.claude/.claude/skills/protocol-migration/SKILL.md b/.claude/.claude/skills/protocol-migration/SKILL.md
index 55ad0a5..7b806de 100644
--- a/.claude/.claude/skills/protocol-migration/SKILL.md
+++ b/.claude/.claude/skills/protocol-migration/SKILL.md
@@ -1,75 +1,113 @@
 ---
 name: protocol-migration
-description: Convert legacy/source.txt into README.md using the repository template, preserving scientific meaning and marking uncertainty with CHECK:.
+description: Convert legacy/source.md into README.md using the repository template, using source.txt only as fallback and marking uncertainty with CHECK:.
 ---
 
 Use this skill when migrating a legacy protocol into this repository template.
 
 Goal:
-Convert `legacy/source.txt` into `README.md`, using the existing `README.md` as the target template and structure.
+Convert `legacy/source.md` into `README.md`, using the existing `README.md` as the target template and structure.
 
-Primary sources:
-- `legacy/source.txt` is the main source
-- also consult the PDF in `legacy/` for tables, layout-dependent content, and anything unclear
+## Primary rule
+Do not change protocol meaning.
+Use `legacy/source.md` as the primary source when rewriting `README.md`.
+Use `legacy/source.txt` only as a fallback when `legacy/source.md` looks malformed, incomplete, or unclear.
+Use the PDF file in `legacy/` as the final reference source of truth for tables, figures, layout-dependent content, and anything still ambiguous.
+If `legacy/source.md` and `legacy/source.txt` disagree, prefer `legacy/source.md` for general structure and prose, but use the original PDF as the final tie-breaker.
 
-Core rules:
-- Do not change protocol meaning
-- Do not invent missing information
-- Do not delete source content
-- Do not silently summarize, compress, or merge steps
-- Preserve exact reagent names, quantities, timings, temperatures, and conditions unless only formatting is being normalized
-- Preserve step order unless the source clearly indicates otherwise
-- If anything is uncertain, mark it with `CHECK:` instead of guessing
+## Migration behavior
+When converting legacy protocol content into the repository template:
 
-If content does not fit cleanly:
-- place it under `# Migration notes` or `## Unplaced content`
+- Preserve all protocol content.
+- Preserve all procedural content, warnings, notes, reagent names, quantities, timings, temperatures, and conditions, preserving their location.
+- Do not change scientific meaning.
+- Do not invent missing information.
+- Do not invent missing values or steps.
+- Do not delete any content from the source.
+- Do not silently summarize, compress, or merge steps.
+- Keep exact reagent names, quantities, temperatures, timings, and conditions unless only formatting is being normalized.
+- Preserve exact reagent and equipment names unless only formatting is changing.
+- Preserve the step order from the source unless the source clearly indicates otherwise.
+- Do not delete repeated warnings or notes.
+- If any text does not fit cleanly into the template, place it under `# Migration notes` or `## Unplaced content`.
+- Mark uncertainty with `CHECK:` instead of guessing.
 
-Allowed formatting normalization only when meaning is unchanged:
-- add a space between numbers and units
-- standardize temperature formatting to `37 °C`
-- standardize volumes to `µL`, `mL`, `L`
-- standardize concentrations to `mM`, `µM`, `nM`, `% (w/v)`
-- standardize time units to `seconds`, `minutes`, `hours`
-- standardize pH formatting to `pH 7.4`
-- normalize bullets, headings, and markdown tables to match the template
-- use tables for reaction mixes and other tabular content
-- use HTML subscripts for chemical formulas where needed
-- normalize note-like text to blockquote style, e.g. `> **Note**`
+## Allowed formatting normalization
+You may normalize formatting only when the meaning is unchanged and unambiguous:
 
-Do not:
-- infer omitted values
-- replace vague wording like `overnight` or `room temperature` with precise values
-- reorder steps unless clearly justified by the source
-- remove repeated warnings or notes
-- replace one reagent with another
-- omit unmapped text
+- Add a space between numbers and units.
+- Standardize temperature formatting to `37 °C`.
+- Standardize volume units to `µL`, `mL`, `L`, using the micro sign `µ` consistently.
+- Standardize concentration units to `mM`, `µM`, `nM`, `% (w/v)`, etc., using the micro sign `µ` consistently.
+- Standardize time units to full words: `seconds`, `minutes`, `hours`.
+- Standardize chemical names to match the source but with consistent formatting (e.g. `Tris-HCl` instead of `Tris HCl`).
+- Standardize pH formatting to `pH 7.4`.
+- Standardize chemical formulas with HTML subscripts, for example H2O to H<sub>2</sub>O. Similarly for other chemical formulas (e.g. MgCl2 to MgCl<sub>2</sub>).
+- Do not use Unicode subscript characters such as `₂`.
+- Standardize `RNAseq` or `RNA-Seq` to `RNA-seq`. Same for `ChIP-seq`, `ATAC-seq`, etc.
+- Normalize bullet formatting and markdown table formatting.
+- Normalize heading structure to match the repository template.
+- For reaction mixes and anything tabular, place them inside a table as in template.
+- Normalize markdown headings, bullets, and tables.
+- "Note" or "NOTE" or "Optional" or "Recommended" or "Warning" are normalized to start with `>` (example `> **Note**`) and are placed immediately after the step they refer to, or at the end of the protocol if they clearly refer to the whole protocol.
+- Remove empty columns from tables.
+- Synchronize `Contents` with actual headings in the protocol.
 
-Output requirements:
+## Disallowed changes
+- Do not infer omitted concentrations, times, temperatures, or volumes.
+- Do not infer values for missing quantities.
+- Do not try to calculate or infer values that are not explicitly stated.
+- Do not convert `overnight`, `RT`, `briefly`, `room temperature` or similar vague language into precise values.
+- Do not replace vague language with precise values.
+- Do not reorder steps unless the source clearly numbers them in that order.
+- Do not remove duplicate-looking content unless it is truly identical and both copies are preserved in review notes.
+- Do not rewrite scientific wording for style if that risks changing meaning.
+- Do not fill in table cells with values that are missing from the source.
+- Do not replace one reagent name with another.
+- Do not remove repeated warnings or notes.
+- Do not omit unmapped text.
+
+## Output requirements
 - edit `README.md`
-- use the template headings
-- keep the template badge at the top
-- remove the template instruction note
+- use the template headings exactly
+- use the template headings in `README.md`
+- keep all source content
+- add `CHECK:` markers for uncertainty
+- use `CHECK:` only for genuine unresolved uncertainty. If no uncertainty remains, do not mention `CHECK:` at all
+- add `# Migration notes`
+- after drafting, add a short summary in `# Migration notes` covering:
+  - formatting normalizations performed
+  - ambiguities and uncertainty flagged
+  - content placed in `## Unplaced content`
 - add `# Migration notes` including:
   - imported protocol metadata from `source-metadata.yml` if present
+  - imported protocol metadata from `source-metadata.yml` using only the non-blank lines
   - template metadata from `template-metadata.yml`
   - ambiguous mappings
-  - formatting normalizations performed
+  - normalized formatting changes
   - content copied verbatim but not confidently placed
+- keep the template badge at the top
+- keep ![Created with ulelab Protocol Template](https://img.shields.io/badge/created%20with-ulelab%20Protocol%20Template-blue) at the top of the file
+- remove the template instruction note
+- delete the "Template repository: Click `Use this template` to create a new protocol repo..." note
 
+## Verification
 After drafting, verify the migration against the source:
-- compare the migrated `README.md` against `legacy/source.txt`
-- - compare the migrated `README.md` against the PDF in `legacy/`
+- compare the migrated `README.md` against `legacy/source.md`
+- compare any malformed, incomplete, or ambiguous passages against `legacy/source.txt`
+- compare the migrated `README.md` against the PDF in `legacy/` for tables, figures, layout-dependent content, and any remaining ambiguity
 - check that all protocol steps, notes, warnings, reagent names, quantities, temperatures, timings, and conditions are still present
 - check that no source content has been silently omitted, merged, or reordered without justification
 - check any tables, layout-dependent content, or ambiguous sections against the PDF in `legacy/`
 - leave `CHECK:` anywhere the mapping is uncertain rather than guessing
 
 Verification checklist:
-- `README.md` still matches the scientific content of `legacy/source.txt`
+- `README.md` still matches the scientific content of `legacy/source.md`
+- any malformed, incomplete, or ambiguous passages were cross-checked against `legacy/source.txt`
 - no protocol steps or warnings were omitted
 - no values were invented or made more precise than in the source
 - tables and layout-dependent content were checked against the PDF in `legacy/`
 - any uncertain mappings are marked with `CHECK:`
 - any meaningful normalization choices are noted in `# Migration notes`
 
-Prefer preserving meaning over making the output prettier.
\ No newline at end of file
+Prefer preserving meaning over making the output prettier.
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 4026df8..f312e0e 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -4,18 +4,26 @@ This repository stores laboratory protocols in Markdown in `README.md`.
 
 ## Primary rule
 Do not change protocol meaning.
-Use `legacy/source.txt` as the primary source when rewriting `README.md`.
-Also consult the PDF file in `legacy/` as the reference source for tables, layout-dependent content, and anything unclear.
+Use `legacy/source.md` as the primary source when rewriting `README.md`.
+Use `legacy/source.txt` only as a fallback when `legacy/source.md` looks malformed, incomplete, or unclear.
+Use the PDF file in `legacy/` as the final reference source of truth for tables, figures, layout-dependent content, and anything still ambiguous.
+If `legacy/source.md` and `legacy/source.txt` disagree, prefer `legacy/source.md` for general structure and prose, but use the original PDF as the final tie-breaker.
 
 ## Migration behavior
-When converting legacy protocol text into the repository template:
+When converting legacy protocol content into the repository template:
 
+- Preserve all protocol content.
 - Preserve all procedural content, warnings, notes, reagent names, quantities, timings, temperatures, and conditions, preserving their location.
+- Do not change scientific meaning.
+- Do not invent missing information.
 - Do not invent missing values or steps.
 - Do not delete any content from the source.
 - Do not silently summarize, compress, or merge steps.
+- Keep exact reagent names, quantities, temperatures, timings, and conditions unless only formatting is being normalized.
 - If text does not map cleanly into the template, place it under `# Migration notes` or `## Unplaced content`.
 - If any interpretation is uncertain, mark it with `CHECK:` rather than guessing.
+- Do not delete repeated warnings or notes.
+- Preserve the step order from the source unless the source clearly indicates otherwise.
 - Preserve exact reagent and equipment names unless only formatting is changing.
 
 ## Allowed formatting normalization
@@ -25,15 +33,15 @@ You may normalize formatting only when the meaning is unchanged and unambiguous:
 - Standardize volume units to `µL`, `mL`, `L`, using the micro sign `µ` consistently.
 - Standardize concentration units to `mM`, `µM`, `nM`, `% (w/v)`, etc., using the micro sign `µ` consistently.
 - Standardize time units to full words: `seconds`, `minutes`, `hours`.
-- Standarize chemical names to match the source but with consistent formatting (e.g. `Tris-HCl` instead of `Tris HCl`).
+- Standardize chemical names to match the source but with consistent formatting (e.g. `Tris-HCl` instead of `Tris HCl`).
 - Standardize pH formatting to `pH 7.4`.
 - Standardize chemical formulas with HTML subscripts, for example H2O to H<sub>2</sub>O. Similarly for other chemical formulas (e.g. MgCl2 to MgCl<sub>2</sub>).
 - Do not use Unicode subscript characters such as `₂`.
-- Standardize `RNAseq` or `RNA-Seq` to `RNA-seq`.Same for `ChIP-seq`, `ATAC-seq`, etc.
+- Standardize `RNAseq` or `RNA-Seq` to `RNA-seq`. Same for `ChIP-seq`, `ATAC-seq`, etc.
 - Normalize bullet formatting and markdown table formatting.
 - Normalize heading structure to match the repository template.
 - For reaction mixes and anything tabular, place them inside a table as in template.
-- Normalize markdown headings, bullets, and tables
+- Normalize markdown headings, bullets, and tables.
 - "Note" or "NOTE" or "Optional" or "Recommended" or "Warning" are normalized to start with `>` (example `> **Note**`) and are placed immediately after the step they refer to, or at the end of the protocol if they clearly refer to the whole protocol.
 - Remove empty columns from tables.
 - Synchronize `Contents` with actual headings in the protocol.
@@ -55,17 +63,42 @@ You may normalize formatting only when the meaning is unchanged and unambiguous:
 ## Output requirements
 When drafting a migrated protocol:
 - Use the template headings exactly.
-- Use the template headings in `README.md`
+- Use the template headings in `README.md`.
 - Keep all source content.
 - Add `CHECK:` markers for uncertainty.
 - Use `CHECK:` only for genuine unresolved uncertainty. If no uncertainty remains, do not mention `CHECK:` at all.
 - Add an `# Migration notes` section listing:
+  - formatting normalizations performed
+  - ambiguities and uncertainty flagged
+  - content placed in `## Unplaced content`
   - Imported protocol metadata from `source-metadata.yml` (only the non-blank lines).
-  - template metadata from `template-metadata.yml`
-  - ambiguous mappings
-  - normalized formatting changes
-  - content copied verbatim but not confidently placed
+  - Imported protocol metadata from `source-metadata.yml` if present.
+  - template metadata from `template-metadata.yml`.
+  - ambiguous mappings.
+  - normalized formatting changes.
+  - content copied verbatim but not confidently placed.
 - Keep ![Created with ulelab Protocol Template](https://img.shields.io/badge/created%20with-ulelab%20Protocol%20Template-blue) at the top of the file.
 - Delete the "Template repository: Click `Use this template` to create a new protocol repo..." note.
+- Remove the template instruction note.
 
+## Verification
+After drafting, verify the migration against the source:
 
+- compare the migrated `README.md` against `legacy/source.md`
+- compare any malformed, incomplete, or ambiguous passages against `legacy/source.txt`
+- compare the migrated `README.md` against the PDF in `legacy/` for tables, figures, layout-dependent content, and any remaining ambiguity
+- check that all protocol steps, notes, warnings, reagent names, quantities, temperatures, timings, and conditions are still present
+- check that no source content has been silently omitted, merged, or reordered without justification
+- check any tables, layout-dependent content, or ambiguous sections against the PDF in `legacy/`
+- leave `CHECK:` anywhere the mapping is uncertain rather than guessing
+
+Verification checklist:
+- `README.md` still matches the scientific content of `legacy/source.md`
+- any malformed, incomplete, or ambiguous passages were cross-checked against `legacy/source.txt`
+- no protocol steps or warnings were omitted
+- no values were invented or made more precise than in the source
+- tables and layout-dependent content were checked against the PDF in `legacy/`
+- any uncertain mappings are marked with `CHECK:`
+- any meaningful normalization choices are noted in `# Migration notes`
+
+Prefer preserving meaning over making the output prettier.
diff --git a/.github/workflows/pdf-to-markdown.yml b/.github/workflows/pdf-to-markdown.yml
new file mode 100644
index 0000000..cf04edb
--- /dev/null
+++ b/.github/workflows/pdf-to-markdown.yml
@@ -0,0 +1,101 @@
+name: pdf-to-markdown
+
+on:
+  push:
+    branches-ignore:
+      - main
+    paths:
+      - legacy/*.pdf
+      - legacy/*.PDF
+  workflow_dispatch:
+
+permissions:
+  contents: write
+
+jobs:
+  convert:
+    runs-on: ubuntu-latest
+    env:
+      LEGACY_DIR: legacy
+
+    steps:
+      - name: Check out repo
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+          ref: ${{ github.ref }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Check whether conversion should run
+        id: pdf_gate
+        run: |
+          set -euo pipefail
+
+          if [ ! -d "$LEGACY_DIR" ]; then
+            echo "should_convert=false" >> "$GITHUB_OUTPUT"
+            echo "Conversion skipped: $LEGACY_DIR/ does not exist."
+            exit 0
+          fi
+
+          mapfile -t pdf_files < <(find "$LEGACY_DIR" -maxdepth 1 -type f \( -iname '*.pdf' \) | sort)
+          pdf_count="${#pdf_files[@]}"
+
+          if [ "$pdf_count" -eq 0 ]; then
+            echo "should_convert=false" >> "$GITHUB_OUTPUT"
+            echo "Conversion skipped: no PDF files found in $LEGACY_DIR/."
+            exit 0
+          fi
+
+          if [ "$pdf_count" -gt 1 ]; then
+            printf 'Found %s PDF files in %s/:\n' "$pdf_count" "$LEGACY_DIR"
+            printf ' - %s\n' "${pdf_files[@]}"
+            echo "Expected exactly one PDF file in $LEGACY_DIR/."
+            exit 1
+          fi
+
+          echo "should_convert=true" >> "$GITHUB_OUTPUT"
+          echo "pdf_path=${pdf_files[0]}" >> "$GITHUB_OUTPUT"
+          echo "Using PDF: ${pdf_files[0]}"
+
+      - name: Install fast-mode dependencies
+        if: steps.pdf_gate.outputs.should_convert == 'true'
+        run: |
+          python -m pip install --upgrade pip
+          pip install pymupdf pymupdf4llm
+
+      - name: Convert PDF to legacy/source.md
+        if: steps.pdf_gate.outputs.should_convert == 'true'
+        run: |
+          set -euo pipefail
+          python scripts/pdf_to_md/pdf_to_md.py "$LEGACY_DIR" "$LEGACY_DIR/source.md" --no-progress
+
+      - name: Show generated files
+        if: steps.pdf_gate.outputs.should_convert == 'true'
+        run: |
+          find "$LEGACY_DIR" -maxdepth 3 \( -name "source.md" -o -path "$LEGACY_DIR/images/*" \) | sort
+
+      - name: Commit and push generated files
+        if: steps.pdf_gate.outputs.should_convert == 'true'
+        run: |
+          set -euo pipefail
+
+          branch_name="${GITHUB_REF_NAME}"
+
+          git config user.name "github-actions[bot]"
+          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
+
+          git add -A "$LEGACY_DIR"
+
+          if git diff --cached --quiet; then
+            echo "No changes to commit"
+            exit 0
+          fi
+
+          git commit -m "Prepare migration: extract markdown from legacy PDF"
+          git fetch origin "$branch_name"
+          git rebase "origin/$branch_name"
+          git push origin "HEAD:$branch_name"
diff --git a/.github/workflows/prepare_migration.yml b/.github/workflows/pdf-to-text.yml
similarity index 67%
rename from .github/workflows/prepare_migration.yml
rename to .github/workflows/pdf-to-text.yml
index 54c0ae6..71ec8d9 100644
--- a/.github/workflows/prepare_migration.yml
+++ b/.github/workflows/pdf-to-text.yml
@@ -1,4 +1,4 @@
-name: Prepare migration from PDF
+name: pdf-to-text
 
 on:
   workflow_dispatch:
@@ -18,6 +18,9 @@ jobs:
     steps:
       - name: Check out repo
         uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+          ref: ${{ github.ref }}
 
       - name: Set up Python
         uses: actions/setup-python@v5
@@ -40,8 +43,18 @@ jobs:
 
       - name: Commit extracted text
         run: |
+          set -euo pipefail
+
+          branch_name="${GITHUB_REF_NAME}"
+
           git config user.name "github-actions[bot]"
           git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
           git add legacy/source.txt
-          git diff --cached --quiet || git commit -m "Prepare migration: extract text from legacy PDF"
-          git push
+          if git diff --cached --quiet; then
+            echo "No changes to commit"
+            exit 0
+          fi
+          git commit -m "Prepare migration: extract text from legacy PDF"
+          git fetch origin "$branch_name"
+          git rebase "origin/$branch_name"
+          git push origin "HEAD:$branch_name"
diff --git a/.github/workflows/readme_to_pdf.yml b/.github/workflows/readme-to-pdf.yml
similarity index 81%
rename from .github/workflows/readme_to_pdf.yml
rename to .github/workflows/readme-to-pdf.yml
index 97bacb4..964112f 100644
--- a/.github/workflows/readme_to_pdf.yml
+++ b/.github/workflows/readme-to-pdf.yml
@@ -1,4 +1,4 @@
-name: Build README PDF
+name: README-to-pdf
 
 on:
   push:
@@ -42,14 +42,37 @@ jobs:
           set -euo pipefail
           cp README.md README._pdf.md
 
-      - name: Trim README preamble for PDF
+      - name: Drop leading YAML front matter for PDF
         run: |
           set -euo pipefail
           awk '
-            found || /^# / {
-              found = 1
+            BEGIN { in_front_matter = 0 }
+            NR == 1 && /^---[[:space:]]*$/ {
+              in_front_matter = 1
+              next
+            }
+            in_front_matter && (/^---[[:space:]]*$/ || /^\.\.\.[[:space:]]*$/) {
+              in_front_matter = 0
+              next
+            }
+            in_front_matter { next }
+            { print }
+          ' README._pdf.md > README._pdf.tmp && mv README._pdf.tmp README._pdf.md
+
+      - name: Remove top template preamble for PDF
+        run: |
+          set -euo pipefail
+          awk '
+            BEGIN { started = 0 }
+            !started {
+              if (/^\[!\[/) next
+              if (/^> Template repository:/) next
+              if (/^[[:space:]]*$/) next
+              started = 1
               print
+              next
             }
+            { print }
           ' README._pdf.md > README._pdf.tmp && mv README._pdf.tmp README._pdf.md
 
       - name: Exclude migration notes from PDF
@@ -107,6 +130,7 @@ jobs:
 
           pandoc README._pdf.md \
             -o "${repo_name}.pdf" \
+            --from=markdown-yaml_metadata_block \
             --pdf-engine=xelatex \
             -V colorlinks=true \
             -V linkcolor=blue \
diff --git a/.github/workflows/validate_protocol.yml b/.github/workflows/validate-protocol.yml
similarity index 98%
rename from .github/workflows/validate_protocol.yml
rename to .github/workflows/validate-protocol.yml
index 8af38c8..bd4ea95 100644
--- a/.github/workflows/validate_protocol.yml
+++ b/.github/workflows/validate-protocol.yml
@@ -1,4 +1,4 @@
-name: Validate README protocol
+name: validate-protocol-README
 
 on:
   pull_request:
diff --git a/THIRD_PARTY_NOTICES.md b/THIRD_PARTY_NOTICES.md
new file mode 100644
index 0000000..acde86c
--- /dev/null
+++ b/THIRD_PARTY_NOTICES.md
@@ -0,0 +1,22 @@
+# Third-Party Notices
+
+This repository is primarily licensed under GPL-3.0, but it also contains
+specific files adapted from third-party code under different terms.
+
+## `pdf_to_md` scripts
+
+Files:
+- `scripts/pdf_to_md/pdf_to_md.py`
+- `scripts/pdf_to_md/extractor.py`
+
+These files include code adapted from:
+- Repository: `aliceisjustplaying/claude-skill-pdf-to-markdown`
+- Source: <https://github.com/aliceisjustplaying/claude-skill-pdf-to-markdown>
+
+Upstream licensing note:
+- The upstream repository README states `MIT` as the license.
+- Source checked: <https://github.com/aliceisjustplaying/claude-skill-pdf-to-markdown>
+
+Attribution note:
+- This repository records the upstream source and preserves attribution for the
+  adapted files listed above.
\ No newline at end of file
diff --git a/docs/PROMPT.md b/docs/PROMPT.md
index 03f22a6..b991a6d 100644
--- a/docs/PROMPT.md
+++ b/docs/PROMPT.md
@@ -1,30 +1,109 @@
-Convert `legacy/source.txt` into `README.md`.
+Convert `legacy/source.md` into `README.md`.
 
 Use the existing `README.md` as the target template and structure.
 
 Also read and follow `.github/copilot-instructions.md`.
 Apply those instructions even if you are not GitHub Copilot.
 
-Use `legacy/source.txt` as the primary source.
-Also check the PDF file in the `legacy/` folder as the reference source, especially for tables, layout-dependent content, and anything unclear.
-
-Requirements:
-1. Preserve all protocol content.
-2. Do not change scientific meaning.
-3. Do not invent missing information.
-4. Keep exact reagent names, quantities, temperatures, timings, and conditions unless only formatting is being normalized.
-5. Normalize only safe formatting, such as:
-   - adding a space between numbers and units
-   - using `seconds`, `minutes`, `hours`
-   - using `µL`, `mL`, `L`
-   - using `37 °C` style temperature formatting
-6. Preserve the step order from the source.
-7. Do not delete repeated warnings or notes.
-8. If any text does not fit cleanly into the template, place it under `# Migration notes` or `## Unplaced content`.
-9. Mark uncertainty with `CHECK:` instead of guessing.
-10. After drafting, add a short summary in `# Migration notes` covering:
-   - formatting normalizations performed
-   - ambiguities and uncertainty flagged
-   - content placed in `## Unplaced content`
-
-Only edit `README.md`.
+## Primary rule
+Do not change protocol meaning.
+Use `legacy/source.md` as the primary source when rewriting `README.md`.
+Use `legacy/source.txt` only as a fallback when `legacy/source.md` looks malformed, incomplete, or unclear.
+Use the PDF file in `legacy/` as the final reference source of truth for tables, figures, layout-dependent content, and anything still ambiguous after checking the generated text sources.
+If `legacy/source.md` and `legacy/source.txt` disagree, prefer `legacy/source.md` for general structure and prose, but use the original PDF as the final tie-breaker.
+
+## Migration behavior
+When converting legacy protocol content into the repository template:
+
+- Preserve all protocol content.
+- Preserve all procedural content, warnings, notes, reagent names, quantities, timings, temperatures, and conditions, preserving their location.
+- Do not change scientific meaning.
+- Do not invent missing information.
+- Do not invent missing values or steps.
+- Do not delete any content from the source.
+- Do not silently summarize, compress, or merge steps.
+- Keep exact reagent names, quantities, temperatures, timings, and conditions unless only formatting is being normalized.
+- Preserve exact reagent and equipment names unless only formatting is changing.
+- Preserve the step order from the source unless the source clearly indicates otherwise.
+- Do not delete repeated warnings or notes.
+- If any text does not fit cleanly into the template, place it under `# Migration notes` or `## Unplaced content`.
+- Mark uncertainty with `CHECK:` instead of guessing.
+
+## Allowed formatting normalization
+Normalize formatting only when the meaning is unchanged and unambiguous:
+
+- add a space between numbers and units
+- use `seconds`, `minutes`, `hours`
+- use `µL`, `mL`, `L`
+- use `37 °C` style temperature formatting
+- standardize concentration units to `mM`, `µM`, `nM`, `% (w/v)`, etc., using the micro sign `µ` consistently
+- standardize pH formatting to `pH 7.4`
+- standardize chemical names to match the source but with consistent formatting, for example `Tris-HCl` instead of `Tris HCl`
+- standardize chemical formulas with HTML subscripts, for example H2O to H<sub>2</sub>O and MgCl2 to MgCl<sub>2</sub>
+- do not use Unicode subscript characters such as `₂`
+- standardize `RNAseq` or `RNA-Seq` to `RNA-seq`, and similarly for `ChIP-seq`, `ATAC-seq`, and related names
+- normalize bullets, headings, and markdown tables to match the repository template
+- use tables for reaction mixes and other tabular content
+- normalize note-like text to blockquote style, for example `> **Note**`
+- place note-like text immediately after the step it refers to, or at the end of the protocol if it clearly refers to the whole protocol
+- remove empty columns from tables
+- synchronize `Contents` with the actual headings in the protocol
+
+## Disallowed changes
+- do not infer omitted concentrations, times, temperatures, or volumes
+- do not infer values for missing quantities
+- do not try to calculate or infer values that are not explicitly stated
+- do not convert `overnight`, `RT`, `briefly`, `room temperature`, or similar vague language into precise values
+- do not replace vague language with precise values
+- do not reorder steps unless the source clearly numbers them in that order
+- do not remove duplicate-looking content unless it is truly identical and both copies are preserved in review notes
+- do not rewrite scientific wording for style if that risks changing meaning
+- do not fill in table cells with values that are missing from the source
+- do not replace one reagent name with another
+- do not remove repeated warnings or notes
+- do not omit unmapped text
+
+## Output requirements
+- Only edit `README.md`.
+- Use the template headings exactly.
+- Use the template headings in `README.md`.
+- Keep all source content.
+- Add `CHECK:` markers for uncertainty.
+- Use `CHECK:` only for genuine unresolved uncertainty. If no uncertainty remains, do not mention `CHECK:` at all.
+- Add an `# Migration notes` section.
+- After drafting, add a short summary in `# Migration notes` covering:
+  - formatting normalizations performed
+  - ambiguities and uncertainty flagged
+  - content placed in `## Unplaced content`
+- Include the following in `# Migration notes`:
+  - imported protocol metadata from `source-metadata.yml` if present
+  - imported protocol metadata from `source-metadata.yml` using only the non-blank lines
+  - template metadata from `template-metadata.yml`
+  - ambiguous mappings
+  - normalized formatting changes
+  - content copied verbatim but not confidently placed
+- Keep ![Created with ulelab Protocol Template](https://img.shields.io/badge/created%20with-ulelab%20Protocol%20Template-blue) at the top of the file.
+- Remove the template instruction note.
+- Delete the "Template repository: Click `Use this template` to create a new protocol repo..." note.
+
+## Verification
+After drafting, verify the migration against the source:
+
+- compare the migrated `README.md` against `legacy/source.md`
+- compare any malformed, incomplete, or ambiguous passages against `legacy/source.txt`
+- compare the migrated `README.md` against the PDF in `legacy/` for tables, figures, layout-dependent content, and any remaining ambiguity
+- check that all protocol steps, notes, warnings, reagent names, quantities, temperatures, timings, and conditions are still present
+- check that no source content has been silently omitted, merged, or reordered without justification
+- check any tables, layout-dependent content, or ambiguous sections against the PDF in `legacy/`
+- leave `CHECK:` anywhere the mapping is uncertain rather than guessing
+
+Verification checklist:
+- `README.md` still matches the scientific content of `legacy/source.md`
+- any malformed, incomplete, or ambiguous passages were cross-checked against `legacy/source.txt`
+- no protocol steps or warnings were omitted
+- no values were invented or made more precise than in the source
+- tables and layout-dependent content were checked against the PDF in `legacy/`
+- any uncertain mappings are marked with `CHECK:`
+- any meaningful normalization choices are noted in `# Migration notes`
+
+Prefer preserving meaning over making the output prettier.
diff --git a/docs/USING_THIS_TEMPLATE.md b/docs/USING_THIS_TEMPLATE.md
index 833102a..0f81379 100644
--- a/docs/USING_THIS_TEMPLATE.md
+++ b/docs/USING_THIS_TEMPLATE.md
@@ -101,11 +101,11 @@ This route can save time. It helps keep the template structure consistent, norma
 
 > **Recommended**: Also fill in the `source-metadata.yml`, even if not fully. Helps track source protocol provenance.
 4. Keep exactly one PDF in the `legacy` folder, otherwise the process will fail.
-5. Once you push a PDF change in the `legacy` folder to a non-`main` branch, the `Prepare migration from PDF` GitHub Action will run. This extracts the PDF text and writes `legacy/source.txt`. Check that this file was created before the next step.
+5. Once you push a PDF change in the `legacy` folder to a non-`main` branch, the migration GitHub Actions will run. `pdf-to-text` writes `legacy/source.txt`, and `pdf-to-markdown` writes `legacy/source.md`. Check that these files were created before the next step.
 6. Clone the repo locally, and switch to `import-protocol` branch. If you already have a local clone, run `git pull` to get the latest changes locally.
 > **Note**: Alternatively, you can complete steps 6-15 in GitHub Codespaces. On GitHub.com select the branch you want to work on, click **Code**, go to **Codespaces** tab and click **Create codespace on import-protocol**. This will open VS Code in a new browser tab, with all files loaded automatically. Note that this uses GitHub-hosted compute, and free usage is limited.
 7. Open the repo folder in a code editor and use GitHub Copilot or another LLM assistant. We recommend [VS Code](https://code.visualstudio.com/).
-8. Use the `protocol-migration` skill (or if you prefer, paste the prompt in `docs/PROMPT.md`) to ask GitHub Copilot or another LLM to rewrite `README.md`. The model will also follow the repository instructions in [`.github/copilot-instructions.md`](.github/copilot-instructions.md). This will edit the `README.md` file in-place, using `legacy/source.txt` and the legacy PDF as sources.
+8. Use the `protocol-migration` skill (or if you prefer, paste the prompt in `docs/PROMPT.md`) to ask GitHub Copilot or another LLM to rewrite `README.md`. The model will also follow the repository instructions in [`.github/copilot-instructions.md`](.github/copilot-instructions.md). This will edit the `README.md` file in-place, using `legacy/source.md` as the primary source, `legacy/source.txt` as a fallback when needed, and the legacy PDF as the final tie-breaker for tables, figures, and unclear layout-dependent content.
 > **Note**: Use the best model you have access to. We tested capability with the Copilot Free Usage plan, and it works reasonably well, but advanced models will likely work even better.
 
 **In VS Code**:
diff --git a/protocol-template.pdf b/protocol-template.pdf
index 808d2c0..dc56efa 100644
Binary files a/protocol-template.pdf and b/protocol-template.pdf differ
diff --git a/scripts/pdf_to_md/extractor.py b/scripts/pdf_to_md/extractor.py
new file mode 100755
index 0000000..1be2722
--- /dev/null
+++ b/scripts/pdf_to_md/extractor.py
@@ -0,0 +1,297 @@
+# Adapted in part from:
+# https://github.com/aliceisjustplaying/claude-skill-pdf-to-markdown
+# Original upstream license: MIT
+# See THIRD_PARTY_NOTICES.md for attribution details.
+
+"""
+PDF extraction with multiple backends:
+- Fast mode: PyMuPDF with multi-strategy table detection (good for simple tables)
+- Accurate mode: IBM Docling with TableFormer AI (better for complex/borderless tables)
+"""
+
+import os
+import sys
+from pathlib import Path
+
+# Suppress PyMuPDF's "Consider using pymupdf_layout" recommendation
+# This prints to stdout and pollutes --stdout output
+os.environ.setdefault("PYMUPDF_SUGGEST_LAYOUT_ANALYZER", "0")
+
+# Version for cache invalidation - increment when extraction logic changes
+# Format: major.minor.patch
+# 3.1.0: Page separators now use <!-- PAGE_BREAK --> instead of -----
+#        Image extraction includes nested XObjects (full=True)
+# 3.2.0: Fast mode now includes image references in markdown (write_images=True)
+#        Cache keys now include no_images flag to avoid contamination
+# 3.3.0: Image paths in cached markdown now use relative 'images/' prefix
+#        (fixes broken temp directory references in cached output)
+EXTRACTOR_VERSION = "3.3.0"
+
+
+def check_docling_models():
+    """Check if Docling models are downloaded."""
+    try:
+        from huggingface_hub import scan_cache_dir
+
+        cache_info = scan_cache_dir()
+        # Check for docling models in HF cache
+        docling_repos = [r for r in cache_info.repos if "docling" in r.repo_id.lower()]
+        return len(docling_repos) > 0
+    except Exception:
+        return False
+
+
+def extract_pdf_fast(
+    pdf_path: str, image_dir: str = None, show_progress: bool = False
+) -> str:
+    """
+    Fast PDF extraction using PyMuPDF with text-based table detection.
+
+    Uses 'text' table strategy which handles borderless/whitespace-based
+    tables better than the default 'lines_strict' for mixed document types.
+
+    Args:
+        pdf_path: Path to the PDF file
+        image_dir: Directory to save extracted images (None = skip images)
+        show_progress: Whether to show progress output
+
+    Returns:
+        Markdown string of the PDF content with image references if image_dir provided
+    """
+    import pymupdf4llm
+
+    if show_progress:
+        print("Extracting with PyMuPDF (fast mode)...", file=sys.stderr)
+
+    # Use text strategy which handles borderless tables better
+    # than the default lines_strict
+    markdown = pymupdf4llm.to_markdown(
+        pdf_path,
+        show_progress=show_progress,
+        table_strategy="text",  # Better for mixed table types
+        write_images=image_dir is not None,
+        image_path=image_dir,
+    )
+
+    # Replace pymupdf4llm's default page separator with explicit sentinel.
+    # This prevents false splits when documents contain literal "-----"
+    # (horizontal rules, ASCII tables, etc.)
+    markdown = markdown.replace("\n-----\n", "\n<!-- PAGE_BREAK -->\n")
+
+    return markdown
+
+
+def _save_docling_images(result, output_dir: Path) -> list:
+    """
+    Save images from a Docling conversion result to output directory.
+
+    Images are saved in iteration order, which matches the order of
+    <!-- image --> placeholders in the exported markdown.
+
+    Args:
+        result: Docling ConversionResult object
+        output_dir: Directory to save images to
+
+    Returns:
+        List of saved image paths (in iteration order)
+    """
+    output_dir.mkdir(parents=True, exist_ok=True)
+    image_paths = []
+
+    for i, (element, _level) in enumerate(result.document.iterate_items()):
+        if hasattr(element, "image") and element.image is not None:
+            img_path = output_dir / f"figure_{i:04d}.png"
+            element.image.pil_image.save(str(img_path))
+            image_paths.append(str(img_path))
+
+    return image_paths
+
+
+def extract_pdf_docling(
+    pdf_path: str,
+    output_dir: str = None,
+    images_scale: float = 4.0,
+    show_progress: bool = False,
+) -> tuple:
+    """
+    Extract PDF using Docling with accurate tables + high-res images.
+
+    Uses IBM's TableFormer AI model for ~93.6% table extraction accuracy.
+    Also extracts images at configurable resolution (default 4x for crisp images).
+
+    Args:
+        pdf_path: Path to the PDF file
+        output_dir: Directory to save extracted images (None = skip images)
+        images_scale: Image resolution multiplier (default: 4.0 for high-res)
+        show_progress: Whether to show progress output
+
+    Returns:
+        tuple: (markdown: str, image_paths: list[str])
+    """
+    from docling.document_converter import DocumentConverter, PdfFormatOption
+    from docling.datamodel.base_models import InputFormat
+    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
+    from docling_core.types.doc.base import ImageRefMode
+
+    # Check if this is first run (models need downloading)
+    if not check_docling_models():
+        print(
+            "First run: downloading Docling AI models (one-time setup, ~2-3 minutes)...",
+            file=sys.stderr,
+        )
+
+    if show_progress:
+        print(
+            f"Processing PDF with Docling (accurate mode, ~1 sec/page)...",
+            file=sys.stderr,
+        )
+
+    # Configure pipeline for accurate tables + image extraction
+    pipeline_options = PdfPipelineOptions(
+        do_table_structure=True,
+        generate_picture_images=output_dir is not None,
+        images_scale=images_scale,
+    )
+    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
+
+    converter = DocumentConverter(
+        format_options={
+            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
+        }
+    )
+
+    # Convert the document
+    result = converter.convert(pdf_path)
+
+    # Check for conversion errors
+    if hasattr(result, "errors") and result.errors:
+        for error in result.errors:
+            print(f"WARNING: Docling conversion error: {error}", file=sys.stderr)
+
+    # Check conversion status
+    from docling.datamodel.base_models import ConversionStatus
+
+    if hasattr(result, "status") and result.status != ConversionStatus.SUCCESS:
+        print(
+            f"WARNING: Docling conversion status: {result.status.name}",
+            file=sys.stderr,
+        )
+
+    # Save images to output directory (order matters for placeholder replacement)
+    image_paths = []
+    if output_dir:
+        image_paths = _save_docling_images(result, Path(output_dir))
+        if show_progress and image_paths:
+            print(
+                f"Extracted {len(image_paths)} images at {images_scale}x resolution",
+                file=sys.stderr,
+            )
+
+    # Export markdown with placeholders
+    md = result.document.export_to_markdown(image_mode=ImageRefMode.PLACEHOLDER)
+
+    # Replace placeholders with actual image references (order must match iteration order)
+    for img_path in image_paths:
+        md = md.replace("<!-- image -->", f"![Figure](images/{Path(img_path).name})", 1)
+
+    return md, image_paths
+
+
+def extract_pdf_to_markdown(
+    pdf_path: str, accurate: bool = False, show_progress: bool = False
+) -> str:
+    """
+    Extract PDF to markdown with configurable accuracy/speed trade-off.
+
+    Args:
+        pdf_path: Path to the PDF file
+        accurate: If True, use Docling AI (better for complex tables, slower).
+                  If False, use PyMuPDF (fast, good for simple tables).
+        show_progress: Whether to show progress output
+
+    Returns:
+        Markdown string of the PDF content
+    """
+    if accurate:
+        # Use Docling without image extraction
+        md, _ = extract_pdf_docling(
+            pdf_path, output_dir=None, show_progress=show_progress
+        )
+        return md
+    else:
+        return extract_pdf_fast(pdf_path, show_progress=show_progress)
+
+
+def get_page_count(pdf_path: str) -> int:
+    """Get the number of pages in a PDF using pymupdf (faster than Docling for this)."""
+    import pymupdf
+
+    doc = pymupdf.open(pdf_path)
+    count = len(doc)
+    doc.close()
+    return count
+
+
+def extract_images(pdf_path: str, output_dir: str, show_progress: bool = False) -> list:
+    """
+    Extract images from PDF to output directory.
+
+    Uses pymupdf for image extraction since Docling focuses on document structure.
+    Deduplicates by xref to avoid extracting the same image multiple times
+    (e.g., icons/logos reused across pages).
+
+    Returns:
+        List of extracted image paths
+    """
+    import pymupdf
+
+    output_path = Path(output_dir)
+    output_path.mkdir(parents=True, exist_ok=True)
+
+    doc = pymupdf.open(pdf_path)
+    extracted = []
+    image_count = 0
+    seen_xrefs = set()  # Track already-extracted images by xref
+
+    for page_num in range(len(doc)):
+        page = doc[page_num]
+        # full=True includes images nested inside form XObjects (common in
+        # documents exported from Word/PowerPoint)
+        images = page.get_images(full=True)
+
+        for img_index, img in enumerate(images):
+            try:
+                xref = img[0]
+
+                # Skip if we've already extracted this image
+                if xref in seen_xrefs:
+                    continue
+                seen_xrefs.add(xref)
+
+                pix = pymupdf.Pixmap(doc, xref)
+
+                # Convert CMYK to RGB if necessary
+                if pix.n - pix.alpha > 3:
+                    pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
+
+                image_count += 1
+                img_filename = f"image_{image_count:04d}.png"
+                img_path = output_path / img_filename
+                pix.save(str(img_path))
+                extracted.append(str(img_path))
+
+                pix = None
+            except Exception as e:
+                # Log instead of silently swallowing errors
+                print(
+                    f"WARNING: Failed to extract image {img_index} on page {page_num + 1}: {e}",
+                    file=sys.stderr,
+                )
+                continue
+
+    doc.close()
+
+    if show_progress and extracted:
+        print(f"Extracted {len(extracted)} unique images", file=sys.stderr)
+
+    return extracted
diff --git a/scripts/pdf_to_md/pdf_to_md.py b/scripts/pdf_to_md/pdf_to_md.py
new file mode 100755
index 0000000..bf58069
--- /dev/null
+++ b/scripts/pdf_to_md/pdf_to_md.py
@@ -0,0 +1,867 @@
+#!/usr/bin/env python3
+from __future__ import annotations
+
+# Adapted in part from:
+# https://github.com/aliceisjustplaying/claude-skill-pdf-to-markdown
+# Original upstream license: MIT
+# See THIRD_PARTY_NOTICES.md for attribution details.
+
+"""
+PDF to Markdown Converter for LLM Context
+
+Extracts entire PDF content as clean, structured markdown.
+Images are extracted to cache directory and copied to output location.
+
+Features:
+- High-accuracy table extraction using IBM Docling (TableFormer AI model)
+- Aggressive persistent caching (extracts once, reuses forever)
+- Cache only cleared on explicit request or source file change
+
+Usage:
+    python pdf_to_md.py <input.pdf> [output.md]
+    python pdf_to_md.py <input.pdf> --docling      # Accurate tables (slower)
+    python pdf_to_md.py <input.pdf> --clear-cache  # Re-extract
+    python pdf_to_md.py --clear-all-cache          # Clear entire cache
+
+Dependencies:
+    uv pip install pymupdf pymupdf4llm   # Fast mode
+    uv pip install docling docling-core  # Docling mode (optional)
+"""
+
+import argparse
+import sys
+import os
+import re
+import json
+import hashlib
+import shutil
+import tempfile
+from dataclasses import dataclass
+from pathlib import Path
+from datetime import datetime
+
+
+# =============================================================================
+# DATACLASSES
+# =============================================================================
+
+
+@dataclass
+class ExtractionConfig:
+    """Configuration for PDF extraction."""
+
+    pdf_path: str
+    docling: bool = False
+    images_scale: float = 4.0
+
+
+@dataclass
+class ExtractionResult:
+    """Result of PDF extraction or cache load."""
+
+    markdown: str
+    image_dir: Path | None
+    total_pages: int
+    from_cache: bool = False
+
+
+# Suppress PyMuPDF's "Consider using pymupdf_layout" recommendation
+os.environ.setdefault("PYMUPDF_SUGGEST_LAYOUT_ANALYZER", "0")
+
+# Default cache directory
+DEFAULT_CACHE_DIR = Path.home() / ".cache" / "pdf-to-markdown"
+
+
+def resolve_input_pdf(input_arg: str) -> Path:
+    """Resolve either a direct PDF path or a folder containing exactly one PDF."""
+    input_path = Path(input_arg)
+
+    if input_path.is_file():
+        return input_path
+
+    if input_path.is_dir():
+        pdf_files = sorted(
+            [p for p in input_path.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"]
+        )
+
+        if len(pdf_files) == 0:
+            raise FileNotFoundError(
+                f"No PDF files found in folder: {input_path}. Expected exactly one .pdf file."
+            )
+
+        if len(pdf_files) > 1:
+            names = ", ".join(p.name for p in pdf_files)
+            raise ValueError(
+                f"Multiple PDF files found in folder: {input_path}. "
+                f"Expected exactly one .pdf file, found {len(pdf_files)}: {names}"
+            )
+
+        return pdf_files[0]
+
+    raise FileNotFoundError(f"Input path does not exist: {input_path}")
+
+
+# =============================================================================
+# CACHE MANAGER
+# =============================================================================
+
+
+class CacheManager:
+    """Manages PDF extraction cache."""
+
+    def __init__(self, cache_dir: Path = None):
+        self.cache_dir = cache_dir or DEFAULT_CACHE_DIR
+
+    def get_key(self, config: ExtractionConfig) -> str:
+        """Generate cache key from file content + size + mode."""
+        p = Path(config.pdf_path).resolve()
+        stat = p.stat()
+        file_size = stat.st_size
+
+        chunk_size = 65536  # 64KB
+        hasher = hashlib.sha256()
+
+        with open(p, "rb") as f:
+            if file_size <= chunk_size * 2:
+                hasher.update(f.read())
+            else:
+                hasher.update(f.read(chunk_size))
+                f.seek(-chunk_size, 2)
+                hasher.update(f.read(chunk_size))
+
+        mode = f"docling_{config.images_scale}" if config.docling else "fast"
+        raw = f"{file_size}|{hasher.hexdigest()}|{mode}"
+        return hashlib.sha256(raw.encode()).hexdigest()[:16]
+
+    def _get_dir(self, cache_key: str) -> Path:
+        """Get cache directory for a given cache key."""
+        return self.cache_dir / cache_key
+
+    def is_valid(self, config: ExtractionConfig) -> tuple[bool, str]:
+        """Check if valid cache exists for this PDF."""
+        from extractor import EXTRACTOR_VERSION
+
+        try:
+            cache_key = self.get_key(config)
+        except (FileNotFoundError, OSError):
+            return False, ""
+
+        cache_dir = self._get_dir(cache_key)
+        metadata_file = cache_dir / "metadata.json"
+        output_file = cache_dir / "full_output.md"
+
+        if not metadata_file.exists() or not output_file.exists():
+            return False, cache_key
+
+        try:
+            with open(metadata_file) as f:
+                metadata = json.load(f)
+
+            p = Path(config.pdf_path).resolve()
+            stat = p.stat()
+
+            if (
+                metadata.get("source_size") != stat.st_size
+                or metadata.get("source_mtime") != stat.st_mtime
+            ):
+                return False, cache_key
+
+            if metadata.get("extractor_version") != EXTRACTOR_VERSION:
+                return False, cache_key
+
+            return True, cache_key
+        except (json.JSONDecodeError, KeyError, OSError):
+            return False, cache_key
+
+    def load(self, cache_key: str) -> ExtractionResult | None:
+        """Load markdown from cache."""
+        cache_dir = self._get_dir(cache_key)
+
+        try:
+            full_md = (cache_dir / "full_output.md").read_text(encoding="utf-8")
+            with open(cache_dir / "metadata.json") as f:
+                metadata = json.load(f)
+            total_pages = metadata.get("total_pages", 0)
+        except (FileNotFoundError, IOError, json.JSONDecodeError, OSError) as e:
+            print(
+                f"WARNING: Cache corrupted ({e.__class__.__name__}), regenerating...",
+                file=sys.stderr,
+            )
+            try:
+                if cache_dir.exists():
+                    shutil.rmtree(cache_dir)
+            except OSError:
+                pass
+            return None
+
+        # Check if markdown references images
+        has_image_refs = bool(re.search(r"!\[[^\]]*\]\([^)]+\)", full_md))
+
+        # Get cached images directory
+        cached_image_dir = cache_dir / "images"
+        has_images = cached_image_dir.exists() and any(cached_image_dir.iterdir())
+
+        # If markdown expects images but they're missing, invalidate cache
+        if has_image_refs and not has_images:
+            print(
+                "WARNING: Cache missing images, regenerating...",
+                file=sys.stderr,
+            )
+            try:
+                shutil.rmtree(cache_dir)
+            except OSError:
+                pass
+            return None
+
+        image_dir = cached_image_dir if has_images else None
+
+        return ExtractionResult(
+            markdown=full_md,
+            image_dir=image_dir,
+            total_pages=total_pages,
+            from_cache=True,
+        )
+
+    def _normalize_image_paths(self, markdown: str, source_image_dir: Path) -> str:
+        """Normalize image paths in markdown to use relative 'images/' prefix."""
+        if not source_image_dir:
+            return markdown
+
+        source_image_dir = Path(source_image_dir)
+
+        def normalize_ref(match):
+            alt_text = match.group(1)
+            filename_raw = match.group(2)
+            filename = Path(filename_raw).name
+            if (source_image_dir / filename).exists():
+                return f"![{alt_text}](images/{filename})"
+            return match.group(0)
+
+        pattern = r"!\[([^\]]*)\]\(([^)]+)\)"
+        return re.sub(pattern, normalize_ref, markdown)
+
+    def save(self, cache_key: str, result: ExtractionResult, config: ExtractionConfig):
+        """Save full extraction to cache using atomic writes."""
+        from extractor import EXTRACTOR_VERSION
+
+        cache_dir = self._get_dir(cache_key)
+        cache_dir.mkdir(parents=True, exist_ok=True)
+
+        markdown = result.markdown
+        if result.image_dir:
+            markdown = self._normalize_image_paths(markdown, result.image_dir)
+
+        p = Path(config.pdf_path).resolve()
+        stat = p.stat()
+        mode = f"docling_{config.images_scale}" if config.docling else "fast"
+
+        metadata = {
+            "source_path": str(p),
+            "source_mtime": stat.st_mtime,
+            "source_size": stat.st_size,
+            "cache_key": cache_key,
+            "cached_at": datetime.now().isoformat(),
+            "total_pages": result.total_pages,
+            "extractor_version": EXTRACTOR_VERSION,
+            "mode": mode,
+            "images_scale": config.images_scale if config.docling else None,
+        }
+
+        temp_md = None
+        temp_json = None
+        try:
+            with tempfile.NamedTemporaryFile(
+                mode="w",
+                dir=cache_dir,
+                suffix=".md.tmp",
+                delete=False,
+                encoding="utf-8",
+            ) as f:
+                f.write(markdown)
+                temp_md = f.name
+
+            with tempfile.NamedTemporaryFile(
+                mode="w", dir=cache_dir, suffix=".json.tmp", delete=False
+            ) as f:
+                json.dump(metadata, f, indent=2)
+                temp_json = f.name
+
+            os.replace(temp_md, cache_dir / "full_output.md")
+            temp_md = None
+            os.replace(temp_json, cache_dir / "metadata.json")
+            temp_json = None
+
+            if result.image_dir and Path(result.image_dir).exists():
+                temp_images = cache_dir / "images.tmp"
+                final_images = cache_dir / "images"
+
+                if temp_images.exists():
+                    shutil.rmtree(temp_images)
+
+                shutil.copytree(result.image_dir, temp_images)
+
+                if final_images.exists():
+                    shutil.rmtree(final_images)
+                os.rename(temp_images, final_images)
+
+        finally:
+            if temp_md and os.path.exists(temp_md):
+                os.unlink(temp_md)
+            if temp_json and os.path.exists(temp_json):
+                os.unlink(temp_json)
+
+    def clear(self, pdf_path: str = None) -> bool:
+        """Clear cache for specific PDF (both fast and docling modes) or entire cache."""
+        if pdf_path:
+            # Clear BOTH fast and docling caches for this PDF
+            cleared = False
+            for docling_mode in [False, True]:
+                try:
+                    config = ExtractionConfig(pdf_path=pdf_path, docling=docling_mode)
+                    cache_key = self.get_key(config)
+                    cache_dir = self._get_dir(cache_key)
+                    if cache_dir.exists():
+                        shutil.rmtree(cache_dir)
+                        cleared = True
+                except (FileNotFoundError, OSError):
+                    pass
+            return cleared
+        else:
+            if self.cache_dir.exists():
+                shutil.rmtree(self.cache_dir)
+                return True
+            return False
+
+    def get_stats(self) -> dict:
+        """Get statistics about the cache."""
+        if not self.cache_dir.exists():
+            return {"entries": 0, "total_size_mb": 0, "cache_dir": str(self.cache_dir)}
+
+        entries = 0
+        total_size = 0
+
+        for entry in self.cache_dir.iterdir():
+            if entry.is_dir():
+                entries += 1
+                for f in entry.rglob("*"):
+                    if f.is_file():
+                        total_size += f.stat().st_size
+
+        return {
+            "entries": entries,
+            "total_size_mb": round(total_size / (1024 * 1024), 2),
+            "cache_dir": str(self.cache_dir),
+        }
+
+
+# =============================================================================
+# IMAGE MANAGER
+# =============================================================================
+
+
+class ImageManager:
+    """Manages image extraction and cleanup."""
+
+    def __init__(self):
+        self._temp_dirs: list[Path] = []
+
+    def create_temp_dir(self, pdf_path: str) -> Path:
+        """Create tracked temp directory for image extraction."""
+        pdf_name = Path(pdf_path).stem
+        safe_name = re.sub(r"[^\w\-_]", "_", pdf_name)
+        temp_dir = Path(tempfile.mkdtemp(prefix=f"pdf_images_{safe_name}_"))
+        self._temp_dirs.append(temp_dir)
+        return temp_dir
+
+    def cleanup(self):
+        """Clean up all tracked temp directories."""
+        for temp_dir in self._temp_dirs:
+            if temp_dir.exists():
+                shutil.rmtree(temp_dir)
+        self._temp_dirs.clear()
+
+    def extract_references(self, markdown: str) -> set:
+        """Extract the set of image filenames referenced in markdown."""
+        pattern = r"!\[[^\]]*\]\(([^)]+)\)"
+        matches = re.findall(pattern, markdown)
+        return {Path(m).name for m in matches}
+
+    def get_info(self, image_dir: Path, referenced_only: set = None) -> list:
+        """Get information about extracted images."""
+        if not image_dir or not Path(image_dir).exists():
+            return []
+
+        image_dir = Path(image_dir)
+        images = []
+
+        for img_path in sorted(image_dir.glob("*")):
+            if img_path.suffix.lower() in (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp"):
+                if referenced_only is not None and img_path.name not in referenced_only:
+                    continue
+
+                try:
+                    size_bytes = img_path.stat().st_size
+                    size_kb = size_bytes / 1024
+
+                    try:
+                        import pymupdf
+                        pix = pymupdf.Pixmap(str(img_path))
+                        dimensions = f"{pix.width}x{pix.height}"
+                        pix = None
+                    except Exception:
+                        dimensions = "unknown"
+
+                    images.append({
+                        "filename": img_path.name,
+                        "path": str(img_path),
+                        "size_kb": round(size_kb, 1),
+                        "dimensions": dimensions,
+                    })
+                except Exception:
+                    pass
+
+        return images
+
+    def enhance_markdown(self, markdown: str, image_dir: Path) -> str:
+        """Rewrite image references to use relative paths (portable, Windows-safe)."""
+        if not image_dir:
+            return markdown
+
+        image_dir = Path(image_dir)
+
+        def replace_image_ref(match):
+            alt_text = match.group(1)
+            filename_raw = match.group(2)
+            filename = Path(filename_raw).name
+            full_path = image_dir / filename
+
+            # Use relative path for portability (POSIX format for Windows compatibility)
+            relative_path = Path("images") / filename
+
+            if full_path.exists():
+                try:
+                    size_kb = round(full_path.stat().st_size / 1024, 1)
+                    try:
+                        import pymupdf
+                        pix = pymupdf.Pixmap(str(full_path))
+                        dims = f"{pix.width}x{pix.height}"
+                        pix = None
+                    except Exception:
+                        dims = "?"
+
+                    return f"![{alt_text}]({relative_path.as_posix()})\n\n**[Image: {filename} ({dims}, {size_kb}KB)]**"
+                except Exception:
+                    return f"![{alt_text}]({relative_path.as_posix()})\n\n**[Image: {filename}]**"
+
+            return match.group(0)
+
+        pattern = r"!\[([^\]]*)\]\(([^)]+)\)"
+        return re.sub(pattern, replace_image_ref, markdown)
+
+    def create_summary(self, images: list) -> str:
+        """Create a summary section listing all extracted images."""
+        if not images:
+            return ""
+
+        lines = [
+            "",
+            "---",
+            "",
+            "## Extracted Images",
+            "",
+            "| # | File | Dimensions | Size |",
+            "|---|------|------------|------|",
+        ]
+
+        for i, img in enumerate(images, 1):
+            lines.append(
+                f"| {i} | {img['filename']} | {img['dimensions']} | {img['size_kb']}KB |"
+            )
+
+        lines.append("")
+        return "\n".join(lines)
+
+    def finalize_images(
+        self, temp_dir: Path, cache_dir: Path, output_path: Path, show_progress: bool = False
+    ) -> Path | None:
+        """Finalize image directory after extraction.
+
+        Copies images from cache to output location (next to the markdown file).
+        Cleans up temp directories.
+
+        Returns the final image directory (next to output) for reference.
+        """
+        if not temp_dir:
+            return None
+
+        temp_dir = Path(temp_dir)
+        output_path = Path(output_path)
+        output_images_dir = (
+            output_path.parent / "images" if output_path.suffix else output_path / "images"
+        )
+
+        # Clean up empty temp directories
+        if not temp_dir.exists() or not any(temp_dir.iterdir()):
+            if temp_dir.exists():
+                shutil.rmtree(temp_dir)
+            if temp_dir in self._temp_dirs:
+                self._temp_dirs.remove(temp_dir)
+            if output_images_dir.exists():
+                shutil.rmtree(output_images_dir)
+            return None
+
+        # Clean up temp directory (images are saved to cache)
+        if temp_dir.exists():
+            shutil.rmtree(temp_dir)
+            if temp_dir in self._temp_dirs:
+                self._temp_dirs.remove(temp_dir)
+
+        # Copy images from cache to output location
+        if cache_dir:
+            cached_image_dir = cache_dir / "images"
+            if cached_image_dir.exists() and any(cached_image_dir.iterdir()):
+                return self._copy_images_to_output(cached_image_dir, output_path, show_progress)
+
+        if output_images_dir.exists():
+            shutil.rmtree(output_images_dir)
+
+        return None
+
+    def _copy_images_to_output(
+        self, source_dir: Path, output_path: Path, show_progress: bool = False
+    ) -> Path | None:
+        """Copy images from cache to output location (next to markdown file)."""
+        output_path = Path(output_path)
+
+        # Determine output images directory (sibling to markdown file)
+        if output_path.suffix:  # It's a file path like "output.md"
+            output_images_dir = output_path.parent / "images"
+        else:  # It's a directory
+            output_images_dir = output_path / "images"
+
+        # Don't copy if already at output location
+        if output_images_dir.resolve() == Path(source_dir).resolve():
+            return output_images_dir
+
+        # Replace previously generated content so stale images do not linger.
+        if output_images_dir.exists():
+            shutil.rmtree(output_images_dir)
+
+        output_images_dir.mkdir(parents=True, exist_ok=True)
+        copied_count = 0
+        for img in source_dir.iterdir():
+            if img.is_file():
+                shutil.copy2(img, output_images_dir / img.name)
+                copied_count += 1
+
+        if show_progress and copied_count > 0:
+            print(f"Copied {copied_count} images to: {output_images_dir}", file=sys.stderr)
+
+        return output_images_dir
+
+
+# =============================================================================
+# PDF PROCESSING
+# =============================================================================
+
+
+def check_dependencies(docling_mode: bool = False):
+    """Check if required packages are installed."""
+    missing = []
+
+    try:
+        import pymupdf
+    except ImportError:
+        missing.append("pymupdf")
+
+    if docling_mode:
+        try:
+            import docling
+        except ImportError:
+            missing.append("docling")
+
+        try:
+            import docling_core
+        except ImportError:
+            missing.append("docling-core")
+
+        install_cmd = "uv pip install pymupdf docling docling-core"
+    else:
+        try:
+            import pymupdf4llm
+        except ImportError:
+            missing.append("pymupdf4llm")
+
+        install_cmd = "uv pip install pymupdf pymupdf4llm"
+
+    if missing:
+        print(f"ERROR: Missing dependencies: {', '.join(missing)}", file=sys.stderr)
+        print(f"Install with: {install_cmd}", file=sys.stderr)
+        return False
+
+    return True
+
+
+def convert_pdf(pdf_path, image_dir, show_progress=False, docling=False, images_scale=4.0):
+    """Convert PDF to markdown."""
+    if docling:
+        from extractor import extract_pdf_docling
+
+        markdown, _image_paths = extract_pdf_docling(
+            pdf_path,
+            output_dir=image_dir,
+            images_scale=images_scale,
+            show_progress=show_progress,
+        )
+        return markdown
+    else:
+        from extractor import extract_pdf_fast
+
+        markdown = extract_pdf_fast(
+            pdf_path,
+            image_dir=image_dir,
+            show_progress=show_progress,
+        )
+        return markdown
+
+
+def add_metadata_header(markdown, pdf_path, total_pages, image_dir=None, cached=False):
+    """Add metadata header to markdown output."""
+    filename = os.path.basename(pdf_path)
+
+    header_lines = [
+        "---",
+        f"source: {filename}",
+        f"total_pages: {total_pages}",
+        f"extracted_at: {datetime.now().isoformat()}",
+    ]
+
+    if cached:
+        header_lines.append("from_cache: true")
+
+    if image_dir:
+        # Use relative path for portability
+        header_lines.append("images_dir: images")
+
+    header_lines.extend(["---", "", ""])
+
+    return "\n".join(header_lines) + markdown
+
+
+# =============================================================================
+# MAIN
+# =============================================================================
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Convert PDF to Markdown for LLM context (with persistent caching)",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  python pdf_to_md.py document.pdf                    # Output to document.md (cached)
+  python pdf_to_md.py document.pdf output.md         # Custom output path
+  python pdf_to_md.py document.pdf --docling         # Accurate tables (slower)
+  python pdf_to_md.py document.pdf --clear-cache     # Clear cache and re-extract
+  python pdf_to_md.py --clear-all-cache              # Clear entire cache
+
+Caching:
+  PDFs are cached in ~/.cache/pdf-to-markdown/
+  Cache is keyed by file content hash + extraction mode.
+  Cache persists until explicitly cleared or source PDF changes.
+        """,
+    )
+
+    parser.add_argument(
+        "input", nargs="?", help="Input PDF file path or folder containing one PDF"
+    )
+    parser.add_argument(
+        "output", nargs="?", help="Output markdown file path (default: <resolved-input>.md)"
+    )
+    parser.add_argument(
+        "--docling",
+        "--accurate",
+        action="store_true",
+        dest="docling",
+        help="Use Docling AI for complex/borderless tables (slower, ~1 sec/page)",
+    )
+    parser.add_argument("--no-progress", action="store_true", help="Disable progress indicator")
+
+    # Cache options
+    parser.add_argument(
+        "--clear-cache",
+        action="store_true",
+        help="Clear cache for this PDF before processing",
+    )
+    parser.add_argument(
+        "--clear-all-cache",
+        action="store_true",
+        help="Clear entire cache directory and exit",
+    )
+    parser.add_argument("--cache-stats", action="store_true", help="Show cache statistics and exit")
+
+    args = parser.parse_args()
+
+    cache_mgr = CacheManager()
+
+    # Handle cache management commands
+    if args.clear_all_cache:
+        if cache_mgr.clear():
+            print(f"Cache cleared: {cache_mgr.cache_dir}", file=sys.stderr)
+        else:
+            print("Cache was already empty.", file=sys.stderr)
+        sys.exit(0)
+
+    if args.cache_stats:
+        stats = cache_mgr.get_stats()
+        print(f"Cache directory: {stats['cache_dir']}", file=sys.stderr)
+        print(f"Cached PDFs: {stats['entries']}", file=sys.stderr)
+        print(f"Total size: {stats['total_size_mb']} MB", file=sys.stderr)
+        sys.exit(0)
+
+    # Require input for all other operations
+    if not args.input:
+        parser.error("the following arguments are required: input")
+
+    # Handle --clear-cache
+    if args.clear_cache:
+        try:
+            resolved_input = resolve_input_pdf(args.input)
+        except (FileNotFoundError, ValueError) as exc:
+            print(f"ERROR: {exc}", file=sys.stderr)
+            sys.exit(1)
+
+        if cache_mgr.clear(str(resolved_input)):
+            print(f"Cache cleared for: {resolved_input}", file=sys.stderr)
+        else:
+            print(f"No cache found for: {resolved_input}", file=sys.stderr)
+
+    try:
+        input_pdf = resolve_input_pdf(args.input)
+    except (FileNotFoundError, ValueError) as exc:
+        print(f"ERROR: {exc}", file=sys.stderr)
+        sys.exit(1)
+
+    if input_pdf.suffix.lower() != ".pdf":
+        print(f"WARNING: File may not be a PDF: {input_pdf}", file=sys.stderr)
+
+    show_progress = sys.stderr.isatty() and not args.no_progress
+
+    # Check cache
+    config = ExtractionConfig(pdf_path=str(input_pdf), docling=args.docling)
+    valid, cache_key = cache_mgr.is_valid(config)
+
+    result = None
+    image_dir = None
+    cache_hit = False
+
+    if valid:
+        if show_progress:
+            mode = "docling" if args.docling else "fast"
+            print(f"Loading from cache ({mode} mode)...", file=sys.stderr)
+
+        cache_result = cache_mgr.load(cache_key)
+        if cache_result:
+            result = cache_result.markdown
+            total_pages = cache_result.total_pages
+            cache_hit = True
+
+            # Copy images from cache to output location
+            if cache_result.image_dir:
+                output_path = args.output or str(input_pdf.with_suffix(".md"))
+                img_mgr = ImageManager()
+                image_dir = img_mgr._copy_images_to_output(
+                    cache_result.image_dir, output_path, show_progress
+                )
+
+    # Extract if no cache hit
+    if not cache_hit:
+        if not check_dependencies(docling_mode=args.docling):
+            sys.exit(1)
+
+        from extractor import get_page_count
+
+        total_pages = get_page_count(str(input_pdf))
+
+        if not cache_key:
+            cache_key = cache_mgr.get_key(config)
+
+        img_mgr = ImageManager()
+        temp_image_dir = img_mgr.create_temp_dir(str(input_pdf))
+
+        try:
+            if show_progress:
+                if args.docling:
+                    print(
+                        f"Extracting {total_pages} pages with Docling AI (~1 sec/page)...",
+                        file=sys.stderr,
+                    )
+                else:
+                    print(
+                        f"Extracting {total_pages} pages with PyMuPDF (fast mode)...",
+                        file=sys.stderr,
+                    )
+
+            result = convert_pdf(
+                str(input_pdf),
+                image_dir=temp_image_dir,
+                show_progress=show_progress,
+                docling=args.docling,
+            )
+        except Exception as e:
+            img_mgr.cleanup()
+            print(f"ERROR: Conversion failed: {e}", file=sys.stderr)
+            sys.exit(1)
+
+        # Save to cache
+        extraction_result = ExtractionResult(
+            markdown=result,
+            image_dir=temp_image_dir,
+            total_pages=total_pages,
+        )
+        cache_mgr.save(cache_key, extraction_result, config)
+        if show_progress:
+            print(f"Cached: {cache_mgr._get_dir(cache_key)}", file=sys.stderr)
+
+        # Finalize images
+        output_path = args.output or str(input_pdf.with_suffix(".md"))
+        image_dir = img_mgr.finalize_images(
+            temp_dir=temp_image_dir,
+            cache_dir=cache_mgr._get_dir(cache_key),
+            output_path=output_path,
+            show_progress=show_progress,
+        )
+
+    # Format output
+    output = result
+    img_mgr_for_output = ImageManager()  # Fresh instance for output processing
+
+    referenced_images = img_mgr_for_output.extract_references(result) if result else set()
+
+    if image_dir:
+        output = img_mgr_for_output.enhance_markdown(output, image_dir)
+        images = img_mgr_for_output.get_info(image_dir, referenced_only=referenced_images)
+        if images:
+            output += img_mgr_for_output.create_summary(images)
+
+    output = add_metadata_header(
+        output, str(input_pdf), total_pages, image_dir, cached=cache_hit
+    )
+
+    # Write output
+    output_path = args.output or str(input_pdf.with_suffix(".md"))
+    with open(output_path, "w", encoding="utf-8") as f:
+        f.write(output)
+
+    msg = f"Converted {total_pages} pages to: {output_path}"
+    if cache_hit:
+        msg += " (from cache)"
+    if image_dir:
+        images = img_mgr_for_output.get_info(image_dir, referenced_only=referenced_images)
+        if images:
+            msg += f" ({len(images)} images)"
+    print(msg, file=sys.stderr)
+
+
+if __name__ == "__main__":
+    main()