diff --git a/.agents/skills/protocol-migration/SKILL.md b/.agents/skills/protocol-migration/SKILL.md index ea002fe..6c36050 100644 --- a/.agents/skills/protocol-migration/SKILL.md +++ b/.agents/skills/protocol-migration/SKILL.md @@ -13,6 +13,7 @@ Do not change protocol meaning. Use `legacy/source.md` as the primary source when rewriting `README.md`. Use `legacy/source.txt` only as a fallback when `legacy/source.md` looks malformed, incomplete, or unclear. Use the PDF file in `legacy/` as the final reference source of truth for tables, figures, layout-dependent content, and anything still ambiguous. +Treat image references in `legacy/source.md` and files in `legacy/images/` as extracted protocol content, not decorative artifacts, until they have been reviewed. If `legacy/source.md` and `legacy/source.txt` disagree, prefer `legacy/source.md` for general structure and prose, but use the original PDF as the final tie-breaker. ## Migration behavior @@ -32,6 +33,23 @@ When converting legacy protocol content into the repository template: - If any text does not fit cleanly into the template, place it under `# Migration notes` or `## Unplaced content`. - Mark uncertainty with `CHECK:` instead of guessing. +## Images, figures, and image-based tables +PDF-to-Markdown conversion may extract protocol-relevant content as images, especially tables, thermocycler programs, reagent layouts, flow diagrams, gel/example images, or figure panels. + +When migrating: + +- Scan `legacy/source.md` for image references such as `![](images/...)`, and inspect `legacy/images/` for extracted images. +- Keep images that contain protocol content needed to perform or interpret the protocol. +- Omit only clearly decorative images such as logos, icons, page chrome, ornamental separators, or duplicated images that add no protocol content. +- Prefer converting image-based tables into Markdown tables when all headers, rows, values, units, grouping, and notes are legible and unambiguous. 
+- Preserve row grouping and cycle counts from thermocycler/program tables. If Markdown cannot represent the original grouping cleanly, add a short note or use repeated values rather than losing meaning. +- Do not OCR or transcribe illegible values by guesswork. If any value, header, grouping, or placement is uncertain, keep the image and add `CHECK:`. +- If an image contains non-tabular protocol content that cannot be safely converted to text, include the image in `README.md`. +- Place converted tables or retained images at the same logical location as the source image, near the step or section they support. +- In `README.md`, image paths must be valid from the repository root. Change extracted paths from `images/` to `legacy/images/` unless the image has deliberately been moved. +- Use descriptive alt text, for example `![Thermocycler program](legacy/images/page-3-image-2.png)`, not empty alt text. +- Mention in `# Migration notes` which extracted images were converted to Markdown tables, which were retained as images, and which were omitted as non-protocol/decorative. + ## Allowed formatting normalization You may normalize formatting only when the meaning is unchanged and unambiguous: @@ -49,6 +67,7 @@ You may normalize formatting only when the meaning is unchanged and unambiguous: - Normalize bullet formatting and markdown table formatting. - Normalize heading structure to match the repository template. - For reaction mixes and anything tabular, place them inside a table as in template. +- For image-based tables, convert to Markdown tables wherever this is legible and unambiguous; otherwise retain the image at the correct protocol location. - Normalize markdown headings, bullets, and tables. - "Note" or "NOTE" or "NB" or "Optional" or "Recommended" or "Warning" are normalized to start with `>` (example `> **Note**`) and are placed immediately after the step they refer to, or at the end of the protocol if they clearly refer to the whole protocol. 
- Remove empty columns from tables. @@ -67,6 +86,7 @@ You may normalize formatting only when the meaning is unchanged and unambiguous: - Do not replace one reagent name with another. - Do not remove repeated warnings or notes. - Do not omit unmapped text. +- Do not omit protocol-relevant images, image-based tables, figures, diagrams, or visual instructions. ## Output requirements - edit `README.md` @@ -86,6 +106,9 @@ You may normalize formatting only when the meaning is unchanged and unambiguous: - template_version from `template-metadata.yml` - ambiguous mappings - normalized formatting changes + - extracted images converted to Markdown tables + - extracted images retained in `README.md` + - extracted images omitted because they were decorative or duplicated non-protocol content - content copied verbatim but not confidently placed - keep the template badge at the top - keep ![Created with ulelab Protocol Template](https://img.shields.io/badge/created%20with-ulelab%20Protocol%20Template-blue) at the top of the file @@ -97,7 +120,10 @@ After drafting, verify the migration against the source: - compare the migrated `README.md` against `legacy/source.md` - compare any malformed, incomplete, or ambiguous passages against `legacy/source.txt` - compare the migrated `README.md` against the PDF in `legacy/` for tables, figures, layout-dependent content, and any remaining ambiguity +- compare every protocol-relevant image reference in `legacy/source.md` and every relevant file in `legacy/images/` against the migrated `README.md` - check that all protocol steps, notes, warnings, reagent names, quantities, temperatures, timings, and conditions are still present +- check that image-based tables were converted accurately or retained as images with valid `legacy/images/...` paths +- check that no protocol-relevant figure, table image, diagram, gel/example image, or visual instruction was silently omitted - check that no source content has been silently omitted, merged, or 
reordered without justification - check any tables, layout-dependent content, or ambiguous sections against the PDF in `legacy/` - leave `CHECK:` anywhere the mapping is uncertain rather than guessing @@ -108,6 +134,8 @@ Verification checklist: - no protocol steps or warnings were omitted - no values were invented or made more precise than in the source - tables and layout-dependent content were checked against the PDF in `legacy/` +- protocol-relevant extracted images were either converted to Markdown tables or retained at the correct location +- retained image links resolve from `README.md` - any uncertain mappings are marked with `CHECK:` - any meaningful normalization choices are noted in `# Migration notes` diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 59337cd..bfa8fef 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -7,6 +7,7 @@ Do not change protocol meaning. Use `legacy/source.md` as the primary source when rewriting `README.md`. Use `legacy/source.txt` only as a fallback when `legacy/source.md` looks malformed, incomplete, or unclear. Use the PDF file in `legacy/` as the final reference source of truth for tables, figures, layout-dependent content, and anything still ambiguous. +Treat image references in `legacy/source.md` and files in `legacy/images/` as extracted protocol content, not decorative artifacts, until they have been reviewed. If `legacy/source.md` and `legacy/source.txt` disagree, prefer `legacy/source.md` for general structure and prose, but use the original PDF as the final tie-breaker. ## Migration behavior @@ -26,6 +27,22 @@ When converting legacy protocol content into the repository template: - Preserve the step order from the source unless the source clearly indicates otherwise. - Preserve exact reagent and equipment names unless only formatting is changing. 
+## Images, figures, and image-based tables +PDF-to-Markdown conversion may extract protocol-relevant content as images, especially tables, thermocycler programs, reagent layouts, flow diagrams, gel/example images, or figure panels. + +When migrating: +- Scan `legacy/source.md` for image references such as `![](images/...)`, and inspect `legacy/images/` for extracted images. +- Keep images that contain protocol content needed to perform or interpret the protocol. +- Omit only clearly decorative images such as logos, icons, page chrome, ornamental separators, or duplicated images that add no protocol content. +- Prefer converting image-based tables into Markdown tables when all headers, rows, values, units, grouping, and notes are legible and unambiguous. +- Preserve row grouping and cycle counts from thermocycler/program tables. If Markdown cannot represent the original grouping cleanly, add a short note or use repeated values rather than losing meaning. +- Do not OCR or transcribe illegible values by guesswork. If any value, header, grouping, or placement is uncertain, keep the image and add `CHECK:`. +- If an image contains non-tabular protocol content that cannot be safely converted to text, include the image in `README.md`. +- Place converted tables or retained images at the same logical location as the source image, near the step or section they support. +- In `README.md`, image paths must be valid from the repository root. Change extracted paths from `images/` to `legacy/images/` unless the image has deliberately been moved. +- Use descriptive alt text, for example `![Thermocycler program](legacy/images/page-3-image-2.png)`, not empty alt text. +- Mention in `# Migration notes` which extracted images were converted to Markdown tables, which were retained as images, and which were omitted as non-protocol/decorative. 
+ ## Allowed formatting normalization You may normalize formatting only when the meaning is unchanged and unambiguous: - Add a space between numbers and units. @@ -42,6 +59,7 @@ You may normalize formatting only when the meaning is unchanged and unambiguous: - Normalize bullet formatting and markdown table formatting. - Normalize heading structure to match the repository template. - For reaction mixes and anything tabular, place them inside a table as in template. +- For image-based tables, convert to Markdown tables wherever this is legible and unambiguous; otherwise retain the image at the correct protocol location. - Normalize markdown headings, bullets, and tables. - "Note" or "NOTE" or "NB" or "Optional" or "Recommended" or "Warning" are normalized to start with `>` (example `> **Note**`) and are placed immediately after the step they refer to, or at the end of the protocol if they clearly refer to the whole protocol. - Remove empty columns from tables. @@ -60,6 +78,7 @@ You may normalize formatting only when the meaning is unchanged and unambiguous: - Do not replace one reagent name with another. - Do not remove repeated warnings or notes. - Do not omit unmapped text. +- Do not omit protocol-relevant images, image-based tables, figures, diagrams, or visual instructions. ## Output requirements When drafting a migrated protocol: @@ -77,6 +96,9 @@ When drafting a migrated protocol: - template_version from `template-metadata.yml`. - ambiguous mappings. - normalized formatting changes. + - extracted images converted to Markdown tables. + - extracted images retained in `README.md`. + - extracted images omitted because they were decorative or duplicated non-protocol content. - content copied verbatim but not confidently placed. - Keep ![Created with ulelab Protocol Template](https://img.shields.io/badge/created%20with-ulelab%20Protocol%20Template-blue) at the top of the file. 
- Delete the "Template repository: Click `Use this template` to create a new protocol repo..." note. @@ -88,7 +110,10 @@ After drafting, verify the migration against the source: - compare the migrated `README.md` against `legacy/source.md` - compare any malformed, incomplete, or ambiguous passages against `legacy/source.txt` - compare the migrated `README.md` against the PDF in `legacy/` for tables, figures, layout-dependent content, and any remaining ambiguity +- compare every protocol-relevant image reference in `legacy/source.md` and every relevant file in `legacy/images/` against the migrated `README.md` - check that all protocol steps, notes, warnings, reagent names, quantities, temperatures, timings, and conditions are still present +- check that image-based tables were converted accurately or retained as images with valid `legacy/images/...` paths +- check that no protocol-relevant figure, table image, diagram, gel/example image, or visual instruction was silently omitted - check that no source content has been silently omitted, merged, or reordered without justification - check any tables, layout-dependent content, or ambiguous sections against the PDF in `legacy/` - leave `CHECK:` anywhere the mapping is uncertain rather than guessing @@ -99,6 +124,8 @@ Verification checklist: - no protocol steps or warnings were omitted - no values were invented or made more precise than in the source - tables and layout-dependent content were checked against the PDF in `legacy/` +- protocol-relevant extracted images were either converted to Markdown tables or retained at the correct location +- retained image links resolve from `README.md` - any uncertain mappings are marked with `CHECK:` - any meaningful normalization choices are noted in `# Migration notes` diff --git a/.github/workflows/fix-protocol-style.yml b/.github/workflows/fix-protocol-style.yml new file mode 100644 index 0000000..edceccc --- /dev/null +++ b/.github/workflows/fix-protocol-style.yml @@ -0,0 +1,60 
@@
+name: fix-protocol-style
+
+on:
+  workflow_dispatch:
+    inputs:
+      base_branch:
+        description: Branch to fix and open a PR against
+        required: true
+        default: main
+
+permissions:
+  contents: write
+  pull-requests: write
+
+concurrency:
+  group: fix-protocol-style-${{ github.event.inputs.base_branch || 'main' }}
+  cancel-in-progress: false
+
+jobs:
+  fix-style:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Check out target branch
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+          ref: ${{ github.event.inputs.base_branch }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Run unit tests
+        run: |
+          python -m unittest discover -s tests -p 'test_*.py'
+
+      - name: Apply style fixer
+        run: |
+          python scripts/fix_protocol_style.py README.md
+
+      - name: Verify README style after fixing
+        run: |
+          python scripts/validate_protocol_style.py README.md
+
+      - name: Create pull request with style fixes
+        uses: peter-evans/create-pull-request@v7
+        with:
+          base: ${{ github.event.inputs.base_branch }}
+          branch: automation/fix-protocol-style/${{ github.event.inputs.base_branch }}
+          delete-branch: true
+          commit-message: Normalize protocol README style
+          title: Normalize protocol README style
+          body: |
+            This PR was opened automatically by the style-fix workflow.
+
+            It applies deterministic README style fixes from `scripts/fix_protocol_style.py` and re-validates the result with `scripts/validate_protocol_style.py`.
+ add-paths: | + README.md diff --git a/.github/workflows/validate-protocol.yml b/.github/workflows/validate-protocol.yml index 1115451..467f2f1 100644 --- a/.github/workflows/validate-protocol.yml +++ b/.github/workflows/validate-protocol.yml @@ -6,23 +6,29 @@ on: - main paths: - README.md + - scripts/fix_protocol_style.py - scripts/validate_protocol.py - scripts/validate_protocol_content.py - scripts/validate_protocol_style.py + - tests/test_fix_protocol_style.py - tests/test_validate_protocol_content.py - tests/test_validate_protocol_style.py - .github/workflows/validate-protocol.yml + - .github/workflows/fix-protocol-style.yml push: branches: - main paths: - README.md + - scripts/fix_protocol_style.py - scripts/validate_protocol.py - scripts/validate_protocol_content.py - scripts/validate_protocol_style.py + - tests/test_fix_protocol_style.py - tests/test_validate_protocol_content.py - tests/test_validate_protocol_style.py - .github/workflows/validate-protocol.yml + - .github/workflows/fix-protocol-style.yml workflow_dispatch: jobs: diff --git a/docs/PROMPT.md b/docs/PROMPT.md index 48b29d6..0af662c 100644 --- a/docs/PROMPT.md +++ b/docs/PROMPT.md @@ -10,6 +10,7 @@ Do not change protocol meaning. Use `legacy/source.md` as the primary source when rewriting `README.md`. Use `legacy/source.txt` only as a fallback when `legacy/source.md` looks malformed, incomplete, or unclear. Use the PDF file in `legacy/` as the final reference source of truth for tables, figures, layout-dependent content, and anything still ambiguous after checking the generated text sources. +Treat image references in `legacy/source.md` and files in `legacy/images/` as extracted protocol content, not decorative artifacts, until they have been reviewed. If `legacy/source.md` and `legacy/source.txt` disagree, prefer `legacy/source.md` for general structure and prose, but use the original PDF as the final tie-breaker. 
## Migration behavior @@ -29,6 +30,23 @@ When converting legacy protocol content into the repository template: - If any text does not fit cleanly into the template, place it under `# Migration notes` or `## Unplaced content`. - Mark uncertainty with `CHECK:` instead of guessing. +## Images, figures, and image-based tables +PDF-to-Markdown conversion may extract protocol-relevant content as images, especially tables, thermocycler programs, reagent layouts, flow diagrams, gel/example images, or figure panels. + +When migrating: + +- scan `legacy/source.md` for image references such as `![](images/...)`, and inspect `legacy/images/` for extracted images +- keep images that contain protocol content needed to perform or interpret the protocol +- omit only clearly decorative images such as logos, icons, page chrome, ornamental separators, or duplicated images that add no protocol content +- prefer converting image-based tables into Markdown tables when all headers, rows, values, units, grouping, and notes are legible and unambiguous +- preserve row grouping and cycle counts from thermocycler/program tables. If Markdown cannot represent the original grouping cleanly, add a short note or use repeated values rather than losing meaning +- do not OCR or transcribe illegible values by guesswork. If any value, header, grouping, or placement is uncertain, keep the image and add `CHECK:` +- if an image contains non-tabular protocol content that cannot be safely converted to text, include the image in `README.md` +- place converted tables or retained images at the same logical location as the source image, near the step or section they support +- in `README.md`, image paths must be valid from the repository root. 
Change extracted paths from `images/` to `legacy/images/` unless the image has deliberately been moved +- use descriptive alt text, for example `![Thermocycler program](legacy/images/page-3-image-2.png)`, not empty alt text +- mention in `# Migration notes` which extracted images were converted to Markdown tables, which were retained as images, and which were omitted as non-protocol/decorative + ## Allowed formatting normalization Normalize formatting only when the meaning is unchanged and unambiguous: @@ -45,6 +63,7 @@ Normalize formatting only when the meaning is unchanged and unambiguous: - Use numbered lists for procedural actions in sequence. For other non-procedural content, bullets are better. Note-like text such as Note, NB, Optional, Recommended, and Warning should use blockquote style such as `> **Note**`. - normalize bullets, headings, and markdown tables to match the repository template - use tables for reaction mixes and other tabular content +- convert image-based tables to Markdown tables wherever this is legible and unambiguous; otherwise retain the image at the correct protocol location - normalize note-like text such as Note, NB, Optional, Recommended, and Warning to blockquote style, for example `> **Note**` - place note-like text immediately after the step it refers to, or at the end of the protocol if it clearly refers to the whole protocol - remove empty columns from tables @@ -63,6 +82,7 @@ Normalize formatting only when the meaning is unchanged and unambiguous: - do not replace one reagent name with another - do not remove repeated warnings or notes - do not omit unmapped text +- do not omit protocol-relevant images, image-based tables, figures, diagrams, or visual instructions ## Output requirements - Only edit `README.md`. 
@@ -82,6 +102,9 @@ Normalize formatting only when the meaning is unchanged and unambiguous: - template_version from `template-metadata.yml` - ambiguous mappings - normalized formatting changes + - extracted images converted to Markdown tables + - extracted images retained in `README.md` + - extracted images omitted because they were decorative or duplicated non-protocol content - content copied verbatim but not confidently placed - Keep ![Created with ulelab Protocol Template](https://img.shields.io/badge/created%20with-ulelab%20Protocol%20Template-blue) at the top of the file. - Remove the template instruction note. @@ -93,7 +116,10 @@ After drafting, verify the migration against the source: - compare the migrated `README.md` against `legacy/source.md` - compare any malformed, incomplete, or ambiguous passages against `legacy/source.txt` - compare the migrated `README.md` against the PDF in `legacy/` for tables, figures, layout-dependent content, and any remaining ambiguity +- compare every protocol-relevant image reference in `legacy/source.md` and every relevant file in `legacy/images/` against the migrated `README.md` - check that all protocol steps, notes, warnings, reagent names, quantities, temperatures, timings, and conditions are still present +- check that image-based tables were converted accurately or retained as images with valid `legacy/images/...` paths +- check that no protocol-relevant figure, table image, diagram, gel/example image, or visual instruction was silently omitted - check that no source content has been silently omitted, merged, or reordered without justification - check any tables, layout-dependent content, or ambiguous sections against the PDF in `legacy/` - leave `CHECK:` anywhere the mapping is uncertain rather than guessing @@ -104,6 +130,8 @@ Verification checklist: - no protocol steps or warnings were omitted - no values were invented or made more precise than in the source - tables and layout-dependent content were checked 
against the PDF in `legacy/` +- protocol-relevant extracted images were either converted to Markdown tables or retained at the correct location +- retained image links resolve from `README.md` - any uncertain mappings are marked with `CHECK:` - any meaningful normalization choices are noted in `# Migration notes` diff --git a/docs/USING_THIS_TEMPLATE.md b/docs/USING_THIS_TEMPLATE.md index 4a9fa54..be1d9ca 100644 --- a/docs/USING_THIS_TEMPLATE.md +++ b/docs/USING_THIS_TEMPLATE.md @@ -101,18 +101,16 @@ This route can save time. It helps keep the template structure consistent, norma 3. Upload the legacy PDF to the `legacy` folder, then commit and push it. > **Important**: Please use a high-quality, well-structured protocol as the source. Only one PDF file per protocol is supported. -> **Warning**: Custom content in protcols (such as images) may represent a challenge for this route, you may need to add it manually. +> **Warning**: Custom content in protocols may represent a challenge for this route. Protocol-relevant extracted images, such as table images, figures, diagrams, or visual instructions, should either be converted to Markdown when legible and unambiguous, or retained in `README.md` at the correct location. > **Recommended**: Also fill in the `source-metadata.yml`, even if not fully. Helps track source protocol provenance. - -4. Keep exactly **one** PDF in the `legacy` folder, otherwise the process will fail. -5. Once you push a PDF change in the `legacy` folder to a non-`main` branch, the migration GitHub Actions will run. `pdf-to-text` writes `legacy/source.txt`, and `pdf-to-markdown` writes `legacy/source.md`. Check that these files were created before the next step. +4. Keep exactly one PDF in the `legacy` folder, otherwise the process will fail. +5. Once you push a PDF change in the `legacy` folder to a non-`main` branch, the migration GitHub Actions will run. 
`pdf-to-text` writes `legacy/source.txt`, and `pdf-to-markdown` writes `legacy/source.md` and may write extracted images to `legacy/images/`. Check that these files were created before the next step. 6. Clone the repo locally, and switch to `import-protocol` branch. If you already have a local clone, run `git pull` to get the latest changes locally. > **Note**: Alternatively, you can complete steps 6-15 in GitHub Codespaces. On GitHub.com select the branch you want to work on, click **Code**, go to **Codespaces** tab and click **Create codespace on import-protocol**. This will open VS Code in a new browser tab, with all files loaded automatically. Note that this uses GitHub-hosted compute, and free usage is limited. - -7. Open the repo folder in a code editor and use GitHub Copilot or another LLM assistant. We recommend [VS Code](https://code.visualstudio.com/) or similar code editors. -8. Use the `protocol-migration` skill (or if you prefer, paste the prompt in `docs/PROMPT.md`) to ask GitHub Copilot or another LLM to rewrite `README.md`. The model will also follow the repository instructions in [`.github/copilot-instructions.md`](.github/copilot-instructions.md). This will edit the `README.md` file in-place, using `legacy/source.md` as the primary source, `legacy/source.txt` as a fallback when needed, and the legacy PDF as the final tie-breaker for tables, figures, and unclear layout-dependent content. -> **Note**: Use the best model you have access to. We tested capability with the Copilot Free Usage plan, and it works reasonably well, but advanced models will likely work even better, especially with more difficult documents. +7. Open the repo folder in a code editor and use GitHub Copilot or another LLM assistant. We recommend [VS Code](https://code.visualstudio.com/). +8. Use the `protocol-migration` skill (or if you prefer, paste the prompt in `docs/PROMPT.md`) to ask GitHub Copilot or another LLM to rewrite `README.md`. 
The model will also follow the repository instructions in [`.github/copilot-instructions.md`](.github/copilot-instructions.md). This will edit the `README.md` file in-place, using `legacy/source.md` as the primary source, `legacy/source.txt` as a fallback when needed, extracted images in `legacy/images/` as protocol content to review, and the legacy PDF as the final tie-breaker for tables, figures, and unclear layout-dependent content. +> **Note**: Use the best model you have access to. We tested capability with the Copilot Free Usage plan, and it works reasonably well, but advanced models will likely work even better. **In VS Code**: - **Codex**: use `/skills` and select the `protocol-migration` skill, or enter `$protocol-migration` in the Codex chat input box. @@ -122,7 +120,7 @@ This route can save time. It helps keep the template structure consistent, norma 9. Review the changes. If most of them look reasonable, commit with a message like `migration by LLM`. 10. Verify that `README.md` is accurate by comparing it to the original PDF and fix mistakes. -11. Check the `Migration notes` section and every place marked with `CHECK:`. Resolve anything unclear, and once resolved, delete the `CHECK:` markers. +11. Check the `Migration notes` section and every place marked with `CHECK:`. Confirm that protocol-relevant extracted images were converted to Markdown tables where possible, or retained as images with valid `legacy/images/...` paths. 12. Make any changes necessary. Delete sections you do not need. 13. Check that no `TODO` text remains. 14. Follow the guidelines in [3. 
General guidelines for the protocol file (`README.md`)](#3-general-guidelines-for-the-protocol-file-readmemd)
diff --git a/scripts/fix_protocol_style.py b/scripts/fix_protocol_style.py
new file mode 100644
index 0000000..52b3d7c
--- /dev/null
+++ b/scripts/fix_protocol_style.py
@@ -0,0 +1,189 @@
+"""Apply deterministic style fixes to a protocol README."""
+
+from pathlib import Path
+import argparse
+import re
+from typing import Callable, Match
+
+NUMBER_RE = r"\d+(?:\.\d+)?"
+TEMPERATURE_RE = re.compile(
+    rf"\b(?P{NUMBER_RE})(?P\s*)(?P°?)(?P\s*)(?P[Cc])\b"
+)
+PH_RE = re.compile(r"\b(?P