fix(scraper): tighten gamelist metadata scraping#858

Merged
wizzomafizzo merged 17 commits into main from fix/scraper-companion-semantics
May 29, 2026

Conversation

@wizzomafizzo
Member

@wizzomafizzo wizzomafizzo commented May 28, 2026

Summary

  • update scraper subsystem docs to match current platform scraper architecture
  • tighten ZaparooCompanion handling with atomic scrape writes, sentinels, safer matching, and aggregate progress
  • infer content types for path-backed media metadata/images
  • import regular gamelist region/lang as per-media tags using media-row matching

Validation

  • task lint-fix
  • task test

Summary by CodeRabbit

  • Documentation

    • Major rewrite clarifying scraper vs scanner roles, gamelist.xml semantics, companion-entry behavior, tag/property rules, sentinel/retry semantics, lifecycle, and focused test commands.
  • New Features

    • Added scrape controls (status, cancel, resume) and expanded progress fields; media meta responses include AvailableImageTypes.
  • Improvements

    • Prefer thumbnail, infer content-type/extension from file paths, expanded image mapping/fallbacks, path-normalized matching, rewritten companion handling, and mark stale image properties instead of deleting.
  • Tests

    • Added/updated targeted tests for image inference, preference/fallbacks, scraper matching, companion processing, and scrape APIs.

@coderabbitai

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 09c13953-975a-4183-8be8-a9cbb507c4fa

📥 Commits

Reviewing files that changed from the base of the PR and between 7dd20bb and 06bdafc.

📒 Files selected for processing (1)
  • docs/scraper.md
✅ Files skipped from review due to trivial changes (1)
  • docs/scraper.md

📝 Walkthrough

Walkthrough

Adds MIME inference for path-backed media properties, extends media.meta with available image types, prefers thumbnail in image selection, changes stale-image handling to in-memory logging, refactors GamelistXML scraper for folded-path matching, ROM-level media tags, artwork fallback (including external roots), companion atomic writes, and updates tests/docs.

Changes

Content Type & Media API

| Layer | File(s) | Summary |
| --- | --- | --- |
| Content type helper and tests | pkg/api/methods/media_content.go, pkg/api/methods/media_content_test.go | Adds mediaContentType to derive MIME type from file extension when ContentType is empty and a unit test verifying trimming behavior. |
| Image selection, stale handling & debug logs | pkg/api/methods/media_image.go, pkg/api/methods/media_image_test.go | Adds thumbnail to default preference order, computes sorted available image types, derives ContentType/Extension via mediaContentType, logs stale image properties instead of deleting them, simplifies file-backed loader signature, and improves debug logging and tests for preference selection and fallback. |
| Media meta response & helpers | pkg/api/methods/media_meta.go, pkg/api/methods/media_meta_test.go, pkg/api/models/responses.go | Derive property ContentType/Extension via mediaContentType, add AvailableImageTypes to media/meta responses, and implement availableImageTypes helper with tests. |
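
The content-type inference summarized above can be pictured with a minimal sketch. It assumes a two-argument helper that takes the stored content type and the property text (a file path); the real signature and fallback behavior in `media_content.go` are not shown in this conversation, so treat the function below as hypothetical.

```go
package main

import (
	"mime"
	"path/filepath"
	"strings"
)

// mediaContentType is a hypothetical sketch, not the project's implementation.
// It returns the trimmed explicit content type when one is set; otherwise it
// guesses a MIME type from the file extension of text.
func mediaContentType(contentType, text string) string {
	if trimmed := strings.TrimSpace(contentType); trimmed != "" {
		return trimmed
	}
	ext := strings.ToLower(filepath.Ext(strings.TrimSpace(text)))
	if ext == "" {
		return ""
	}
	// TypeByExtension may append parameters such as "; charset=utf-8";
	// keep only the bare media type.
	mimeType := mime.TypeByExtension(ext)
	if i := strings.Index(mimeType, ";"); i >= 0 {
		mimeType = strings.TrimSpace(mimeType[:i])
	}
	return mimeType
}
```

Returning the trimmed value rather than the original string is the point of the review comment further down: checking `strings.TrimSpace(contentType)` but returning the untrimmed input would leak whitespace-padded MIME values.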

Gamelist XML Scraper Path-Based Refactor

| Layer | File(s) | Summary |
| --- | --- | --- |
| Scraper subsystem documentation | docs/scraper.md | Docs rewritten to describe enrichment-only scrapers, media.scrape lifecycle (status/cancel/resume), GamelistXML loop semantics, expanded field mappings, filesystem fallback rules, and ZaparooCompanion behavior. |
| Scraper structs & platform roots | pkg/database/scraper/gamelistxml/scraper.go | Extend GamelistRecord with MatchKind/MediaLevelWriteSafe, expand mediaDirCandidates, add loadRecordIndexes/companionStats, and initialize platform-specific external asset roots. |
| LoadRecords folded-path matching | pkg/database/scraper/gamelistxml/scraper.go | Change LoadRecords to accept loadRecordIndexes and match <game> entries by folded/normalized ROM paths (slug→title then path fallback), handle zip-as-dir matches, remove matched indexes to avoid duplicates, and emit match classification flags. |
| scrapeLoop integration & progress | pkg/database/scraper/gamelistxml/scraper.go | Process companions first (returns companionStats), build per-system indexes excluding already-scraped unless Force, call LoadRecords, offset progress by companion counters, conditionally omit ROM-level metadata when MediaLevelWriteSafe is false, and aggregate totals. |
| MapToDB ROM tags & artwork fallback | pkg/database/scraper/gamelistxml/scraper.go | Normalize Region/Lang into ROM-level MediaTags, compute artworkFallbackNames for ROM-relative lookup, thread fallback candidates into filesystem probing, and return MediaTags/MediaProps. |
| Path resolution & artwork probing helpers | pkg/database/scraper/gamelistxml/scraper.go | Add appendCSVTags, resolveESAssetPath/resolveESPathAbs/pathWithinRoot, and rework artwork probing to support external roots, folded-key matching, and zip-as-dir semantics. |
| Companion processing rewrite | pkg/database/scraper/gamelistxml/scraper.go | Detect companion games, convert parent MediaProps to TitleProps, preload media-by-title, resolve children by .slug/exact folded path/filename-suffix with ambiguity skipping, write one ApplyScrapeResult per child (sentinel + tags/props), and return companionStats. |
| Scraper test suite refactor | pkg/database/scraper/gamelistxml/scraper_test.go | Remove mediascanner dependency, add indexed helpers, expand LoadRecords path-matching tests (subfolders, zip-as-dir, slug/path fallback), move many artwork assertions to MediaProps, add external-asset-root tests, introduce sentinel-based matchers, and update scrapeLoop tests to mock GetMediaBySystemID/GetScrapedMediaIDs and assert writes/counters. |
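
To make "folded-path matching" and "remove matched indexes to avoid duplicates" concrete, here is a small sketch. The folding rule (forward slashes, cleaned, lowercased) and both function names are assumptions for illustration; the PR's actual `pathFoldKey` may normalize differently.

```go
package main

import (
	"path"
	"strings"
)

// pathFoldKey is a hypothetical normalizer: it converts backslashes to
// forward slashes, cleans the path, and lowercases it so that paths which
// differ only in case or separators share one key.
func pathFoldKey(p string) string {
	p = strings.ReplaceAll(p, `\`, "/")
	return strings.ToLower(path.Clean(p))
}

// matchAndConsume looks up a gamelist <path> value in an index keyed by
// folded path. On a hit the entry is deleted, so a later duplicate <game>
// element cannot match the same media row again.
func matchAndConsume(index map[string]int64, gamelistPath string) (int64, bool) {
	key := pathFoldKey(gamelistPath)
	id, ok := index[key]
	if ok {
		delete(index, key)
	}
	return id, ok
}
```

Consuming the entry on match is one simple way to get the "no duplicates" property described in the table; the trade-off is that the index is mutated during the scrape and must be rebuilt per run.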

Sequence Diagram (high-level scraper write flow):

sequenceDiagram
  participant API as media.scrape (API)
  participant Scraper as GamelistXMLScraper
  participant Loader as LoadRecords
  participant Mapper as MapToDB
  participant DB as database.ApplyScrapeResult

  API->>Scraper: start scrape request
  Scraper->>Loader: loadRecordIndexes (media/title indexes)
  Loader->>Mapper: Map matched game -> MapResult (TitleProps/MediaProps, MediaTags)
  Mapper->>DB: ApplyScrapeResult(scrape write with sentinel + tags/props)
  DB-->>Scraper: write result (ok/error)
  Scraper-->>API: progress/status updates

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

"🐰 I sniffed the filenames with care,
Slugs and paths now fold and pair,
Thumbnails first, types found by name,
Companions written in one gentle frame,
Docs sing the steps and tests proclaim."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.80% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(scraper): tighten gamelist metadata scraping' directly describes the main objective: to tighten gamelist metadata scraping with improved companion handling, content-type inference, and region/language tag imports. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/scraper-companion-semantics

Comment @coderabbitai help to get the list of available commands and usage tips.

@sentry

sentry Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 74.86819% with 143 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/database/scraper/gamelistxml/scraper.go | 70.92% | 113 Missing and 28 partials ⚠️ |
| pkg/api/methods/media_image.go | 95.12% | 2 Missing ⚠️ |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/database/scraper/gamelistxml/scraper_test.go (1)

1381-1389: ⚡ Quick win

Strengthen companionWriteMatcher to validate parent title metadata too.

The matcher currently ignores TitleTags and TitleProps, so atomic-write regressions on companion parent metadata would still pass.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/database/scraper/gamelistxml/scraper_test.go` around lines 1381 - 1389,
The matcher companionWriteMatcher currently only checks Sentinel and MediaTags;
modify it to also validate the parent title metadata on the captured
*database.ScrapeWrite*. Change the signature of companionWriteMatcher to accept
expected title metadata (e.g., expectedTitleTags []database.TagInfo and
expectedTitleProps map[string]string), and inside the mock.MatchedBy closure
verify that w.TitleTags equals expectedTitleTags and w.TitleProps equals
expectedTitleProps (use assert.ObjectsAreEqual or reflect.DeepEqual for the
comparisons). Ensure you still check w != nil and the existing Sentinel and
MediaTags checks so the matcher rejects writes missing the parent title
metadata.
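
As a rough illustration of the stricter matcher requested above, the predicate below compares title metadata as well as media tags. The `TagInfo` and `ScrapeWrite` types are simplified stand-ins defined here for the sketch (the real `database.ScrapeWrite` fields and sentinel type may differ), and the function is written as a plain predicate that could be passed to a `mock.MatchedBy` closure.

```go
package main

import "reflect"

// TagInfo and ScrapeWrite are simplified, hypothetical stand-ins for the
// database package types referenced in the review comment.
type TagInfo struct {
	Type string
	Tag  string
}

type ScrapeWrite struct {
	Sentinel   string
	MediaTags  []TagInfo
	TitleTags  []TagInfo
	TitleProps map[string]string
}

// companionWriteMatches reports whether a captured write carries a sentinel
// and exactly the expected media tags, title tags, and title props.
// Note that reflect.DeepEqual treats a nil slice and an empty slice as
// different, so expectations should use the same form the code produces.
func companionWriteMatches(
	w *ScrapeWrite,
	wantMediaTags, wantTitleTags []TagInfo,
	wantTitleProps map[string]string,
) bool {
	if w == nil || w.Sentinel == "" {
		return false
	}
	return reflect.DeepEqual(w.MediaTags, wantMediaTags) &&
		reflect.DeepEqual(w.TitleTags, wantTitleTags) &&
		reflect.DeepEqual(w.TitleProps, wantTitleProps)
}
```

With a predicate like this, a regression that drops the parent title metadata from the atomic write would fail the mock expectation instead of passing silently.
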
pkg/database/scraper/gamelistxml/scraper.go (1)

934-939: ⚡ Quick win

Move companionStats to the top-level type/const section.

This new type is declared mid-file after many functions. Please move it near other type/const declarations for guideline compliance.

As per coding guidelines, **/*.go: "Define Go types and consts near the top of the file, before functions and methods".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/database/scraper/gamelistxml/scraper.go` around lines 934 - 939, Move the
companionStats type declaration out of the middle of the file and place it with
the other top-level types/consts near the top of the file (before any functions
or methods); specifically locate the companionStats type and cut/paste it into
the file’s existing type/const block so it sits alongside other package-level
type definitions, ensure no references need adjusting, then run gofmt/govet to
verify formatting and imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/api/methods/media_content.go`:
- Around line 49-52: The function mediaContentType currently checks
strings.TrimSpace(contentType) but returns the original untrimmed contentType;
update mediaContentType to return the trimmed value (use
strings.TrimSpace(contentType)) when the check passes, and similarly ensure any
fallback return (the text parameter) is also strings.TrimSpace(text) so no
whitespace-padded MIME values are exposed.

In `@pkg/database/scraper/gamelistxml/scraper.go`:
- Around line 1223-1231: The companionChildTags function currently appends raw
Region and Lang values which can create inconsistent tags (e.g., "USA, EUR");
update companionChildTags to normalize these fields the same way MapToDB does:
split CSV values, trim whitespace, lowercase (or apply the same normalization
function MapToDB uses), and append each resulting token as a separate
database.TagInfo with Type string(tags.TagTypeRegion) or
string(tags.TagTypeLang) respectively; locate companionChildTags and reuse or
mirror the normalization logic used elsewhere (e.g., the MapToDB helper or
tokenization utility) so region/lang produce consistent individual tags.
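
The normalization asked for here (split CSV, trim, lowercase, one tag per token) might look like the sketch below. `splitCSVTags` and the local `TagInfo` type are hypothetical names for illustration; the PR's own `appendCSVTags` helper and `tags.TagTypeRegion`/`tags.TagTypeLang` constants are not reproduced.

```go
package main

import "strings"

// TagInfo is a simplified stand-in for database.TagInfo.
type TagInfo struct {
	Type string
	Tag  string
}

// splitCSVTags appends one tag per non-empty comma-separated token in csv,
// trimming whitespace and lowercasing each token first.
func splitCSVTags(dst []TagInfo, tagType, csv string) []TagInfo {
	for _, tok := range strings.Split(csv, ",") {
		tok = strings.ToLower(strings.TrimSpace(tok))
		if tok == "" {
			continue // skip empty tokens from stray commas
		}
		dst = append(dst, TagInfo{Type: tagType, Tag: tok})
	}
	return dst
}
```

For example, a gamelist value of `USA, EUR` would yield two region tags, `usa` and `eur`, rather than a single tag containing the raw string.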


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 024e2c7a-f4ed-475e-ab56-937c7ab997f2

📥 Commits

Reviewing files that changed from the base of the PR and between 7570989 and 64454a9.

📒 Files selected for processing (8)
  • docs/scraper.md
  • pkg/api/methods/media_content.go
  • pkg/api/methods/media_image.go
  • pkg/api/methods/media_image_test.go
  • pkg/api/methods/media_meta.go
  • pkg/api/methods/media_meta_test.go
  • pkg/database/scraper/gamelistxml/scraper.go
  • pkg/database/scraper/gamelistxml/scraper_test.go

Comment thread pkg/api/methods/media_content.go
Comment thread pkg/database/scraper/gamelistxml/scraper.go
@wizzomafizzo
Member Author

Added core-only regression coverage for frontend issue ZaparooProject/zaparoo-frontend#161:

  • Companion XML with screenshot, titlescreen, boxart2d, boxart3d, and logo now asserts all five title image properties are written.
  • media.image with no imageTypes now has explicit default-auto coverage for path-backed Companion artwork.
  • fallback coverage proves Core returns the next available default artwork when boxart is missing.
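
The preference-and-fallback behavior these tests cover can be sketched as follows. This is an illustration only: the property map shape, the function names, and the caller-supplied preference order are assumptions, and the default order Core actually uses is not shown in this conversation.

```go
package main

import "sort"

// pickImage walks the caller's preference order and returns the first image
// type that has a non-empty path. props maps image type to file path.
func pickImage(prefs []string, props map[string]string) (imageType, path string, ok bool) {
	for _, t := range prefs {
		if p, found := props[t]; found && p != "" {
			return t, p, true
		}
	}
	return "", "", false
}

// availableImageTypes lists every image type with a non-empty path, sorted
// so that API responses are stable across calls.
func availableImageTypes(props map[string]string) []string {
	types := make([]string, 0, len(props))
	for t, p := range props {
		if p != "" {
			types = append(types, t)
		}
	}
	sort.Strings(types)
	return types
}
```

Sorting matters because Go map iteration order is randomized; without it, a field such as `AvailableImageTypes` would change order between otherwise identical responses.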

Validation rerun:

  • go test ./pkg/database/scraper/...
  • go test ./pkg/api/methods/ -run 'MediaImage|Scrape'
  • task lint-fix
  • task test

@wizzomafizzo
Member Author

@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@wizzomafizzo
Member Author

Added a follow-up fix for #794: regular gamelist.xml matching still uses concrete media paths, but now handles MiSTer zip-as-dir indexing by falling back from an unmatched .zip entry to exactly one indexed child path under that archive. Ambiguous multi-child archives are skipped rather than falling back to slug matching.
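
A minimal sketch of the zip-as-dir rule described above, assuming indexed media paths look like `<archive>.zip/<child>` and that comparison is case-insensitive. The function name and the folding rule are hypothetical, not taken from the PR.

```go
package main

import "strings"

// zipAsDirChild resolves an unmatched gamelist .zip path to the single
// indexed path that lives "inside" that archive. It returns false when the
// path is not a .zip, when no child is indexed, or when more than one child
// is indexed (ambiguous archives are skipped rather than guessed).
func zipAsDirChild(zipPath string, indexedPaths []string) (string, bool) {
	folded := strings.ToLower(zipPath)
	if !strings.HasSuffix(folded, ".zip") {
		return "", false
	}
	prefix := folded + "/"
	match := ""
	for _, p := range indexedPaths {
		if !strings.HasPrefix(strings.ToLower(p), prefix) {
			continue
		}
		if match != "" {
			return "", false // second child found: ambiguous
		}
		match = p
	}
	return match, match != ""
}
```

Refusing to pick among several children keeps the metadata from one `<game>` entry from being attached to the wrong file in a multi-ROM archive.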

@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@wizzomafizzo
Member Author

Added a follow-up for #808. Coverage now distinguishes nested explicit artwork (already supported), nested filesystem fallback artwork, ignored nested gamelist.xml files, and thumbnail-only artwork. Fixes include mirrored subfolder media fallback (e.g. media/images/Japan/Game.png), box2dfront aliases, and media.image thumbnail support/default fallback.
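
The mirrored subfolder fallback can be illustrated by generating candidate artwork paths from a ROM's path relative to the system folder. The helper name, the media directory list, and the extension list are assumptions for the sketch; the scraper's real candidate set (`mediaDirCandidates`) is larger and is not reproduced here.

```go
package main

import (
	"path"
	"strings"
)

// mirroredArtworkCandidates builds candidate artwork paths that mirror the
// ROM's subfolder under each media directory, e.g. the ROM "Japan/Game.sfc"
// with media dir "media/images" yields "media/images/Japan/Game.png".
func mirroredArtworkCandidates(romRelPath string, mediaDirs, exts []string) []string {
	romRelPath = strings.ReplaceAll(romRelPath, `\`, "/")
	dir, file := path.Split(romRelPath)
	base := strings.TrimSuffix(file, path.Ext(file))
	var out []string
	for _, mediaDir := range mediaDirs {
		for _, ext := range exts {
			out = append(out, path.Join(mediaDir, dir, base+ext))
		}
	}
	return out
}
```

A caller would then probe these candidates on disk in order and use the first one that exists.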

@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/database/scraper/gamelistxml/scraper.go`:
- Around line 502-517: When building loadRecordIndexes in the loop over
allMedia, you must apply the same non-force scraped-media gate to
MediaByTitleDBID as you do for MediaByPathFold: check scrapedIDs for m.DBID and
only add to indexes.MediaByTitleDBID[m.MediaTitleDBID] when the media is not
already scraped (same condition used for indexes.MediaByPathFold using
pathFoldKey). Update the loop that constructs loadRecordIndexes (the indexes
variable handling MediaByPathFold and MediaByTitleDBID) so both maps skip
appending already-scraped media when scrapedIDs indicates it was scraped
(preserving behavior when the force-mode branch is active).
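
The fix requested above amounts to applying one scraped-media gate before populating both maps. The sketch below uses simplified, hypothetical types (`Media`, a `scrapedIDs` set, and a `force` flag); field names echo the comment, but the real structs and the folding function may differ.

```go
package main

import "strings"

// Media is a simplified stand-in for the indexed media row.
type Media struct {
	DBID           int64
	MediaTitleDBID int64
	Path           string
}

// loadRecordIndexes mirrors the two maps named in the review comment.
type loadRecordIndexes struct {
	MediaByPathFold  map[string]Media
	MediaByTitleDBID map[int64][]Media
}

// buildIndexes skips already-scraped media in BOTH maps unless force is set,
// so a title-based match cannot resurrect a row the path index excluded.
func buildIndexes(allMedia []Media, scrapedIDs map[int64]struct{}, force bool) loadRecordIndexes {
	idx := loadRecordIndexes{
		MediaByPathFold:  make(map[string]Media),
		MediaByTitleDBID: make(map[int64][]Media),
	}
	for _, m := range allMedia {
		if _, scraped := scrapedIDs[m.DBID]; scraped && !force {
			continue // one gate, applied before either map is touched
		}
		// Lowercasing stands in for the real path-folding function.
		idx.MediaByPathFold[strings.ToLower(m.Path)] = m
		idx.MediaByTitleDBID[m.MediaTitleDBID] = append(idx.MediaByTitleDBID[m.MediaTitleDBID], m)
	}
	return idx
}
```

Placing the check once at the top of the loop body, rather than separately per map, makes it hard for the two indexes to drift apart again.
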

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f2a41b7-900b-4136-9726-4fc041e492b8

📥 Commits

Reviewing files that changed from the base of the PR and between dde6bcc and 460176c.

📒 Files selected for processing (10)
  • docs/scraper.md
  • pkg/api/methods/media_content.go
  • pkg/api/methods/media_content_test.go
  • pkg/api/methods/media_image.go
  • pkg/api/methods/media_image_test.go
  • pkg/api/methods/media_meta.go
  • pkg/api/methods/media_meta_test.go
  • pkg/api/models/responses.go
  • pkg/database/scraper/gamelistxml/scraper.go
  • pkg/database/scraper/gamelistxml/scraper_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/api/methods/media_content_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/api/methods/media_content.go

Comment thread pkg/database/scraper/gamelistxml/scraper.go

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/scraper.md (1)

210-210: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Documentation inconsistency with PR changes regarding stale property handling.

Line 210 states "Stale file path properties are removed automatically," but the AI summary indicates this PR changes stale-image handling to "in-memory logging" with "no DB deletion." The documentation should reflect that stale properties are now logged rather than deleted.

📝 Suggested correction
-`media.image` accepts one media ref plus image type preferences such as `image`, `boxart`, `boxart3d`, `screenshot`, `wheel`, `titleshot`, `map`, `marquee`, and `fanart`. These resolve to canonical image property tags; for example `boxart` becomes `property:image-boxart` and `image` becomes `property:image-image`. Media-level properties are preferred over title-level properties for the same type. Stale file path properties are removed automatically and lookup falls through to the next available source.
+`media.image` accepts one media ref plus image type preferences such as `image`, `boxart`, `boxart3d`, `screenshot`, `wheel`, `titleshot`, `map`, `marquee`, and `fanart`. These resolve to canonical image property tags; for example `boxart` becomes `property:image-boxart` and `image` becomes `property:image-image`. Media-level properties are preferred over title-level properties for the same type. Stale file path properties are logged and lookup falls through to the next available source.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/scraper.md` at line 210, Update the docs text that describes stale-file
handling for media.image to reflect the PR change: instead of saying "Stale file
path properties are removed automatically" explain that stale image properties
are now recorded via in-memory logging (no DB deletions) and that lookup still
falls through to the next available source; reference the media.image field and
the canonical image property tags (e.g., property:image-boxart,
property:image-image) so readers know which properties this new logging behavior
applies to.
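
The "log and fall through" behavior described here can be sketched as a lookup that records stale candidates instead of deleting them. The function name and the injected `exists` check are assumptions made so the example stays self-contained; the real loader in `media_image.go` is not shown in this conversation.

```go
package main

// firstExistingImage returns the first candidate path for which exists
// reports true, plus the candidates that were tried before it and found
// missing. Stale entries are only reported back to the caller (which can
// log them); nothing is removed from storage.
func firstExistingImage(candidates []string, exists func(string) bool) (path string, stale []string) {
	for _, c := range candidates {
		if exists(c) {
			return c, stale
		}
		stale = append(stale, c)
	}
	return "", stale
}
```

In practice `exists` would wrap a filesystem check such as `os.Stat`, and candidates would be ordered with media-level properties ahead of title-level ones, matching the preference rule in the documentation paragraph.
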

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 33eb049d-32fe-45d8-b8e1-4c7fca358f3f

📥 Commits

Reviewing files that changed from the base of the PR and between 460176c and 7dd20bb.

📒 Files selected for processing (3)
  • docs/scraper.md
  • pkg/database/scraper/gamelistxml/scraper.go
  • pkg/database/scraper/gamelistxml/scraper_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/database/scraper/gamelistxml/scraper.go
  • pkg/database/scraper/gamelistxml/scraper_test.go

@wizzomafizzo wizzomafizzo merged commit 0ddd02e into main May 29, 2026
12 checks passed
@wizzomafizzo wizzomafizzo deleted the fix/scraper-companion-semantics branch May 29, 2026 02:29