fix(scraper): tighten gamelist metadata scraping by wizzomafizzo · Pull Request #858 · ZaparooProject/zaparoo-core

wizzomafizzo · 2026-05-28T11:23:09Z

Summary

update scraper subsystem docs to match current platform scraper architecture
tighten ZaparooCompanion handling with atomic scrape writes, sentinels, safer matching, and aggregate progress
infer content types for path-backed media metadata/images
import regular gamelist region/lang as per-media tags using media-row matching

Validation

task lint-fix
task test

Summary by CodeRabbit

Documentation
- Major rewrite clarifying scraper vs scanner roles, gamelist.xml semantics, companion-entry behavior, tag/property rules, sentinel/retry semantics, lifecycle, and focused test commands.
New Features
- Added scrape controls (status, cancel, resume) and expanded progress fields; media meta responses include AvailableImageTypes.
Improvements
- Prefer thumbnail, infer content-type/extension from file paths, expanded image mapping/fallbacks, path-normalized matching, rewritten companion handling, and mark stale image properties instead of deleting.
Tests
- Added/updated targeted tests for image inference, preference/fallbacks, scraper matching, companion processing, and scrape APIs.

coderabbitai · 2026-05-28T11:23:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 09c13953-975a-4183-8be8-a9cbb507c4fa

📥 Commits

Reviewing files that changed from the base of the PR and between 7dd20bb and 06bdafc.

📒 Files selected for processing (1)

docs/scraper.md

✅ Files skipped from review due to trivial changes (1)

docs/scraper.md

📝 Walkthrough


Adds MIME inference for path-backed media properties, extends media.meta with available image types, prefers thumbnail in image selection, and changes stale-image handling to in-memory logging instead of deletion. Refactors the GamelistXML scraper for folded-path matching, ROM-level media tags, artwork fallback (including external roots), and atomic companion writes. Updates tests and docs to match.

Changes

Content Type & Media API

| Layer / File(s) | Summary |
| --- | --- |
| Content type helper and tests `pkg/api/methods/media_content.go`, `pkg/api/methods/media_content_test.go` | Adds `mediaContentType` to derive MIME type from file extension when ContentType is empty and a unit test verifying trimming behavior. |
| Image selection, stale handling & debug logs `pkg/api/methods/media_image.go`, `pkg/api/methods/media_image_test.go` | Adds `thumbnail` to default preference order, computes sorted available image types, derives ContentType/Extension via `mediaContentType`, logs stale image properties instead of deleting them, simplifies file-backed loader signature, and improves debug logging and tests for preference selection and fallback. |
| Media meta response & helpers `pkg/api/methods/media_meta.go`, `pkg/api/methods/media_meta_test.go`, `pkg/api/models/responses.go` | Derive property ContentType/Extension via `mediaContentType`, add `AvailableImageTypes` to media/meta responses, and implement `availableImageTypes` helper with tests. |
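
To make the content-type row above concrete, here is a minimal sketch of how such a helper could behave. The two-argument signature and the use of `mime.TypeByExtension` are assumptions for illustration; the actual `mediaContentType` in `pkg/api/methods/media_content.go` may take different parameters and use its own extension table.

```go
package main

import (
	"mime"
	"path/filepath"
	"strings"
)

// mediaContentType returns the stored content type with surrounding
// whitespace removed. When nothing is stored, it guesses a MIME type
// from the file extension of path.
// NOTE: hypothetical signature, written for illustration only.
func mediaContentType(contentType, path string) string {
	if trimmed := strings.TrimSpace(contentType); trimmed != "" {
		return trimmed
	}
	ext := strings.ToLower(filepath.Ext(strings.TrimSpace(path)))
	if ext == "" {
		// No extension, so there is nothing to infer from.
		return ""
	}
	// For text types the standard library may append parameters such as
	// "; charset=utf-8"; common image types are returned bare.
	return mime.TypeByExtension(ext)
}
```

Returning the trimmed value, rather than only testing it, keeps whitespace-padded MIME strings from reaching API clients.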

Gamelist XML Scraper Path-Based Refactor

| Layer / File(s) | Summary |
| --- | --- |
| Scraper subsystem documentation `docs/scraper.md` | Docs rewritten to describe enrichment-only scrapers, `media.scrape` lifecycle (status/cancel/resume), GamelistXML loop semantics, expanded field mappings, filesystem fallback rules, and ZaparooCompanion behavior. |
| Scraper structs & platform roots `pkg/database/scraper/gamelistxml/scraper.go` | Extend `GamelistRecord` with `MatchKind`/`MediaLevelWriteSafe`, expand mediaDirCandidates, add `loadRecordIndexes`/`companionStats`, and initialize platform-specific external asset roots. |
| LoadRecords folded-path matching `pkg/database/scraper/gamelistxml/scraper.go` | Change `LoadRecords` to accept `loadRecordIndexes` and match `<game>` entries by folded/normalized ROM paths (slug→title then path fallback), handle zip-as-dir matches, remove matched indexes to avoid duplicates, and emit match classification flags. |
| scrapeLoop integration & progress `pkg/database/scraper/gamelistxml/scraper.go` | Process companions first (returns `companionStats`), build per-system indexes excluding already-scraped unless `Force`, call `LoadRecords`, offset progress by companion counters, conditionally omit ROM-level metadata when `MediaLevelWriteSafe` is false, and aggregate totals. |
| MapToDB ROM tags & artwork fallback `pkg/database/scraper/gamelistxml/scraper.go` | Normalize `Region`/`Lang` into ROM-level `MediaTags`, compute `artworkFallbackNames` for ROM-relative lookup, thread fallback candidates into filesystem probing, and return `MediaTags`/`MediaProps`. |
| Path resolution & artwork probing helpers `pkg/database/scraper/gamelistxml/scraper.go` | Add `appendCSVTags`, `resolveESAssetPath`/`resolveESPathAbs`/`pathWithinRoot`, and rework artwork probing to support external roots, folded-key matching, and zip-as-dir semantics. |
| Companion processing rewrite `pkg/database/scraper/gamelistxml/scraper.go` | Detect companion games, convert parent `MediaProps`→`TitleProps`, preload media-by-title, resolve children by `.slug`/exact folded path/filename-suffix with ambiguity skipping, write one `ApplyScrapeResult` per child (sentinel + tags/props), and return `companionStats`. |
| Scraper test suite refactor `pkg/database/scraper/gamelistxml/scraper_test.go` | Remove `mediascanner` dependency, add indexed helpers, expand `LoadRecords` path-matching tests (subfolders, zip-as-dir, slug/path fallback), move many artwork assertions to `MediaProps`, add external-asset-root tests, introduce sentinel-based matchers, and update `scrapeLoop` tests to mock `GetMediaBySystemID`/`GetScrapedMediaIDs` and assert writes/counters. |
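
The `Region`/`Lang` normalization mentioned in the table can be pictured with a small sketch. The name `appendCSVTags` comes from the walkthrough, but the signature, the `TagInfo` stand-in type, and the lowercase-and-deduplicate rules below are assumptions; the project may normalize differently.

```go
package main

import "strings"

// TagInfo is a hypothetical stand-in for database.TagInfo.
type TagInfo struct {
	Type string
	Tag  string
}

// appendCSVTags splits raw on commas, trims and lowercases each token,
// skips empty and repeated tokens, and appends one tag per token.
// NOTE: assumed signature and rules, for illustration only.
func appendCSVTags(dst []TagInfo, tagType, raw string) []TagInfo {
	seen := make(map[string]struct{})
	for _, token := range strings.Split(raw, ",") {
		value := strings.ToLower(strings.TrimSpace(token))
		if value == "" {
			continue
		}
		if _, dup := seen[value]; dup {
			continue
		}
		seen[value] = struct{}{}
		dst = append(dst, TagInfo{Type: tagType, Tag: value})
	}
	return dst
}
```

With this shape, a gamelist value such as `USA, EUR` yields two separate region tags instead of one combined tag.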

Sequence Diagram (high-level scraper write flow):

sequenceDiagram
  participant API as media.scrape (API)
  participant Scraper as GamelistXMLScraper
  participant Loader as LoadRecords
  participant Mapper as MapToDB
  participant DB as database.ApplyScrapeResult

  API->>Scraper: start scrape request
  Scraper->>Loader: loadRecordIndexes (media/title indexes)
  Loader->>Mapper: Map matched game -> MapResult (TitleProps/MediaProps, MediaTags)
  Mapper->>DB: ApplyScrapeResult(scrape write with sentinel + tags/props)
  DB-->>Scraper: write result (ok/error)
  Scraper-->>API: progress/status updates

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

ZaparooProject/zaparoo-core#776: Related changes to gamelistxml matching logic and path/filename indexing.
ZaparooProject/zaparoo-core#789: Overlapping scraper refactor affecting LoadRecords and MapToDB semantics.
ZaparooProject/zaparoo-core#793: Related to slug/title generation and mediascanner-driven name behavior.

"🐰 I sniffed the filenames with care,
Slugs and paths now fold and pair,
Thumbnails first, types found by name,
Companions written in one gentle frame,
Docs sing the steps and tests proclaim."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.80% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(scraper): tighten gamelist metadata scraping' directly describes the main objective: to tighten gamelist metadata scraping with improved companion handling, content-type inference, and region/language tag imports. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


sentry · 2026-05-28T11:30:10Z

Codecov Report

❌ Patch coverage is 74.86819% with 143 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/database/scraper/gamelistxml/scraper.go | 70.92% | 113 Missing and 28 partials ⚠️ |
| pkg/api/methods/media_image.go | 95.12% | 2 Missing ⚠️ |


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

pkg/database/scraper/gamelistxml/scraper_test.go (1)
1381-1389: ⚡ Quick win

Strengthen companionWriteMatcher to validate parent title metadata too.

The matcher currently ignores TitleTags and TitleProps, so atomic-write regressions on companion parent metadata would still pass.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/database/scraper/gamelistxml/scraper_test.go` around lines 1381 - 1389,
The matcher companionWriteMatcher currently only checks Sentinel and MediaTags;
modify it to also validate the parent title metadata on the captured
*database.ScrapeWrite*. Change the signature of companionWriteMatcher to accept
expected title metadata (e.g., expectedTitleTags []database.TagInfo and
expectedTitleProps map[string]string), and inside the mock.MatchedBy closure
verify that w.TitleTags equals expectedTitleTags and w.TitleProps equals
expectedTitleProps (use assert.ObjectsAreEqual or reflect.DeepEqual for the
comparisons). Ensure you still check w != nil and the existing Sentinel and
MediaTags checks so the matcher rejects writes missing the parent title
metadata.
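
A rough sketch of the stricter matcher this nitpick asks for follows. The `ScrapeWrite` and `TagInfo` types are simplified stand-ins for the `database` package types, and `Sentinel` is assumed to be a string; the predicate is returned as a plain function so the example stays free of test-framework imports.

```go
package main

import "reflect"

// TagInfo and ScrapeWrite are hypothetical stand-ins for the database
// package types; the field names follow the review comment, the field
// types are assumed.
type TagInfo struct{ Type, Tag string }

type ScrapeWrite struct {
	Sentinel   string
	MediaTags  []TagInfo
	TitleTags  []TagInfo
	TitleProps map[string]string
}

// companionWriteMatcher returns a predicate that accepts a write only
// when the sentinel, media tags, and parent title metadata all match.
// reflect.DeepEqual treats a nil slice or map as different from an
// empty one, so expectations should use the same form as the code
// under test.
func companionWriteMatcher(
	sentinel string,
	mediaTags, titleTags []TagInfo,
	titleProps map[string]string,
) func(*ScrapeWrite) bool {
	return func(w *ScrapeWrite) bool {
		return w != nil &&
			w.Sentinel == sentinel &&
			reflect.DeepEqual(w.MediaTags, mediaTags) &&
			reflect.DeepEqual(w.TitleTags, titleTags) &&
			reflect.DeepEqual(w.TitleProps, titleProps)
	}
}
```

In a testify-based suite, a predicate of this shape could be wrapped with `mock.MatchedBy`, so that a write missing the parent title metadata no longer satisfies the mock expectation.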
pkg/database/scraper/gamelistxml/scraper.go (1)
934-939: ⚡ Quick win

Move companionStats to the top-level type/const section.

This new type is declared mid-file after many functions. Please move it near other type/const declarations for guideline compliance.

As per coding guidelines, **/*.go: "Define Go types and consts near the top of the file, before functions and methods".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/database/scraper/gamelistxml/scraper.go` around lines 934 - 939, Move the
companionStats type declaration out of the middle of the file and place it with
the other top-level types/consts near the top of the file (before any functions
or methods); specifically locate the companionStats type and cut/paste it into
the file’s existing type/const block so it sits alongside other package-level
type definitions, ensure no references need adjusting, then run gofmt/govet to
verify formatting and imports.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/api/methods/media_content.go`:
- Around line 49-52: The function mediaContentType currently checks
strings.TrimSpace(contentType) but returns the original untrimmed contentType;
update mediaContentType to return the trimmed value (use
strings.TrimSpace(contentType)) when the check passes, and similarly ensure any
fallback return (the text parameter) is also strings.TrimSpace(text) so no
whitespace-padded MIME values are exposed.

In `@pkg/database/scraper/gamelistxml/scraper.go`:
- Around line 1223-1231: The companionChildTags function currently appends raw
Region and Lang values which can create inconsistent tags (e.g., "USA, EUR");
update companionChildTags to normalize these fields the same way MapToDB does:
split CSV values, trim whitespace, lowercase (or apply the same normalization
function MapToDB uses), and append each resulting token as a separate
database.TagInfo with Type string(tags.TagTypeRegion) or
string(tags.TagTypeLang) respectively; locate companionChildTags and reuse or
mirror the normalization logic used elsewhere (e.g., the MapToDB helper or
tokenization utility) so region/lang produce consistent individual tags.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 024e2c7a-f4ed-475e-ab56-937c7ab997f2

📥 Commits

Reviewing files that changed from the base of the PR and between 7570989 and 64454a9.

📒 Files selected for processing (8)

docs/scraper.md
pkg/api/methods/media_content.go
pkg/api/methods/media_image.go
pkg/api/methods/media_image_test.go
pkg/api/methods/media_meta.go
pkg/api/methods/media_meta_test.go
pkg/database/scraper/gamelistxml/scraper.go
pkg/database/scraper/gamelistxml/scraper_test.go

wizzomafizzo · 2026-05-28T11:48:14Z

Added core-only regression coverage for frontend issue ZaparooProject/zaparoo-frontend#161:

Companion XML with screenshot, titlescreen, boxart2d, boxart3d, and logo now asserts all five title image properties are written.
media.image with no imageTypes now has explicit default-auto coverage for path-backed Companion artwork.
Fallback coverage proves Core returns the next available default artwork when boxart is missing.
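
The default and fallback behavior covered by these tests can be sketched roughly as below. The preference list is illustrative only (the real default order in `media_image.go` is not shown in this thread), and both function signatures are assumptions.

```go
package main

import "sort"

// defaultImageOrder is an illustrative preference list, NOT the
// project's actual default order.
var defaultImageOrder = []string{"boxart", "thumbnail", "screenshot"}

// selectImage walks the preferred types in order and returns the first
// one that has a non-empty path. An empty preference list falls back
// to defaultImageOrder.
func selectImage(preferred []string, byType map[string]string) (imageType, imagePath string, ok bool) {
	if len(preferred) == 0 {
		preferred = defaultImageOrder
	}
	for _, t := range preferred {
		if p := byType[t]; p != "" {
			return t, p, true
		}
	}
	return "", "", false
}

// availableImageTypes lists every type with a non-empty path in sorted
// order, so responses are stable across calls.
func availableImageTypes(byType map[string]string) []string {
	types := make([]string, 0, len(byType))
	for t, p := range byType {
		if p != "" {
			types = append(types, t)
		}
	}
	sort.Strings(types)
	return types
}
```

When `boxart` is absent, the loop simply moves on to the next type in the list, which is the fallback the regression tests assert.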

Validation rerun:

go test ./pkg/database/scraper/...
go test ./pkg/api/methods/ -run 'MediaImage|Scrape'
task lint-fix
task test

wizzomafizzo · 2026-05-28T11:49:26Z

zaparoo-mister_arm-4fd618ca-dev.zip

coderabbitai · 2026-05-28T11:52:25Z

Actionable comments posted: 0

coderabbitai · 2026-05-28T14:55:02Z

Actionable comments posted: 0

wizzomafizzo · 2026-05-28T14:55:05Z

Added a follow-up fix for #794: regular gamelist.xml matching still uses concrete media paths, but now handles MiSTer zip-as-dir indexing by falling back from an unmatched .zip entry to exactly one indexed child path under that archive. Ambiguous multi-child archives are skipped rather than falling back to slug matching.
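
The zip-as-dir fallback described above might look something like this sketch. The function names, the slice-based index, and the lowercase fold are assumptions for illustration; the real scraper uses its own indexes and folding rules.

```go
package main

import "strings"

// foldPath is a simplified stand-in for the scraper's path folding:
// forward slashes and lowercase only.
func foldPath(p string) string {
	return strings.ToLower(strings.ReplaceAll(p, "\\", "/"))
}

// resolveZipEntry first looks for an exact folded match. If none is
// found and the entry is a .zip, it accepts the archive only when
// exactly one indexed path lives under it. Ambiguous archives are
// skipped rather than guessed.
// NOTE: hypothetical helper, not the project's actual function.
func resolveZipEntry(entryPath string, indexed []string) (string, bool) {
	key := foldPath(entryPath)
	for _, p := range indexed {
		if foldPath(p) == key {
			return p, true
		}
	}
	if !strings.HasSuffix(key, ".zip") {
		return "", false
	}
	prefix := key + "/"
	var children []string
	for _, p := range indexed {
		if strings.HasPrefix(foldPath(p), prefix) {
			children = append(children, p)
		}
	}
	if len(children) != 1 {
		return "", false
	}
	return children[0], true
}
```

Requiring exactly one child keeps a multi-game archive from having one gamelist entry's metadata attached to an arbitrary file inside it.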

coderabbitai · 2026-05-28T16:09:34Z

Actionable comments posted: 0

wizzomafizzo · 2026-05-28T16:09:55Z

Added a follow-up for #808. Coverage now distinguishes nested explicit artwork (already supported), nested filesystem fallback artwork, ignored nested gamelist.xml files, and thumbnail-only artwork. Fixes include mirrored subfolder media fallback (e.g. media/images/Japan/Game.png), box2dfront aliases, and media.image thumbnail support/default fallback.
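
As an illustration of the mirrored subfolder fallback, the sketch below builds candidate artwork paths that keep the ROM's subfolder under the media directory. The helper name, parameters, and extension list are hypothetical; the actual probing order and directory candidates in the scraper are not shown here.

```go
package main

import (
	"path"
	"strings"
)

// mirroredArtworkCandidates maps a ROM path relative to the system
// folder (for example "Japan/Game.sfc") onto a media directory,
// preserving the subfolder and swapping the extension.
// NOTE: hypothetical helper using forward-slash paths only.
func mirroredArtworkCandidates(mediaDir, romRelPath string, exts []string) []string {
	base := strings.TrimSuffix(romRelPath, path.Ext(romRelPath))
	candidates := make([]string, 0, len(exts))
	for _, ext := range exts {
		candidates = append(candidates, path.Join(mediaDir, base+ext))
	}
	return candidates
}
```

For `media/images` and `Japan/Game.sfc`, the first candidate with a `.png` extension is `media/images/Japan/Game.png`, matching the example in the comment above.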

coderabbitai · 2026-05-28T22:20:57Z

Actionable comments posted: 0

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/database/scraper/gamelistxml/scraper.go`:
- Around line 502-517: When building loadRecordIndexes in the loop over
allMedia, you must apply the same non-force scraped-media gate to
MediaByTitleDBID as you do for MediaByPathFold: check scrapedIDs for m.DBID and
only add to indexes.MediaByTitleDBID[m.MediaTitleDBID] when the media is not
already scraped (same condition used for indexes.MediaByPathFold using
pathFoldKey). Update the loop that constructs loadRecordIndexes (the indexes
variable handling MediaByPathFold and MediaByTitleDBID) so both maps skip
appending already-scraped media when scrapedIDs indicates it was scraped
(preserving behavior when the force-mode branch is active).
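
One way to picture the requested fix is a single gate applied before either map is populated. The `Media` struct, the `buildIndexes` helper, and the simplified `pathFoldKey` below are stand-ins written for this sketch; only the map names are taken from the comment.

```go
package main

import "strings"

// Media is a hypothetical stand-in for the database media row.
type Media struct {
	DBID           int64
	MediaTitleDBID int64
	Path           string
}

// loadRecordIndexes mirrors the two maps named in the review comment.
type loadRecordIndexes struct {
	MediaByPathFold  map[string][]Media
	MediaByTitleDBID map[int64][]Media
}

// pathFoldKey is a simplified fold: forward slashes and lowercase.
func pathFoldKey(p string) string {
	return strings.ToLower(strings.ReplaceAll(p, "\\", "/"))
}

// buildIndexes skips already-scraped media in BOTH maps unless force
// is set, so title-based matching cannot reach rows that path-based
// matching would have excluded.
func buildIndexes(allMedia []Media, scrapedIDs map[int64]struct{}, force bool) loadRecordIndexes {
	idx := loadRecordIndexes{
		MediaByPathFold:  make(map[string][]Media),
		MediaByTitleDBID: make(map[int64][]Media),
	}
	for _, m := range allMedia {
		if _, scraped := scrapedIDs[m.DBID]; scraped && !force {
			continue
		}
		key := pathFoldKey(m.Path)
		idx.MediaByPathFold[key] = append(idx.MediaByPathFold[key], m)
		idx.MediaByTitleDBID[m.MediaTitleDBID] = append(idx.MediaByTitleDBID[m.MediaTitleDBID], m)
	}
	return idx
}
```

Placing the check once at the top of the loop makes it harder for the two indexes to drift apart again.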


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f2a41b7-900b-4136-9726-4fc041e492b8

📥 Commits

Reviewing files that changed from the base of the PR and between dde6bcc and 460176c.

📒 Files selected for processing (10)

docs/scraper.md
pkg/api/methods/media_content.go
pkg/api/methods/media_content_test.go
pkg/api/methods/media_image.go
pkg/api/methods/media_image_test.go
pkg/api/methods/media_meta.go
pkg/api/methods/media_meta_test.go
pkg/api/models/responses.go
pkg/database/scraper/gamelistxml/scraper.go
pkg/database/scraper/gamelistxml/scraper_test.go

✅ Files skipped from review due to trivial changes (1)

pkg/api/methods/media_content_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/api/methods/media_content.go

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/scraper.md (1)

210-210: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Documentation inconsistency with PR changes regarding stale property handling.

Line 210 states "Stale file path properties are removed automatically," but the AI summary indicates this PR changes stale-image handling to "in-memory logging" with "no DB deletion." The documentation should reflect that stale properties are now logged rather than deleted.

📝 Suggested correction

-`media.image` accepts one media ref plus image type preferences such as `image`, `boxart`, `boxart3d`, `screenshot`, `wheel`, `titleshot`, `map`, `marquee`, and `fanart`. These resolve to canonical image property tags; for example `boxart` becomes `property:image-boxart` and `image` becomes `property:image-image`. Media-level properties are preferred over title-level properties for the same type. Stale file path properties are removed automatically and lookup falls through to the next available source.
+`media.image` accepts one media ref plus image type preferences such as `image`, `boxart`, `boxart3d`, `screenshot`, `wheel`, `titleshot`, `map`, `marquee`, and `fanart`. These resolve to canonical image property tags; for example `boxart` becomes `property:image-boxart` and `image` becomes `property:image-image`. Media-level properties are preferred over title-level properties for the same type. Stale file path properties are logged and lookup falls through to the next available source.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/scraper.md` at line 210, Update the docs text that describes stale-file
handling for media.image to reflect the PR change: instead of saying "Stale file
path properties are removed automatically" explain that stale image properties
are now recorded via in-memory logging (no DB deletions) and that lookup still
falls through to the next available source; reference the media.image field and
the canonical image property tags (e.g., property:image-boxart,
property:image-image) so readers know which properties this new logging behavior
applies to.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 33eb049d-32fe-45d8-b8e1-4c7fca358f3f

📥 Commits

Reviewing files that changed from the base of the PR and between 460176c and 7dd20bb.

📒 Files selected for processing (3)

docs/scraper.md
pkg/database/scraper/gamelistxml/scraper.go
pkg/database/scraper/gamelistxml/scraper_test.go

🚧 Files skipped from review as they are similar to previous changes (2)

pkg/database/scraper/gamelistxml/scraper.go
pkg/database/scraper/gamelistxml/scraper_test.go

wizzomafizzo added 5 commits May 28, 2026 17:11

docs: update scraper subsystem documentation

66e5819

fix(scraper): tighten companion entry handling

81592bc

fix(api): infer content types for media paths

0983a5d

fix(scraper): import gamelist region and language tags

4edd549

fix(scraper): aggregate companion progress counts

64454a9

coderabbitai Bot reviewed May 28, 2026


Comment thread pkg/api/methods/media_content.go

Comment thread pkg/database/scraper/gamelistxml/scraper.go

test(scraper): cover companion default artwork selection

4fd618c

fix(scraper): match zip-as-dir gamelist entries

ee7515a

wizzomafizzo added 2 commits May 29, 2026 00:02

fix(scraper): show artwork for nested games

b961f52

style(scraper): wrap nested artwork fallback lines

dde6bcc

fix(scraper): normalize companion child tags

3c3a1cc

wizzomafizzo added 4 commits May 29, 2026 06:37

fix(scraper): repair companion image metadata

96a2a40

fix(api): preserve image metadata on read

5621e37

fix(scraper): restore hybrid gamelist matching

0cfaee2

fix(scraper): allow mister external assets

460176c

coderabbitai Bot reviewed May 29, 2026


Comment thread pkg/database/scraper/gamelistxml/scraper.go

wizzomafizzo added 2 commits May 29, 2026 09:32

fix(scraper): clarify gamelist scrape semantics

c4faed9

fix(scraper): satisfy gamelist lint checks

7dd20bb

coderabbitai Bot reviewed May 29, 2026


docs(scraper): clarify stale image lookup

06bdafc

wizzomafizzo merged commit 0ddd02e into main May 29, 2026
12 checks passed

wizzomafizzo deleted the fix/scraper-companion-semantics branch May 29, 2026 02:29

coderabbitai Bot mentioned this pull request May 29, 2026

feat(media): collapse singleton media containers #860

Merged
