You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Issue #162 stood up the Apple Music catalog refresh as a managed service. The full catalog at this point is ~84GB of parquet, ~31GB of indexed DuckDB, with 60M albums and 196M songs. Building the song-album index requires >8GB of RAM and a 16GB Render worker — a real cost just to make song-by-album lookups fast.
The vast majority of that catalog is not jazz. Apple's Feed API doesn't support genre-filtered downloads (you have to grab the whole thing), but we can filter at index-build time, before the data ever materializes into DuckDB tables.
Goal
Shrink the indexed catalog to roughly the jazz subset, so:
Index builds fit comfortably in 2-4GB worker memory
Refresh runs in minutes instead of hours
We don't need to provision a 16GB instance just for catalog rebuild
Disk savings are minimal (we still download the full 84GB parquet each refresh because Apple gives us the whole feed), but memory + time savings are large.
Approach
Two complementary filters in backend/integrations/apple_music/catalog_index.pybuild_index()CREATE TABLE clauses:
Genre tag filter. Apple's parquet has a genres column. Add WHERE list_has_any(genres, ARRAY[...]) with a curated list of jazz-relevant tags (Jazz, Bebop, Vocal Jazz, Big Band, Hard Bop, Cool Jazz, Fusion, etc.). Jazz is probably 1-3% of the catalog, so 60M albums → ~1M, 196M songs → ~3-5M.
Known-performer artist ID filter. We already have a performers table in Postgres with names and MusicBrainz IDs. Backfill an apple_artist_id column (one-time lookup), then include OR primaryArtists[1].id IN (SELECT apple_artist_id FROM performers). Catches jazz performers whose albums aren't tagged with a jazz genre (soundtracks, jams, mistagged albums).
Both controlled by env vars so the filter is configurable per deploy:
We can't know what genre tags exist without inspecting the catalog. Use the read-only REPL added in commit 3bd9563:
cd /opt/render/project/src/backend
python scripts/onetime_scripts/apple_catalog_repl.py
Then:
-- Inventory jazz-adjacent tagsSELECT g, COUNT(*) FROM (
SELECT unnest(genres) AS g FROM albums
) WHERE g ILIKE '%jazz%'OR g ILIKE '%bebop%'OR g ILIKE '%swing%'OR g ILIKE '%fusion%'GROUP BY g ORDER BY2DESCLIMIT50;
-- Cross-check: any well-known jazz albums without a jazz genre tag?SELECTa.name, a.artist_name, a.genresFROM albums a
WHERELOWER(a.artist_name) IN ('miles davis','john coltrane','bill evans','thelonious monk','charlie parker')
ORDER BY artist_name LIMIT50;
That gives us the jazz tag taxonomy + a sense of how often we'd miss something with genre-only filtering.
Implementation sketch
Confirm parquet schema has genres column (verify via REPL)
Compile curated jazz-tag list from REPL exploration
Add WHERE clauses to the CREATE TABLE albums and CREATE TABLE songs paths in build_index()
Add the two env vars + sensible defaults (probably both off by default to preserve current behavior; opt in by setting them on the worker)
Background
Issue #162 stood up the Apple Music catalog refresh as a managed service. The full catalog at this point is ~84GB of parquet, ~31GB of indexed DuckDB, with 60M albums and 196M songs. Building the song-album index requires >8GB of RAM and a 16GB Render worker — a real cost just to make song-by-album lookups fast.
The vast majority of that catalog is not jazz. Apple's Feed API doesn't support genre-filtered downloads (you have to grab the whole thing), but we can filter at index-build time, before the data ever materializes into DuckDB tables.
Goal
Shrink the indexed catalog to roughly the jazz subset, so:
Disk savings are minimal (we still download the full 84GB parquet each refresh because Apple gives us the whole feed), but memory + time savings are large.
Approach
Two complementary filters in
backend/integrations/apple_music/catalog_index.pybuild_index()CREATE TABLEclauses:Genre tag filter. Apple's parquet has a
genrescolumn. AddWHERE list_has_any(genres, ARRAY[...])with a curated list of jazz-relevant tags (Jazz, Bebop, Vocal Jazz, Big Band, Hard Bop, Cool Jazz, Fusion, etc.). Jazz is probably 1-3% of the catalog, so 60M albums → ~1M, 196M songs → ~3-5M.Known-performer artist ID filter. We already have a
performerstable in Postgres with names and MusicBrainz IDs. Backfill anapple_artist_idcolumn (one-time lookup), then includeOR primaryArtists[1].id IN (SELECT apple_artist_id FROM performers). Catches jazz performers whose albums aren't tagged with a jazz genre (soundtracks, jams, mistagged albums).Both controlled by env vars so the filter is configurable per deploy:
APPLE_CATALOG_FILTER_GENRES=Jazz,Bebop,Vocal Jazz,...(comma-separated)APPLE_CATALOG_FILTER_TO_KNOWN_ARTISTS=true|falseDiscovery step (chicken-and-egg)
We can't know what genre tags exist without inspecting the catalog. Use the read-only REPL added in commit 3bd9563:
cd /opt/render/project/src/backend python scripts/onetime_scripts/apple_catalog_repl.pyThen:
That gives us the jazz tag taxonomy + a sense of how often we'd miss something with genre-only filtering.
Implementation sketch
genrescolumn (verify via REPL)WHEREclauses to theCREATE TABLE albumsandCREATE TABLE songspaths inbuild_index()apple_artist_idcolumn + backfill to theperformerstable for filter Searching for artist or song returns errors during type ahead #2Out of scope
References
backend/research_worker/handlers/apple_catalog.pybackend/integrations/apple_music/catalog_index.pybackend/scripts/onetime_scripts/apple_catalog_repl.py(commit 3bd9563)