Background
Surfaced during investigation of #180. During a multi-hour MusicBrainz outage (503 Server Maintenance), every Apple Music match call hits the MB API for the release tracklist, retries 3× with exponential backoff (1s + 4s + 8s = ~13s of waits), and finally gives up — falling back to the local recording_releases presence-check path.
For an interactive admin diagnose this adds ~12s of dead time per click. For the background apple/match_song worker it's amortized across the job, but the cumulative wait across an MB outage day is real.
Concrete server log from a single diagnose during the outage:
2026-05-11 18:07:50 Fetching MusicBrainz release tracklist: 398f5c4f-…
2026-05-11 18:07:50 MusicBrainz service unavailable (503), will retry...
2026-05-11 18:07:50 BACKOFF: tracklist fetch retry 2/3, waiting 4s before retry
2026-05-11 18:07:54 Fetching MusicBrainz release tracklist: 398f5c4f-…
2026-05-11 18:07:54 MusicBrainz service unavailable (503), will retry...
2026-05-11 18:07:54 BACKOFF: tracklist fetch retry 3/3, waiting 8s before retry
2026-05-11 18:08:02 Fetching MusicBrainz release tracklist: 398f5c4f-…
2026-05-11 18:08:02 All retry attempts failed (503)
13 seconds per diagnose call, and the next diagnose pays the same toll fresh.
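For reference, the retry ladder presumably looks something like the sketch below. This is an illustrative reconstruction, not the actual client code: the 1s/4s/8s schedule comes from the backoff math above, and fetch_json_with_retries and its attempt accounting are assumptions.

import logging
import time

import requests

logger = logging.getLogger(__name__)

# Illustrative reconstruction of the existing retry ladder -- not the real client.
# Three waits (1s + 4s + 8s) between four attempts matches the ~13s of waits described above.
BACKOFF_SECONDS = (1, 4, 8)


def fetch_json_with_retries(session: requests.Session, url: str):
    attempts = len(BACKOFF_SECONDS) + 1
    for attempt in range(1, attempts + 1):
        try:
            resp = session.get(url, timeout=10)
            if resp.status_code == 503:
                raise requests.HTTPError("503 Server Maintenance", response=resp)
            resp.raise_for_status()
            return resp.json()
        except (requests.ConnectionError, requests.HTTPError):
            if attempt == attempts:
                logger.warning("All retry attempts failed (503)")
                return None
            wait = BACKOFF_SECONDS[attempt - 1]
            logger.warning("BACKOFF: retry %d/%d, waiting %ds before retry",
                           attempt, len(BACKOFF_SECONDS), wait)
            time.sleep(wait)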
Proposed fix
Add a process-local circuit breaker in integrations/musicbrainz/client.py (or a thin wrapper). When get_release_tracklist (or any MB request) exhausts its retries on 503/connection-error, stash a (failed_at, retry_after) marker on the class. While the marker is fresh (say 5 minutes), subsequent calls return None immediately instead of running through the retry ladder.
Sketch:
import logging
import time

logger = logging.getLogger(__name__)


class MusicBrainzSearcher:
    _outage_marker = None  # class-level, shared across instances
    _OUTAGE_TTL_SECONDS = 300

    def _is_in_outage_window(self) -> bool:
        if not self._outage_marker:
            return False
        return time.time() - self._outage_marker < self._OUTAGE_TTL_SECONDS

    def get_release_tracklist(self, release_id, max_retries=3):
        if self._is_in_outage_window():
            logger.debug("MB circuit breaker open; short-circuiting tracklist fetch")
            return None
        # ... existing retry loop ...
        # on full failure:
        MusicBrainzSearcher._outage_marker = time.time()
        return None
Cost: one retry storm per worker process per ~5-minute window during an outage, instead of one per call.
Scope notes
- Process-local, not cross-worker. A class-level attribute is fine here — every worker process pays a fresh retry storm on its first call after restart, but that's acceptable. A shared/distributed circuit breaker (Redis, DB) is overengineering for this.
- Short TTL. 5 minutes balances "don't keep retrying during a multi-hour outage" against "don't stay broken longer than needed when MB comes back." Could be tuned.
- Respect Retry-After. When MB returns a 429 with a Retry-After header, prefer that over the fixed TTL (see the sketch after this list).
- Other MB endpoints. get_release_tracklist is the obvious caller, but get_release_details, search_releases, search_artists, etc. all share the same risk. Probably worth installing the breaker at the request layer (e.g. wrapping self.session.get) rather than per-endpoint; a sketch follows this list.
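A minimal sketch of that request-layer variant, folding in the Retry-After handling from the note above. Names like _mb_get, _outage_until, and _trip_breaker are illustrative, and in the real change the breaker would trip only after the existing retry ladder is exhausted:

import logging
import time

import requests

logger = logging.getLogger(__name__)


class MusicBrainzSearcher:
    # Class-level, so the breaker is shared by every instance in this process.
    _outage_until = 0.0
    _OUTAGE_TTL_SECONDS = 300

    def __init__(self):
        self.session = requests.Session()  # the real client presumably has this already

    def _mb_get(self, url, **kwargs):
        """Single choke point for MB HTTP calls: checks the breaker, trips it on failure."""
        if time.time() < MusicBrainzSearcher._outage_until:
            logger.debug("MB circuit breaker open; short-circuiting %s", url)
            return None
        try:
            resp = self.session.get(url, timeout=10, **kwargs)
        except requests.ConnectionError:
            self._trip_breaker()
            return None
        if resp.status_code in (429, 503):
            # Prefer the server's Retry-After hint (seconds form) over the fixed TTL.
            retry_after = resp.headers.get("Retry-After", "")
            self._trip_breaker(int(retry_after) if retry_after.isdigit() else None)
            return None
        return resp

    def _trip_breaker(self, retry_after_seconds=None):
        hold = retry_after_seconds or MusicBrainzSearcher._OUTAGE_TTL_SECONDS
        MusicBrainzSearcher._outage_until = time.time() + hold
        logger.warning("MB circuit breaker tripped; holding MB calls for %ds", hold)

Each endpoint method (get_release_tracklist, get_release_details, the searches) would then route through _mb_get and treat a None response as the existing local-fallback path.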
Why this is "nice to have," not urgent
Correctness is fine — the gate falls back to the local presence check during MB outages (#180 work). This is purely a latency / log-noise optimization for outage windows.
Acceptance
- Successive Apple Music diagnose calls during an MB outage do not each pay the full 13s retry ladder.
- When MB comes back, the breaker closes again within the TTL window so we don't stay degraded longer than needed.
- A debug log line indicates when the breaker is short-circuiting a call, so operators can tell a short-circuited call apart from a fresh MB failure.
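A hypothetical pytest sketch for the first two criteria, assuming the class-level marker from the sketch in "Proposed fix" (import path and attribute names would need to match the real implementation):

import time
from unittest import mock

# Assumed import path from the proposal above; adjust to the real module layout.
from integrations.musicbrainz.client import MusicBrainzSearcher


def test_breaker_short_circuits_while_outage_marker_is_fresh():
    searcher = MusicBrainzSearcher()
    MusicBrainzSearcher._outage_marker = time.time()  # simulate a freshly tripped breaker
    fake_session = mock.Mock()
    with mock.patch.object(searcher, "session", fake_session, create=True):
        # "some-release-mbid" is a placeholder, not a real MBID.
        assert searcher.get_release_tracklist("some-release-mbid") is None
    fake_session.get.assert_not_called()  # no MB round trip, no retry ladder


def test_breaker_expires_after_ttl():
    searcher = MusicBrainzSearcher()
    # Marker older than the TTL: calls should go through to MB again.
    MusicBrainzSearcher._outage_marker = time.time() - MusicBrainzSearcher._OUTAGE_TTL_SECONDS - 1
    assert searcher._is_in_outage_window() is False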