Background
Surfaced during investigation of #180. During a multi-hour MusicBrainz outage (503 Server Maintenance), every Apple Music match call hits the MB API for the release tracklist, retries 3× with exponential backoff (1s + 4s + 8s = ~13s of waits), and finally gives up — falling back to the local recording_releases presence-check path.
For an interactive admin diagnose this adds ~12s of dead time per click. For the background apple/match_song worker it's amortized across the job, but the cumulative wait across an MB outage day is real.
Concrete server log from a single diagnose during the outage:
2026-05-11 18:07:50 Fetching MusicBrainz release tracklist: 398f5c4f-…
2026-05-11 18:07:50 MusicBrainz service unavailable (503), will retry...
2026-05-11 18:07:50 BACKOFF: tracklist fetch retry 2/3, waiting 4s before retry
2026-05-11 18:07:54 Fetching MusicBrainz release tracklist: 398f5c4f-…
2026-05-11 18:07:54 MusicBrainz service unavailable (503), will retry...
2026-05-11 18:07:54 BACKOFF: tracklist fetch retry 3/3, waiting 8s before retry
2026-05-11 18:08:02 Fetching MusicBrainz release tracklist: 398f5c4f-…
2026-05-11 18:08:02 All retry attempts failed (503)
13 seconds per diagnose call, and the next diagnose pays the same toll fresh.
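For reference, the retry ladder presumably looks something like the sketch below. This is an illustrative reconstruction, not the actual client code: the 1s/4s/8s schedule comes from the backoff math above, and fetch_json_with_retries and its attempt accounting are assumptions.

import logging
import time

import requests

logger = logging.getLogger(__name__)

# Illustrative reconstruction of the existing retry ladder -- not the real client.
# Three waits (1s + 4s + 8s) between four attempts matches the ~13s of waits described above.
BACKOFF_SECONDS = (1, 4, 8)


def fetch_json_with_retries(session: requests.Session, url: str):
    attempts = len(BACKOFF_SECONDS) + 1
    for attempt in range(1, attempts + 1):
        try:
            resp = session.get(url, timeout=10)
            if resp.status_code == 503:
                raise requests.HTTPError("503 Server Maintenance", response=resp)
            resp.raise_for_status()
            return resp.json()
        except (requests.ConnectionError, requests.HTTPError):
            if attempt == attempts:
                logger.warning("All retry attempts failed (503)")
                return None
            wait = BACKOFF_SECONDS[attempt - 1]
            logger.warning("BACKOFF: retry %d/%d, waiting %ds before retry",
                           attempt, len(BACKOFF_SECONDS), wait)
            time.sleep(wait)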
Proposed fix
Add a process-local circuit breaker in integrations/musicbrainz/client.py (or a thin wrapper). When get_release_tracklist (or any MB request) exhausts its retries on 503/connection-error, stash a (failed_at, retry_after) marker on the class. While the marker is fresh (say 5 minutes), subsequent calls return None immediately instead of running through the retry ladder.
Sketch:
import logging
import time

logger = logging.getLogger(__name__)


class MusicBrainzSearcher:
    _outage_marker = None  # class-level, shared across instances
    _OUTAGE_TTL_SECONDS = 300

    def _is_in_outage_window(self) -> bool:
        if not self._outage_marker:
            return False
        return time.time() - self._outage_marker < self._OUTAGE_TTL_SECONDS

    def get_release_tracklist(self, release_id, max_retries=3):
        if self._is_in_outage_window():
            logger.debug("MB circuit breaker open; short-circuiting tracklist fetch")
            return None
        # ... existing retry loop ...
        # on full failure:
        MusicBrainzSearcher._outage_marker = time.time()
        return None
Cost: one retry storm per worker process per ~5-minute window during an outage, instead of one per call.
Scope notes
- Process-local, not cross-worker. A class-level attribute is fine here — every worker process pays a fresh retry storm on its first call after restart, but that's acceptable. A shared/distributed circuit breaker (Redis, DB) is overengineering for this.
- Short TTL. 5 minutes balances "don't keep retrying during a multi-hour outage" against "don't stay broken longer than needed when MB comes back." Could be tuned.
- Respect Retry-After. When MB returns a 429 with a Retry-After header, prefer that over the fixed TTL (see the sketch after this list).
- Other MB endpoints. get_release_tracklist is the obvious caller, but get_release_details, search_releases, search_artists, etc. all share the same risk. Probably worth installing the breaker at the request layer (e.g. wrapping self.session.get) rather than per-endpoint; a sketch follows this list.
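A minimal sketch of that request-layer variant, folding in the Retry-After handling from the note above. Names like _mb_get, _outage_until, and _trip_breaker are illustrative, and in the real change the breaker would trip only after the existing retry ladder is exhausted:

import logging
import time

import requests

logger = logging.getLogger(__name__)


class MusicBrainzSearcher:
    # Class-level, so the breaker is shared by every instance in this process.
    _outage_until = 0.0
    _OUTAGE_TTL_SECONDS = 300

    def __init__(self):
        self.session = requests.Session()  # the real client presumably has this already

    def _mb_get(self, url, **kwargs):
        """Single choke point for MB HTTP calls: checks the breaker, trips it on failure."""
        if time.time() < MusicBrainzSearcher._outage_until:
            logger.debug("MB circuit breaker open; short-circuiting %s", url)
            return None
        try:
            resp = self.session.get(url, timeout=10, **kwargs)
        except requests.ConnectionError:
            self._trip_breaker()
            return None
        if resp.status_code in (429, 503):
            # Prefer the server's Retry-After hint (seconds form) over the fixed TTL.
            retry_after = resp.headers.get("Retry-After", "")
            self._trip_breaker(int(retry_after) if retry_after.isdigit() else None)
            return None
        return resp

    def _trip_breaker(self, retry_after_seconds=None):
        hold = retry_after_seconds or MusicBrainzSearcher._OUTAGE_TTL_SECONDS
        MusicBrainzSearcher._outage_until = time.time() + hold
        logger.warning("MB circuit breaker tripped; holding MB calls for %ds", hold)

Each endpoint method (get_release_tracklist, get_release_details, the searches) would then route through _mb_get and treat a None response as the existing local-fallback path.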
Why this is "nice to have," not urgent
Correctness is fine — the gate falls back to the local presence check during MB outages (#180 work). This is purely a latency / log-noise optimization for outage windows.
Acceptance
- Successive Apple Music diagnose calls during an MB outage do not each pay the full 13s retry ladder.
- When MB comes back, the breaker closes again within the TTL window so we don't stay degraded longer than needed.
- A debug log line indicates when the breaker is short-circuiting a call, so operators can tell a short-circuited call apart from a fresh MB failure.
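A hypothetical pytest sketch for the first two criteria, assuming the class-level marker from the sketch in "Proposed fix" (import path and attribute names would need to match the real implementation):

import time
from unittest import mock

# Assumed import path from the proposal above; adjust to the real module layout.
from integrations.musicbrainz.client import MusicBrainzSearcher


def test_breaker_short_circuits_while_outage_marker_is_fresh():
    searcher = MusicBrainzSearcher()
    MusicBrainzSearcher._outage_marker = time.time()  # simulate a freshly tripped breaker
    fake_session = mock.Mock()
    with mock.patch.object(searcher, "session", fake_session, create=True):
        # "some-release-mbid" is a placeholder, not a real MBID.
        assert searcher.get_release_tracklist("some-release-mbid") is None
    fake_session.get.assert_not_called()  # no MB round trip, no retry ladder


def test_breaker_expires_after_ttl():
    searcher = MusicBrainzSearcher()
    # Marker older than the TTL: calls should go through to MB again.
    MusicBrainzSearcher._outage_marker = time.time() - MusicBrainzSearcher._OUTAGE_TTL_SECONDS - 1
    assert searcher._is_in_outage_window() is False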