metal : wind down leftover residency sets at teardown instead of aborting #24368

Open
AlexCherrypi wants to merge 1 commit into ggml-org:master from AlexCherrypi:metal-no-abort-on-quit


@AlexCherrypi

What

On macOS 15+ (Apple Silicon), ggml_metal_rsets_free() does GGML_ASSERT([rsets->data count] == 0), which calls abort() when the Metal device is torn down while residency sets are still registered. The device is freed from a C++ static destructor at process exit (ggml_metal_device_get's function-local static vector), so any app embedding the Metal backend that exits without freeing every Metal buffer first crashes on every quit.

Why it happens

A residency set is added in ggml_metal_buffer_init_* and removed from the collection in exactly one place — ggml_metal_buffer_free(). An application that lets the OS reclaim its model/weights on exit (a common, historically fine pattern) never calls ggml_backend_buffer_free for those buffers, so the collection is non-empty when the device's static destructor runs ggml_metal_rsets_free(), and the assert fires.

The device does not own those buffers and cannot free them from its destructor, so nothing inside ggml_metal_rsets_free() can make the assert hold legitimately.
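To make the lifetime mismatch concrete, here is a minimal portable C++ model of it. It is only a sketch: `Device`, `Buffer`, `ResidencySet`, `buffer_init`, `buffer_free` and `teardown_precondition_holds` are hypothetical stand-ins, not the ggml or Metal types, and the real code keeps the sets in an Objective-C collection rather than a `std::vector`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are ggml or Metal types.
struct ResidencySet { int id; };

struct Device {
    std::vector<ResidencySet *> rsets; // models the collection checked at teardown
};

struct Buffer {
    Device *       dev;  // device the buffer registered with (not owned)
    ResidencySet * rset; // set owned by the buffer
};

// Models ggml_metal_buffer_init_*: the buffer creates its set and registers it.
Buffer * buffer_init(Device & dev, int id) {
    Buffer * buf = new Buffer{&dev, new ResidencySet{id}};
    dev.rsets.push_back(buf->rset);
    return buf;
}

// Models ggml_metal_buffer_free(): the only place a set leaves the collection.
void buffer_free(Buffer * buf) {
    assert(buf != nullptr);
    std::vector<ResidencySet *> & v = buf->dev->rsets;
    v.erase(std::remove(v.begin(), v.end(), buf->rset), v.end());
    delete buf->rset;
    delete buf;
}

// The condition the old teardown asserted on.
bool teardown_precondition_holds(const Device & dev) {
    return dev.rsets.empty();
}
```

In this model, any buffer that is never passed to `buffer_free` keeps `teardown_precondition_holds` false, and the device has no way to fix that on its own because it does not own the buffer.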

Fix

Make teardown defensive instead of aborting:

  1. stop the keep-alive heartbeat (existing d_stop + dispatch_group_wait),
  2. wind down residency on any leftover sets — endResidency + removeAllAllocations, mirroring ggml_metal_buffer_rset_free() but without -release (each set is still owned by its not-yet-freed buffer, so releasing here would over-release),
  3. then release the collection as before.

The backing buffers are reclaimed by the OS as the process exits. No behavior change when all buffers were freed — the array is empty and the loop is a no-op. Guarded by the existing GGML_METAL_HAS_RESIDENCY_SETS + @available(macOS 15.0, …).
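For reviewers who want the shape of the three steps without reading the Objective-C, here is a rough C++ model. Everything in it is an assumption made for illustration: `MockResidencySet`, `MockRsets` and `rsets_free_defensive` are hypothetical names, the two methods only mimic what `endResidency` and `removeAllAllocations` are described as doing above, and `retain_count` is a plain integer standing in for ownership by the buffer.

```cpp
#include <cassert>
#include <vector>

// Hypothetical mock of a residency set: tracks only the state teardown touches.
struct MockResidencySet {
    bool resident     = true; // cleared by end_residency()
    int  allocations  = 1;    // cleared by remove_all_allocations()
    int  retain_count = 1;    // held by the owning buffer; teardown must not change it

    void end_residency()          { resident = false; }
    void remove_all_allocations() { allocations = 0; }
};

// Hypothetical mock of the device-side collection plus its heartbeat flag.
struct MockRsets {
    bool heartbeat_running = true;
    std::vector<MockResidencySet *> data; // non-owning pointers to leftover sets
};

void rsets_free_defensive(MockRsets & rsets) {
    // 1. Stop the keep-alive heartbeat (the PR reuses d_stop + dispatch_group_wait).
    rsets.heartbeat_running = false;

    // 2. Wind down residency on leftover sets without releasing them:
    //    each set is still owned by its not-yet-freed buffer.
    for (MockResidencySet * set : rsets.data) {
        assert(set != nullptr);
        set->end_residency();
        set->remove_all_allocations();
    }

    // 3. Release the collection itself; the sets are left alone.
    rsets.data.clear();
}
```

With an empty `data` vector the loop body never runs, which matches the "no behavior change when all buffers were freed" case.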

Notes

Happy to adjust — feedback welcome.

…ting

ggml_metal_rsets_free() did GGML_ASSERT([rsets->data count] == 0) and so called
abort() when the Metal device is torn down (a C++ static destructor at process
exit) while residency sets are still registered. On macOS 15+ this crashes the
app on every quit: a residency set is removed from the collection only by
ggml_metal_buffer_free(), so an app that exits without freeing every buffer
(letting the OS reclaim the model on quit) leaves sets registered.

The device does not own the buffers and cannot free them from its destructor, so
make teardown defensive instead: stop the keep-alive heartbeat, then wind down
residency on any leftover sets (endResidency + removeAllAllocations, mirroring
ggml_metal_buffer_rset_free but without -release, since each set is still owned
by its not-yet-freed buffer) before releasing the collection. The backing
buffers are reclaimed by the OS as the process exits. No behavior change when all
buffers were freed (the array is empty).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AlexCherrypi AlexCherrypi requested a review from a team as a code owner June 9, 2026 15:54
@github-actions github-actions Bot added the ggml and Apple Metal labels Jun 9, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Jun 9, 2026

Hi @AlexCherrypi, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
