mtmd, llama: shared backend sched#24361

Draft
ngxson wants to merge 1 commit into master from xsn/mtmd_shared_sched

ngxson (Collaborator) commented Jun 9, 2026

Overview

This PR demonstrates the possibility of sharing the backend scheduler between libllama and libmtmd.

Currently, llama_context and clip_context each have their own sched, which means each also has its own compute buffer. This is quite wasteful because the two are never used in parallel: at any given moment, either llama_decode or mtmd_encode can run, but not both. So I was wondering whether the two could share the same buffer to save memory.

The same idea can also be extended to share the compute buffer between the main LLM and the draft model.
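
The saving follows from simple arithmetic: contexts that never run at the same time only need room for the largest of their compute buffers, not the sum. Below is a minimal sketch of that reasoning in plain C++. It does not use the ggml API; `separate_footprint` and `shared_footprint` are hypothetical helper names introduced only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: every context owns its own compute buffer,
// so the total footprint is the sum of all buffer sizes.
std::size_t separate_footprint(const std::vector<std::size_t> & sizes) {
    return std::accumulate(sizes.begin(), sizes.end(), std::size_t{0});
}

// Hypothetical helper: the contexts never run concurrently and share one
// buffer, so the footprint is the largest single requirement.
std::size_t shared_footprint(const std::vector<std::size_t> & sizes) {
    return sizes.empty() ? 0 : *std::max_element(sizes.begin(), sizes.end());
}
```

With illustrative sizes of 10, 20 and 40 units, the separate layout needs 70 units while the shared layout needs 40.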

However, this PR is still missing quite a lot, so I have decided to keep it as a discussion for now:

  1. This PR completely ignores the case where the text model uses less memory than the mtmd model; GGML will reallocate the buffer automatically, but that will be invisible to the end user
  2. The fit logic won't be compatible with this and will require a big refactoring
  3. Not sure whether there will be side effects on performance, since each time mtmd_encode() runs it resets the sched
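
Points 1 and 3 can be pictured with a toy model of a shared buffer that is reset before every use and grows silently when the next user needs more than is currently allocated. This is only a sketch under stated assumptions: `SharedScratch` is a hypothetical class, not the ggml scheduler, and the real reallocation and reset costs may look quite different.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a shared compute buffer (NOT the ggml scheduler).
// It only models "reset before each use, grow if the next user needs more".
class SharedScratch {
public:
    explicit SharedScratch(std::size_t initial) : buf_(initial) {}

    // Called before each use, e.g. before an encode or a decode step.
    // Returns true if the buffer had to be reallocated to fit `needed`.
    bool reserve(std::size_t needed) {
        ++n_resets_;               // every use pays for a reset (point 3)
        if (needed <= buf_.size()) {
            return false;          // existing allocation is large enough
        }
        buf_.resize(needed);       // silent growth (point 1)
        ++n_reallocs_;
        return true;
    }

    std::size_t size()       const { return buf_.size(); }
    std::size_t n_resets()   const { return n_resets_; }
    std::size_t n_reallocs() const { return n_reallocs_; }

private:
    std::vector<unsigned char> buf_;
    std::size_t n_resets_   = 0;
    std::size_t n_reallocs_ = 0;
};
```

In this model, a buffer sized for the text model serves a smaller encoder request without reallocating, while a larger request grows the buffer without the caller being told unless it checks the return value. Counting resets and reallocations like this is one way the performance question in point 3 could be measured.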

Additional information

Tested with gemma-4-E4B-it-GGUF:Q4_K_M:

  • On the master branch, memory usage is as follows: 107MB (vision) + 154MB (audio) + 400MB (text) ≈ 656MB
  • With this PR, only a single 400MB buffer is allocated, saving ~40% of memory

Requirements
