mtmd: add batching API #24384

Draft
ngxson wants to merge 1 commit into ggml-org:master from ngxson:xsn/mtmd_batch_api

ngxson commented Jun 9, 2026

Collaborator

Overview

Supersedes #24300.

Also fixes #24380.

Adds a generic batching API to mtmd and wires it up to llama-server. The goal is to speed up llava-uhd-style models and, at the same time, improve video processing speed.

Current state:

  • llama-server can use the new API correctly
  • the mtmd API implementation is still a mock-up; the proper logic needs to be implemented

TODO:

  • add a notion of max batch size in mtmd
  • add a CLI argument for it
  • mtmd_batch_add_chunk should only accept inputs of the same size
  • wire up mtmd_batch_encode to use the 4th (batch) dim, added via "mtmd: build_vit batching" #24352
  • blacklist / whitelist models that can support it --> maybe only support build_vit() models for now
  • maybe update mtmd-cli to reflect the usage --> not sure, maybe a follow-up PR is better
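
To make the "max batch size" and "same size only" items above concrete, here is a minimal, self-contained sketch of the acceptance rule. Everything in it is hypothetical: `toy_chunk`, `toy_batch`, `toy_batch_add_chunk` and `toy_batch_pack` are illustration-only names, not the mtmd API, and the real `mtmd_batch_add_chunk` / `mtmd_batch_encode` signatures are not shown in this PR description. The fixed 3-channel layout is also an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a preprocessed image chunk (NOT an mtmd type).
// Assumption: 3 channels, so data holds nx * ny * 3 floats.
struct toy_chunk {
    int nx;
    int ny;
    std::vector<float> data;
};

// Hypothetical batch with an upper bound on the number of chunks.
struct toy_batch {
    std::size_t max_size;
    std::vector<const toy_chunk *> chunks;
};

// Accept a chunk only if the batch is not full and the chunk has the same
// dimensions as the chunks already queued. Returns false on rejection, so
// the caller can flush the batch and start a new one.
bool toy_batch_add_chunk(toy_batch & batch, const toy_chunk & chunk) {
    if (batch.chunks.size() >= batch.max_size) {
        return false; // batch is full
    }
    if (!batch.chunks.empty()) {
        const toy_chunk * first = batch.chunks.front();
        if (first->nx != chunk.nx || first->ny != chunk.ny) {
            return false; // size mismatch: cannot share one batched tensor
        }
    }
    batch.chunks.push_back(&chunk);
    return true;
}

// Concatenate the queued chunks into one contiguous buffer. Because every
// chunk has the same size, the chunk index acts as the outermost (4th)
// dimension of the packed buffer.
std::vector<float> toy_batch_pack(const toy_batch & batch) {
    std::vector<float> out;
    for (const toy_chunk * c : batch.chunks) {
        out.insert(out.end(), c->data.begin(), c->data.end());
    }
    return out;
}
```

Returning `false` instead of asserting is one possible design: a caller such as a server loop could encode the current batch and retry the rejected chunk in a fresh one. Whether the actual API reports rejection this way is not stated here.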

sfallah commented Jun 11, 2026

Contributor

Hi @ngxson,

I just wanted to thank you for the time and patience you put into reviewing my PRs. I have learned a lot about llama.cpp in general, but especially mtmd, through that work. I would like to use that experience to help the team.

If you would trust me with it, I would be glad to help with refactoring like #24384, and with the follow-up of migrating the existing models to the new batching API. The migration part especially feels like a good fit for what I have learned.

No pressure either way — just tell me the shape you want and I will follow it.

Also, related to this: I did some profiling on whether batching gives a significant speed gain, and on the GPU memory overhead, testing on an M3 Max and a few small Nvidia GPUs. On small consumer-grade GPUs the speed gain was not large. Happy to share the numbers if useful.


Development

Successfully merging this pull request may close these issues.

Avoid re-encoding mtmd chunk when prefill MTP context