mtmd: add batching API #24384
Hi @ngxson, I just wanted to thank you for the time and patience you put into reviewing my PRs. I have learned a lot about llama.cpp in general, but especially mtmd, through that work. I would like to use that experience to help the team. If you would trust me with it, I would be glad to help with refactoring like #24384, and with the follow-up of migrating the existing models to the new batching API. The migration part especially feels like a good fit for what I have learned. No pressure either way; just tell me the shape you want and I will follow it.

Also, related to this: I did some profiling on whether batching gives a significant speed gain, and on the GPU memory overhead, testing on an M3 Max and a few small Nvidia GPUs. On small consumer-grade GPUs the speed gain was not large. Happy to share the numbers if useful.
Overview
Supersede #24300
Also fix #24380
Add a generic batching API to mtmd and wire it up to `llama-server`. The goal is to speed up llava-uhd-style models and, at the same time, improve video processing speed.

Current state:
- `llama-server` can use it correctly

TODO:
- `mtmd_batch_add_chunk` should only accept inputs of the same size
- `mtmd_batch_encode` to use the 4th batch dim, added via mtmd: build_vit batching #24352
- `build_vit()` models for now
- `mtmd-cli` to reflect the usage --> not sure, maybe a follow-up PR is better

Requirements