mtmd: build_vit batching #24352
can you run note: granite is known to be broken
I have already, there are two that are failing when I run
@ngxson
        std::function<ggml_tensor *(ggml_tensor *, const clip_layer &)> add_pos,
        const build_vit_opts & opts
    ) {
        // batch dim: inp is [n_embd, n_pos] (B==1) or [n_embd, n_pos, B] (multi-tile encode)
note that batching is not just for multi-tile encode, but it should eventually allow batching multiple images of same size. that will be important for video processing where we need to process multiple images in the same pass
I will fix this comment along with my refactoring to add the proper architecture for doing so
Overview
This PR introduces an optional batch dimension in `build_vit`, so a caller can encode several same-size inputs (image tiles, frames) in one graph.
No change for existing models: for a 2D `[n_embd, n_pos]` input (`B == 1`), nothing changes.
Changes
`build_vit` takes `inp` as `[n_embd, n_pos]` or `[n_embd, n_pos, B]`.
Internally the tensor is laid out as `[n_embd, n_pos * B]`; the batch only reappears in self-attention as 4D `[d_head, n_head, n_pos, B]` Q/K/V views. The output is restored to `[n_embd, n_pos, B]`.
First consumer: DeepSeek-OCR multi-tile encoding (#24300, stacked on this).
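The shape changes listed above can be sanity-checked without ggml. The following is a minimal sketch, not code from this PR: it assumes contiguous tensors with ggml's convention that dimension 0 is the fastest-varying one, and the names `vit_shape`, `offset_3d`, `offset_flat`, `offset_4d` and `layouts_agree` are hypothetical. It shows that the batched 3D input, the flat 2D layout, and the 4D attention layout all address the same element at the same linear offset, which is why switching between them needs no data movement under those assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration type, not part of clip.cpp / mtmd.
struct vit_shape {
    size_t n_embd;  // embedding width
    size_t n_pos;   // positions (patches) per input
    size_t n_batch; // B: number of same-size inputs
    size_t n_head;  // attention heads; must divide n_embd
};

// Linear offset of element (e, p, b) in a contiguous [n_embd, n_pos, B] tensor.
inline size_t offset_3d(const vit_shape & s, size_t e, size_t p, size_t b) {
    return e + s.n_embd * (p + s.n_pos * b);
}

// Linear offset of element (e, t) in the flat [n_embd, n_pos * B] layout,
// where t = p + n_pos * b enumerates the positions of all batch entries.
inline size_t offset_flat(const vit_shape & s, size_t e, size_t t) {
    return e + s.n_embd * t;
}

// Linear offset of element (d, h, p, b) in a contiguous
// [d_head, n_head, n_pos, B] tensor, with d_head = n_embd / n_head.
inline size_t offset_4d(const vit_shape & s, size_t d, size_t h, size_t p, size_t b) {
    const size_t d_head = s.n_embd / s.n_head;
    return d + d_head * (h + s.n_head * (p + s.n_pos * b));
}

// Returns true if the three layouts agree on the offset of every element.
inline bool layouts_agree(const vit_shape & s) {
    assert(s.n_head > 0 && s.n_embd % s.n_head == 0);
    const size_t d_head = s.n_embd / s.n_head;
    for (size_t b = 0; b < s.n_batch; ++b) {
        for (size_t p = 0; p < s.n_pos; ++p) {
            for (size_t h = 0; h < s.n_head; ++h) {
                for (size_t d = 0; d < d_head; ++d) {
                    const size_t e = d + d_head * h; // embedding index split into (d, h)
                    const size_t t = p + s.n_pos * b; // flat position index
                    const size_t ref = offset_3d(s, e, p, b);
                    if (offset_flat(s, e, t) != ref || offset_4d(s, d, h, p, b) != ref) {
                        return false;
                    }
                }
            }
        }
    }
    return true;
}
```

With `B == 1` the batch term vanishes and `offset_3d` reduces to the plain 2D offset `e + n_embd * p`, which is consistent with the statement that existing single-input models are unaffected.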
Testing
Built `llama-mtmd-cli`; DeepSeek-OCR single-view still matches (the `B == 1` path).
Ran `tools/mtmd/tests.sh big`; all tests that pass on master pass here too.
The `huge` variant is not tested.
Requirements