fix: handle missing output_dimension in embedding computation #4
neotty wants to merge 1 commit into sqliteai:main
Co-authored-by: Copilot <copilot@github.com>
Pull request overview
This PR improves robustness of the remote embedding response parser by allowing successful parsing when providers return a valid embedding array but omit output_dimension.
Changes:
- Add a fallback to derive n_embd from the embedding array length when output_dimension is missing/zero.
- Preserve existing error behavior for missing/empty embedding arrays.
    // Some providers do not return output_dimension; fallback to embedding array length.
    if (n_embd == 0 && emb_count > 0) {
        n_embd = (int)emb_count;
    }
The new fallback triggers whenever n_embd == 0, which also happens if output_dimension is present but non-numeric/invalid (since atoi() returns 0). That changes behavior from “reject invalid output_dimension” to “silently ignore it”. Consider tracking whether the output_dimension key was actually present (e.g., bool saw_output_dimension) and only falling back when it’s absent, and/or validating token type + parsing with strtol to detect invalid values explicitly.
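One way to act on this suggestion is sketched below. It is a minimal illustration, not the PR's code: `parse_positive_int` and `resolve_n_embd` are hypothetical helpers, and the sketch assumes the caller has already copied the output_dimension token into a NUL-terminated buffer and passes NULL when the key was absent.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper (not in the PR): parse a NUL-terminated decimal token
 * into a positive int. Unlike atoi(), this reports failure explicitly. */
static bool parse_positive_int(const char *text, int *out) {
    char *end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (end == text || *end != '\0') return false;        /* no digits, or trailing junk */
    if (errno == ERANGE || v <= 0 || v > INT_MAX) return false; /* out of range or not positive */
    *out = (int)v;
    return true;
}

/* Hypothetical helper: choose n_embd, or return -1 to reject the response.
 * dim_text == NULL is the assumed signal that output_dimension was absent. */
static int resolve_n_embd(const char *dim_text, size_t emb_count) {
    if (emb_count == 0 || emb_count > (size_t)INT_MAX) return -1; /* missing/empty embedding */
    if (dim_text == NULL) return (int)emb_count;   /* key absent: fall back to array length */
    int n = 0;
    if (!parse_positive_int(dim_text, &n)) return -1; /* key present but invalid: reject */
    return n;
}
```

With this split, a provider that omits output_dimension still parses, while a value such as "abc" or "0" is rejected rather than silently replaced by the array length.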
    if (emb_start < 0 || emb_count == 0 || n_embd == 0) {
        dbmem_context_set_error(engine->context, "Missing embedding data in API response");
        return -1;
Please ensure n_embd matches the actual parsed embedding length (emb_count) before returning success. If output_dimension is larger than emb_count, downstream code uses result->n_embd to size the blob (e.g., sqlite3_bind_blob(..., result->n_embd * sizeof(float), ...)), which would read past the allocated engine->embedding buffer. Suggest validating n_embd == (int)emb_count (or at least n_embd <= emb_count) and returning an error on mismatch, or overriding n_embd with emb_count when they differ.
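The check requested here could look roughly like the following. This is a sketch under assumptions: `checked_n_embd` and `embedding_blob_bytes` are hypothetical names, and it picks the strict option (n_embd must equal emb_count) out of the alternatives mentioned above.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: return an n_embd that is safe to expose to callers,
 * or -1 if the reported dimension disagrees with the parsed array length. */
static int checked_n_embd(int n_embd, size_t emb_count) {
    if (emb_count == 0 || emb_count > (size_t)INT_MAX) return -1; /* missing/empty embedding */
    if (n_embd == 0) return (int)emb_count;                 /* fallback as in this PR */
    if (n_embd < 0 || (size_t)n_embd != emb_count) return -1; /* mismatch: reject */
    return n_embd;
}

/* Byte count a caller would pass when binding the embedding as a blob. */
static size_t embedding_blob_bytes(int n_embd) {
    return (size_t)n_embd * sizeof(float);
}
```

Because a successful result always equals emb_count, the byte count derived from it can never exceed the number of floats that were actually parsed into the buffer.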
Summary
This MR improves compatibility of remote embedding response parsing by handling providers that do not return output_dimension.
Problem
The current parser requires both:
- embedding array
- output_dimension
For OpenAI-style embedding responses, output_dimension may be absent even when embedding is valid.
In that case, parsing fails with: "Missing embedding data in API response".
Changes
- When output_dimension is missing (or parsed as 0), use the embedding array length as n_embd.
Impact
- […] output_dimension.
- […] output_dimension.
Validation
- […] embedding + output_dimension
- […] embedding without output_dimension
- […] embedding).