Skip to content

fix: handle missing output_dimension in embedding computation#4

Open
neotty wants to merge 1 commit intosqliteai:mainfrom
neotty:rui-dev-n_embd
Open

fix: handle missing output_dimension in embedding computation#4
neotty wants to merge 1 commit intosqliteai:mainfrom
neotty:rui-dev-n_embd

Conversation

@neotty
Copy link
Copy Markdown

@neotty neotty commented Apr 27, 2026

Summary

This MR improves compatibility of remote embedding response parsing by handling providers that do not return output_dimension.

Problem

The current parser requires both:

  • a valid embedding array
  • a non-zero output_dimension

For OpenAI-style embedding responses, output_dimension may be absent even when embedding is valid.
In that case, parsing fails with: "Missing embedding data in API response".

Changes

  • Added a fallback for embedding dimension resolution:
    • If output_dimension is missing (or parsed as 0), use the embedding array length as n_embd.
  • Kept existing validation behavior for invalid or empty embedding arrays.
  • No API/interface changes.

Impact

  • Fixes false-negative parsing errors for valid responses that omit output_dimension.
  • Preserves existing behavior for providers that do return output_dimension.
  • Backward compatible.

Validation

  • Confirmed successful parsing for responses with:
    • embedding + output_dimension
    • embedding without output_dimension
  • Error handling remains unchanged for malformed responses (e.g., missing/empty embedding).

Co-authored-by: Copilot <copilot@github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 01:42
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves robustness of the remote embedding response parser by allowing successful parsing when providers return a valid embedding array but omit output_dimension.

Changes:

  • Add a fallback to derive n_embd from the embedding array length when output_dimension is missing/zero.
  • Preserve existing error behavior for missing/empty embedding arrays.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/dbmem-rembed.c
Comment on lines +509 to +512
// Some providers do not return output_dimension; fallback to embedding array length.
if (n_embd == 0 && emb_count > 0) {
n_embd = (int)emb_count;
}
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new fallback triggers whenever n_embd == 0, which also happens if output_dimension is present but non-numeric/invalid (since atoi() returns 0). That changes behavior from “reject invalid output_dimension” to “silently ignore it”. Consider tracking whether the output_dimension key was actually present (e.g., bool saw_output_dimension) and only falling back when it’s absent, and/or validating token type + parsing with strtol to detect invalid values explicitly.

Copilot uses AI. Check for mistakes.
Comment thread src/dbmem-rembed.c
Comment on lines 514 to 516
if (emb_start < 0 || emb_count == 0 || n_embd == 0) {
dbmem_context_set_error(engine->context, "Missing embedding data in API response");
return -1;
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ensure n_embd matches the actual parsed embedding length (emb_count) before returning success. If output_dimension is larger than emb_count, downstream code uses result->n_embd to size the blob (e.g., sqlite3_bind_blob(..., result->n_embd * sizeof(float), ...)), which would read past the allocated engine->embedding buffer. Suggest validating n_embd == (int)emb_count (or at least n_embd <= emb_count) and returning an error on mismatch, or overriding n_embd with emb_count when they differ.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants