Skip to content

Save/Load support for attribute information associated with DocumentProvider#941

Open
sampathrg wants to merge 9 commits intosync-from-cdb-diskannfrom
samapthrg/save-load-document-index
Open

Save/Load support for attribute information associated with DocumentProvider#941
sampathrg wants to merge 9 commits intosync-from-cdb-diskannfrom
samapthrg/save-load-document-index

Conversation

@sampathrg
Copy link
Copy Markdown

This PR adds ability to do Save/Load attribute information associated with DocumentProvider. The information is saved in the following binary format:

┌────────────────────────────────────────────────────────────────────┐
│  Header (17 bytes)                                                 │
│  [u64: num_attribute_entries]                                      │
│  [u64: forward_index_offset]  (byte offset from file start to      │
│                                Section 2)                          │
│  [u8: vector_id_type_tag]                                          │
│      0 = u32                                                       │
│      1 = u64                                                       │
├────────────────────────────────────────────────────────────────────┤
│  Section 1: Attribute Dictionary                                   │
│  Repeated `num_attribute_entries` times:                           │
│                                                                    │
│  [u64: attribute_id]                                               │
│  [u32: field_name_byte_len]                                        │
│  [u8 * field_name_byte_len: UTF-8 field name]                      │
│  [u8: type_tag]                                                    │
│      0 = Bool                                                      │
│      1 = Integer                                                   │
│      2 = Real                                                      │
│      3 = String                                                    │
│      4 = Empty                                                     │
│  [value bytes, depends on type_tag]:                               │
│      Bool:    1 byte  (0 = false, 1 = true)                        │
│      Integer: 8 bytes (i64, little-endian)                         │
│      Real:    8 bytes (f64, little-endian)                         │
│      String:  [u32: byte_len] + [u8 * byte_len: UTF-8 value]       │
│      Empty:   0 bytes                                              │
├────────────────────────────────────────────────────────────────────┤
│  Section 2: Forward Index                                          │
│  [u64: num_nodes_with_labels]                                      │
│  Repeated `num_nodes_with_labels` times:                           │
│                                                                    │
│  [N bytes: node_internal_id (width per vector_id_type_tag)]        │
│  [u32: num_attribute_ids_for_this_node]                            │
│  [u64 * num_attribute_ids: attribute IDs (sorted ascending)]       │
└────────────────────────────────────────────────────────────────────┘

There's a draft RFC for this (#868). I went ahead and created a PR because the RFC had enough details. The PR currently targets an unmerged branch associated with #782. I will target this to main once that PR gets merged.

@sampathrg sampathrg self-assigned this Apr 13, 2026
@sampathrg sampathrg requested a review from a team April 14, 2026 06:29
Comment on lines +70 to +102
pub(crate) trait LabelFilePath {
fn label_file_path(&self) -> String;
}

impl LabelFilePath for String {
fn label_file_path(&self) -> String {
format!("{}.labels.bin", self)
}
}

impl LabelFilePath for AsyncIndexMetadata {
fn label_file_path(&self) -> String {
format!("{}.labels.bin", self.prefix())
}
}

impl LabelFilePath for (u32, AsyncIndexMetadata) {
fn label_file_path(&self) -> String {
format!("{}.labels.bin", self.1.prefix())
}
}

impl LabelFilePath for (u32, u32, DiskGraphOnly) {
fn label_file_path(&self) -> String {
format!("{}.labels.bin", self.2.prefix())
}
}

impl LabelFilePath for AsyncQuantLoadContext {
fn label_file_path(&self) -> String {
format!("{}.labels.bin", self.metadata.prefix())
}
}
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a side effect of the auxiliary input for save_wtih and load_with not having any constrains. Different implementations use different inputs. This would be difficult to maintain, should we constrain the auxiliary input in some meaningful way? Perhaps all of them should implement a trait that returns a prefix?

@harsha-simhadri
Copy link
Copy Markdown
Contributor

@suhasjs @sampathrg how does this proposal relate to the serialization RFC that Suhas proposed?

@sampathrg
Copy link
Copy Markdown
Author

@suhasjs @sampathrg how does this proposal relate to the serialization RFC that Suhas proposed?

@harsha-simhadri - I was made aware of the RFC after I raised the PR. I went through the RFC, from what I can tell this PR will add additional work to the effort proposed by RFC since this PR is using the SaveWith\LoadWith traits that the serialization RFC intends to remove. If this PR is merged before the serialization RFC gets implemented, I can work with the folks who will eventually implement the new serialization scheme to port the changes in this PR or hold off the implementation till the new serialization mechanism is in-place if it's too much of a hassle. From what I can tell of the initial design, I don't see it being difficult to port over.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants