refactor: reflection-based search mapping + location geopoint#2659
refactor: reflection-based search mapping + location geopoint#2659dschmidt wants to merge 22 commits intoopencloud-eu:mainfrom
Conversation
ae6414c to
531fd21
Compare
Propose a single central Go-struct + overrides map as the source of truth for the search index layout across bleve and OpenSearch — the same definition drives the per-backend index mapping, the write-time adapter, the hit-decoding path, and the KQL compiler's case-folding rules, so the two backends cannot drift silently again. Also records the end-to-end case-handling principle for facet data (indexed as case-preserving keywords so aggregation buckets return correct display values), the sibling-field pattern for geopoint on Location, and the rationale for replacing the two backends' implicit defaults with an explicit, backend-agnostic contract. PR opencloud-eu#2659 is a proof-of-concept implementation the proposal emerged from; scope and APIs there will be revisited once this ADR lands.
Propose a single central Go-struct + overrides map as the source of truth for the search index layout across bleve and OpenSearch — the same definition drives the per-backend index mapping, the write-time adapter, the hit-decoding path, and the KQL compiler's case-folding rules, so the two backends cannot drift silently again. Also records the end-to-end case-handling principle for facet data (indexed as case-preserving keywords so aggregation buckets return correct display values), the sibling-field pattern for geopoint on Location, and the rationale for replacing the two backends' implicit defaults with an explicit, backend-agnostic contract. PR opencloud-eu#2659 is a proof-of-concept implementation the proposal emerged from; scope and APIs there will be revisited once this ADR lands.
Propose a single central Go-struct + overrides map as the source of truth for the search index layout across bleve and OpenSearch — the same definition drives the per-backend index mapping, the write-time adapter, the hit-decoding path, and the KQL compiler's case-folding rules, so the two backends cannot drift silently again. Also records the end-to-end case-handling principle for facet data (indexed as case-preserving keywords so aggregation buckets return correct display values), the sibling-field pattern for geopoint on Location, and the rationale for replacing the two backends' implicit defaults with an explicit, backend-agnostic contract. PR opencloud-eu#2659 is a proof-of-concept implementation the proposal emerged from; scope and APIs there will be revisited once this ADR lands.
Propose a single central Go-struct + overrides map as the source of truth for the search index layout across bleve and OpenSearch — the same definition drives the per-backend index mapping, the write-time adapter, the hit-decoding path, and the KQL compiler's case-folding rules, so the two backends cannot drift silently again. Also records the end-to-end case-handling principle for facet data (indexed as case-preserving keywords so aggregation buckets return correct display values), the sibling-field pattern for geopoint on Location, and the rationale for replacing the two backends' implicit defaults with an explicit, backend-agnostic contract. PR opencloud-eu#2659 is a proof-of-concept implementation the proposal emerged from; scope and APIs there will be revisited once this ADR lands.
Propose a single central Go-struct + overrides map as the source of truth for the search index layout across bleve and OpenSearch — the same definition drives the per-backend index mapping, the write-time adapter, the hit-decoding path, and the KQL compiler's case-folding rules, so the two backends cannot drift silently again. Also records the end-to-end case-handling principle for facet data (indexed as case-preserving keywords so aggregation buckets return correct display values), the sibling-field pattern for geopoint on Location, and the rationale for replacing the two backends' implicit defaults with an explicit, backend-agnostic contract. PR opencloud-eu#2659 is a proof-of-concept implementation the proposal emerged from; scope and APIs there will be revisited once this ADR lands.
|
Why the extensive usage of reflections? |
|
Short version: Keeping all the mappings in sync for the existing facets and having to do a lot of copy paste for adding new facets, is tedious and error prone. Long version
The alternative would be either codegen (more machinery for the same outcome) or keeping the three parallel hand-written schemas (which is what already doesnt work right now and what motivated the refactor). |
|
By the way looking at the line counts is a bit misleading. It's not 1.8k lines plus for a simple refactor. Yes, it's a bit more code for staying consistent and making further additions easier - but a lot of the new lines are also for tests we simply didn't have before and it also adds the geopoint feature (we can of course discuss to split it out of this PR if you prefer, it's basically a demonstrator for the concept) |
Use the exact current Go field names as json tags so the bleve index schema and OpenSearch JSON-unmarshal behavior stay identical. No reindex needed. Prepares the ground for a reflection-based mapping builder that uses json tags as the single source of truth for field names.
Introduce services/search/pkg/mapping as the foundation for a reflection-based index mapping builder. This commit contains only the building blocks: - opts.go: FieldOpts struct and Type* constants - infer.go: Go type → mapping type inference, json-tag field resolution, embedded-struct flattening walker - validate.go: checks that override keys reference real json-tag paths No production code uses these yet; followups will wire up BleveBuild- Mapping, OpenSearchBuildMapping, PrepareForIndex, and Deserialize on top.
Build a bleve DocumentMapping from a Go struct via reflection, using json tags for field names and FieldOpts overrides for analyzer / type tweaks. Nested structs recurse into sub-document mappings. Also extend FieldOpts with IncludeInAll so call sites can reproduce the existing bleve mapping exactly (Name stays in _all, Tags / Favorites / Content opt out). Not yet wired up; Step 7 will replace bleve/index.go NewMapping.
Build the OpenSearch mappings.properties map from a Go struct via
reflection, using the same FieldOpts overrides as the bleve builder.
Type mapping is tailored to OpenSearch primitives (keyword, text,
long/integer/double, date, boolean, wildcard, geo_point), with
nested structs emitted as {"properties": {...}}.
Also add TypeWildcard to the shared type constants so MimeType can be
reproduced exactly in Step 11. Bleve has no wildcard type, so it falls
back to keyword-ish text.
Not yet wired up; Step 11 will replace the static resource_v2.json.
Convert a struct to the flat map[string]any form bleve expects, via a json marshal/unmarshal round-trip. This is equivalent to what bleve does internally when given the struct directly, but it gives the mapping package a hook point for splicing in type-specific adaptations (geopoint being the first planned consumer; the overrides parameter is already part of the signature for that reason). Not yet wired up; Step 8 will route bleve/batch.go Upsert through here.
Reconstruct a typed value from bleve's flat hit.Fields map: - Nested struct pointers recurse under a "parent.child" prefix and stay nil when none of their sub-fields were present. - Slice fields accept bleve's scalar-on-single-element quirk and wrap scalars into a single-element slice. - time.Time and timestamppb.Timestamp are parsed from RFC3339 strings. - Numeric types are converted via reflect (float64 → uint64, etc). Replaces the hand-rolled getAudioValue / getImageValue / getLocation- Value / getPhotoValue + unmarshalInterfaceMap pair that Step 9 will remove in favor of this generic walker. Not yet wired up.
Replace the hand-rolled field mappings in bleve/index.go NewMapping with a call through the new mapping package: - search.Resource.SearchFieldOverrides() is the single source of truth for the per-backend field options (Name, Content, Tags, Favorites). - mapping.BleveBuildMapping walks the Resource struct and emits the DocumentMapping from json tags + overrides. - mapping.Validate runs at startup so a typo in an override key fails loudly instead of silently being ignored. The custom analyzers (lowercaseKeyword, fulltext) remain registered on the IndexMapping here — the mapping package only references them by name. Behavior is preserved: the existing ginkgo suite exercises Upsert + Search roundtrips and still passes.
…Index All four write paths in bleve/batch.go (Upsert, Move, Delete, Restore) now go through a shared indexResource helper that calls mapping.Prepare- ForIndex before handing the document to bleve. That gives the mapping package a single hook point for type-specific adaptations (geopoint adapter lands in the showcase commit at the end). Bleve's own reflection-based field walker behaved the same way the json roundtrip does here, so the ginkgo suite still passes unchanged.
bleve/bleve.go: matchToResource now goes through the generic mapping.Deserialize[search.Resource]. Drop getAudioValue, getImage- Value, getLocationValue, getPhotoValue, unmarshalInterfaceMap, new- PointerOfType and getFieldName — the mapping package covers all of them. The legacy "only expose Audio for MimeType audio/*" safety net is preserved as a post-processing step. bleve/backend.go: Search() now uses mapping.DeserializeAt for the Audio / Image / Location / Photo facets on the protobuf Match. This relies on the json tags generated on the proto structs. Also add DeserializeAt for reading sub-trees of the flat fields map under a dotted prefix, returning nil when nothing matches so the enclosing pointer stays nil.
Drop internal/indexes/resource_v2.json and build the index body from search.Resource + mapping.OpenSearchBuildMapping inside index.go. The V2 IndexManager sentinel now routes MarshalJSON to buildResource- V2Mapping instead of reading an embedded file; V1 still reads from the embedded FS for migration compatibility. The generated mapping is a completed superset of the old static JSON: - Content: text + term_vector, no explicit analyzer (dropped the "fulltext" default for OpenSearch since the analyzer is a bleve- only thing; resource_v2.json did not set one either). - Path: text + path_hierarchy (unchanged). - MimeType: wildcard (unchanged; doc_values: false is dropped as a minor storage tradeoff). - Name / Tags / Favorites: text + lowercaseKeyword. This unifies OpenSearch behavior with bleve's (previously Favorites was plain "keyword" on OpenSearch, so a lowercaseKeyword analyzer is now registered in the settings so OpenSearch can speak the same name). - Facet fields (audio, image, location, photo): now explicit nested object mappings driven by the proto-embedded json tags, instead of relying on OpenSearch's dynamic templating. Existing OpenSearch indexes keep their stored mapping; the new shape only applies to new indexes. Queries keep working because our query generation does not rely on the specific mapping primitives that changed here.
The four add{Audio,Image,Location,Photo}Metadata functions were
identical modulo the facet type and metadata key prefix. Replace them
with a single generic addFacetMetadata parameterised on the libregraph
MappedNullable type. The nil check moves into the generic body via
reflect, so the doUpsertItem call site reads as four one-liners.
Twin of Deserialize[T] but for string-typed flat maps (CS3's ArbitraryMetadata is map[string]string, not map[string]any). Parses raw strings into string / bool / intN / uintN / floatN / time.Time / timestamppb.Timestamp target fields via strconv and time.Parse, using the same json-tag walker fillStruct uses. Returns (nil, nil) when no field under the prefix matched so callers can leave the enclosing pointer nil. Prepares a follow-up in services/graph to replace the hand-rolled unmarshalStringMap + four cs3ResourceToDriveItem*Facet functions.
…tringMap
services/graph/pkg/service/v0/driveitems.go had a parallel copy of the
same hand-rolled reflection walker that services/search/pkg/bleve used
to have (unmarshalStringMap + getFieldName), plus four near-identical
cs3ResourceToDriveItem{Audio,Image,Location,Photo}Facet wrappers. The
mapping package already provides this machinery — point graph at it.
A small local setFacet helper swallows parse errors and logs them,
matching the original fail-soft behavior (bad metadata leaves the
facet nil, it does not propagate).
Adding a new facet (motionPhoto, ...) is now one line in the call
site of cs3ResourceInfoToDriveItem.
…helper OpenSearchHitToMatch had four near-identical inline closures for the Audio / Image / Location / Photo facets, each calling conversions.To from the libregraph indexed shape to the protobuf searchMessage shape. Replace them with a small local generic helper: func copyFacet[Dst, Src any](src *Src) *Dst The Audio MimeType guard stays visible at the call site instead of hiding inside the closure. Net -10 lines, and adding a new facet (motionPhoto, ...) to this conversion is now one line instead of five.
Showcase commit: after the mapping refactor, turning on geopoint
indexing for the Location facet is essentially one line —
SearchFieldOverrides now includes:
"location": {Type: mapping.TypeGeopoint}
Everything else falls out of the existing pipeline:
- BleveBuildMapping emits a sub-document at "location" that carries
both a geopoint field (for geo_distance queries) and explicit
numeric sub-fields for longitude / latitude / altitude. The
sub-fields make hit.Fields round-trip the full GeoCoordinates
instead of leaving altitude at the mercy of Dynamic mapping; this
preserves Move / Delete / Restore.
- OpenSearchBuildMapping emits {"type": "geo_point"}; altitude stays
available via the OpenSearch _source round-trip.
- PrepareForIndex adapts the flat "location" map into a geoPoint
struct bleve recognizes (Lat() / Lon() methods + exported
longitude / latitude / altitude fields bleve indexes as sub-fields).
Covered by unit tests in bleve/geo_verify_test.go:
- location.longitude / latitude / altitude all land in hit.Fields
- numeric range queries on each of the three sub-fields return the
expected hit (and miss when the range excludes the indexed value)
- bleve.GeoDistanceQuery matches when inside the radius and misses
when the search point is outside
Reshape TypeGeopoint so the libregraph facet (longitude / latitude /
altitude) stays as a plain object sub-document — queryable the usual
way (e.g. location.latitude:>49) and fully round-trippable via
hit.Fields / _source — and a sibling "<name>_geopoint" field carries
the {lat, lon} form the spatial indices understand.
Same shape on both backends:
bleve: sub-document with numeric sub-properties
+ sibling "<name>_geopoint" with NewGeoPointFieldMapping
OpenSearch: { "properties": { longitude, latitude, altitude } }
+ sibling "<name>_geopoint" with {"type": "geo_point"}
PrepareForIndex now walks overrides and, for each TypeGeopoint entry
at any dotted path (e.g. "location" or "journey.start"), inserts the
{lat, lon} sibling at the same parent level. The original facet map
stays untouched — no struct wrapper, no value replacement — so
multiple geopoints per facet work automatically (journey.start and
journey.end both yield siblings journey.start_geopoint and
journey.end_geopoint).
OpenSearch's batch Upsert now routes through mapping.PrepareForIndex
instead of the bare conversions.To[map[string]any], so a single write
path handles the sibling injection for both backends.
Drive-by: emit "doc_values": false on TypeWildcard fields to match
how OpenSearch normalizes wildcard mappings on read; keeps the Apply
up-to-date check stable.
Verified with an OpenSearch 2 container (full services/search/pkg/
opensearch test suite green).
The struct → map[string]any step of PrepareForIndex did its own json.Marshal + json.Unmarshal, which is exactly what pkg/conversions provides. Route through conversions.To[map[string]any] and keep the geopoint sibling adapter as the only interesting thing PrepareForIndex actually contributes on top of the generic converter. No behavior change; the generic tests and the live OpenSearch suite still pass.
… fail-soft Two tightly-coupled fixes against the hit → struct pipeline: - Deserialize[T] and DeserializeAt[T] had an unused map[string]FieldOpts parameter left over from the pre-sibling-field geopoint iteration. Every call site passed either nil or the SearchFieldOverrides map, and the function body never looked at it. Drop it — the signature now lines up with DeserializeStringMap. - The previous per-field parse errors (type mismatch in setValue, malformed RFC3339 in parseTime) bubbled up and caused matchTo- Resource to return nil. Callers in bleve/index.go (searchResource- ById, searchResourcesByPath) dereferenced that result directly, so a single corrupt field in a hit could panic Move / Delete / Restore. The pre-refactor getFieldValue[T] based code was fail- soft (type mismatches → zero value, no error). Restore that behavior: per-field decode failures leave the affected field at its zero value; Deserialize only returns an error when T itself isn't a struct. matchToResource now always returns a non-nil Resource. Covered by a new TestDeserializeIsFailSoft unit test that feeds intentionally corrupt fields and asserts the surrounding record still round-trips.
Four separate consumers of the indexed Audio facet each carried the same "only expose Audio when MimeType starts with audio/" check — bleve.matchToResource, bleve.Search, opensearch/convert.Open- SearchHitToMatch, graph.cs3ResourceInfoToDriveItem. It existed as a defensive post-process in case audio.* fields somehow leaked into a non-audio record. In practice the write path only populates audio.* for MimeType audio/* (tika.Extract guards it there), so the read-side check is a pure micro-optimization that never fires on clean data. Drop all four instances; trust the write path. If a future corruption mode actually produces out-of-band audio.* data, the right place to catch that is closer to the ingestion bug, not on every reader.
Three small follow-ups from the review of the series so far: - Cache the Resource SearchFieldOverrides map via sync.OnceValue so hot paths (matchToResource per hit, indexResource per upsert on both backends) reuse the same read-only map instead of rebuilding it from scratch each call. Callers that mutate (only the OpenSearch index builder) keep using maps.Clone. - Turn the OpenSearch IndexManager MarshalJSON dispatch into a map lookup (indexGenerators) instead of a hard-coded equality check against IndexIndexManagerResourceV2. Adding a future resource_v3 variant is now one entry in the map. - Add a DeserializeStringMap test that covers the embedded-struct flattening branch (fillStructFromStrings's fi.Embedded leg) — libregraph's generated types happen to not use embedding, but the walker supports it and the test guards against regressions.
The fail-soft rewrite of setSlice added a reflect.New per element to accommodate per-element failure skipping, which pushed Deserialize- Resource from 68 to ~78 allocs per call. Compact the slice in place instead: pre-allocate once, slide successful writes into j, trim to out[:j] at the end. Same fail-soft semantics (failing items drop out of the result), single MakeSlice regardless of item count. Benchmark delta vs the fail-soft commit: DeserializeResource: 9379 → 8903 ns (-5%), 78 → 70 allocs (-10%)
The bleve KQL compiler kept a hand-maintained set of field names whose query values need to be pre-lowercased to match the lowercasing analyzer at index time. The comment on the set even said "Keep in sync with services/search/pkg/bleve/index.go NewMapping" — a classic drift hazard that's now resolved: the same overrides map that drives the index mapping also drives the compiler. A field ends up in the set if its override selects the lowercase- Keyword analyzer or the fulltext type (whose analyzer lowercases in its token filter). Anything else preserves case both at index time and query time. Behavior unchanged for the current overrides (Name, Tags, Favorites, Content); a future override that picks a lowercase- ing analyzer will get the matching query-side fold for free.
87589de to
a25f04a
Compare
|



Summary
Merges the bleve and OpenSearch index mappings into one reflection-based package (
services/search/pkg/mapping) driven by thesearch.Resourcestruct + a small overrides map. Pullsservices/graph's duplicated reflection walker onto the same helpers and, as a showcase of what the refactor enables, turns onlocation_geopointindexing for spatial queries on both backends.Adding a new facet (motionPhoto, ...) is now roughly: one struct field on
content.Document, one line inservice.go, one line in graph, one line per backend hit converter, plus whatever tika extraction logic the facet actually needs. Everything else falls out of reflection.Behavior changes (deliberate)
Existing indexes keep their stored mapping; the new shape only applies to newly-created indexes.
Tags/Favorites: dynamickeyword→text + lowercaseKeyword(unified with bleve; the analyzer is registered in the new index settings).audio.*,photo.*,image.*): dynamictext + keywordmulti-field →keyword-only. The tokenized path was never reachable from KQL anyway (no dot-syntax + pre-fix(search): preserve value case for non-lowercased bleve fields #2633 lowercasing), so no working query regresses; aggregations now produce correct case-preserving buckets on both backends.location: the libregraph{longitude, latitude, altitude}object is preserved at thelocationkey on both backends (numeric sub-field queries likelocation.latitude:>49keep working). A siblinglocation_geopointis added for geo-distance / bounding-box / polygon queries.