refactor: reflection-based search mapping + location geopoint#2659

Draft
dschmidt wants to merge 22 commits into opencloud-eu:main from dschmidt:refactor/search-mapping

Conversation

@dschmidt
Contributor

@dschmidt dschmidt commented Apr 22, 2026

Summary

Merges the bleve and OpenSearch index mappings into one reflection-based package (services/search/pkg/mapping) driven by the search.Resource struct + a small overrides map. Pulls services/graph's duplicated reflection walker onto the same helpers and, as a showcase of what the refactor enables, turns on location_geopoint indexing for spatial queries on both backends.

Adding a new facet (motionPhoto, ...) is now roughly: one struct field on content.Document, one line in service.go, one line in graph, one line per backend hit converter, plus whatever tika extraction logic the facet actually needs. Everything else falls out of reflection.

Behavior changes (deliberate)

Existing indexes keep their stored mapping; the new shape only applies to newly-created indexes.

  • OpenSearch Tags / Favorites: dynamic keyword → text + lowercaseKeyword (unified with bleve; the analyzer is registered in the new index settings).
  • OpenSearch facet sub-strings (audio.*, photo.*, image.*): dynamic text + keyword multi-field → keyword-only. The tokenized path was never reachable from KQL anyway (no dot-syntax, plus the lowercasing that predates #2633, "fix(search): preserve value case for non-lowercased bleve fields"), so no working query regresses; aggregations now produce correct case-preserving buckets on both backends.
  • location: the libregraph {longitude, latitude, altitude} object is preserved at the location key on both backends (numeric sub-field queries like location.latitude:>49 keep working). A sibling location_geopoint is added for geo-distance / bounding-box / polygon queries.
  • graph facet parsing: on malformed metadata the whole facet drops (previously only the malformed field did).

@dschmidt dschmidt force-pushed the refactor/search-mapping branch from ae6414c to 531fd21 Compare April 22, 2026 23:33
@dschmidt dschmidt changed the title refactor(search): build index mappings via reflection + generic helpers refactor: reflection-based search mapping + location geopoint Apr 23, 2026
dschmidt added a commit to dschmidt/opencloud that referenced this pull request Apr 23, 2026
Propose a single central Go-struct + overrides map as the source of
truth for the search index layout across bleve and OpenSearch — the
same definition drives the per-backend index mapping, the write-time
adapter, the hit-decoding path, and the KQL compiler's case-folding
rules, so the two backends cannot drift silently again.

Also records the end-to-end case-handling principle for facet data
(indexed as case-preserving keywords so aggregation buckets return
correct display values), the sibling-field pattern for geopoint on
Location, and the rationale for replacing the two backends' implicit
defaults with an explicit, backend-agnostic contract.

PR opencloud-eu#2659 is a proof-of-concept implementation the proposal emerged
from; scope and APIs there will be revisited once this ADR lands.
dschmidt added a commit to dschmidt/opencloud that referenced this pull request Apr 27, 2026
Propose a single central Go-struct + overrides map as the source of
truth for the search index layout across bleve and OpenSearch — the
same definition drives the per-backend index mapping, the write-time
adapter, the hit-decoding path, and the KQL compiler's case-folding
rules, so the two backends cannot drift silently again.

Also records the end-to-end case-handling principle for facet data
(indexed as case-preserving keywords so aggregation buckets return
correct display values), the sibling-field pattern for geopoint on
Location, and the rationale for replacing the two backends' implicit
defaults with an explicit, backend-agnostic contract.

PR opencloud-eu#2659 is a proof-of-concept implementation the proposal emerged
from; scope and APIs there will be revisited once this ADR lands.
@dragonchaser
Member

Why the extensive use of reflection?

@dschmidt
Contributor Author

dschmidt commented Apr 30, 2026

Short version:
It's there so the search.Resource struct (plus a small overrides map) becomes the single source of truth for everything schema-shaped: the bleve DocumentMapping, the OpenSearch properties JSON, and the hit→struct deserialization on read. One struct, two backends, no parallel schemas to keep in sync.

Keeping all the mappings in sync for the existing facets, and copy-pasting a lot of code to add new ones, is tedious and error-prone.
The new code is not completely trivial to read, I agree, but it is write-once and hopefully rarely touched again. It pulls the scattered pieces of reflection usage together into the mapping package, so there is actually less reflection spread around the code base.

Long version

  1. No drift between bleve and OpenSearch. Right now there are two hand-maintained mapping definitions plus a third hand-maintained "read fields back out of hit.Fields" path. They have already drifted (e.g. the OpenSearch dynamic-keyword vs bleve text+lowercaseKeyword mismatch the PR description calls out). With reflection both backends walk the same fields with the same json-tag names via walkFields (infer.go), so they can't disagree by accident.

  2. Kills a duplicated walker. services/graph had its own reflection walker doing essentially the same thing; this PR collapses it onto the shared helpers in mapping/infer.go (walkFields, resolveField, inferType, structType).

  3. Adding a facet is now ~5 lines. As the PR description says: one field on content.Document, one line in service.go, one in graph, one per backend hit converter. No new mapping JSON, no new "read this key out of bleve" branch — Deserialize[T] (deserialize.go) handles it from the json tags. That's the whole point of the refactor and it's why location_geopoint could be added as a near-trivial showcase.

  4. json tags are already the contract. Field names on the wire are already driven by struct tags for marshalling, so reusing them via reflection for index field names just removes a second, hand-typed copy of the same names.

  5. Cost is bounded. Reflection runs once at index-mapping build time (startup) and once per hit on read. It's not on a hot inner loop and it's not in the query path — bleve/OpenSearch do the actual searching natively against the materialized mapping.

The alternative would be either codegen (more machinery for the same outcome) or keeping the three parallel hand-written schemas (which is what already doesn't work today and what motivated the refactor).

@dschmidt
Contributor Author

By the way, looking at the line counts is a bit misleading. It's not 1.8k added lines for a simple refactor.

Yes, it's a bit more code for staying consistent and making further additions easier, but a lot of the new lines are tests we simply didn't have before, and the PR also adds the geopoint feature (we can of course discuss splitting it out of this PR if you prefer; it's basically a demonstrator for the concept).

@dschmidt dschmidt deleted the branch opencloud-eu:main May 5, 2026 10:35
@dschmidt dschmidt closed this May 5, 2026
@dschmidt dschmidt reopened this May 5, 2026
@dschmidt dschmidt changed the base branch from fix/search-preserve-value-case to main May 5, 2026 10:56
dschmidt added 15 commits May 5, 2026 14:00
Use the exact current Go field names as json tags so the bleve index
schema and OpenSearch JSON-unmarshal behavior stay identical. No
reindex needed.

Prepares the ground for a reflection-based mapping builder that uses
json tags as the single source of truth for field names.
Introduce services/search/pkg/mapping as the foundation for a
reflection-based index mapping builder. This commit contains only the
building blocks:

- opts.go: FieldOpts struct and Type* constants
- infer.go: Go type → mapping type inference, json-tag field resolution,
  embedded-struct flattening walker
- validate.go: checks that override keys reference real json-tag paths

No production code uses these yet; followups will wire up
BleveBuildMapping, OpenSearchBuildMapping, PrepareForIndex, and
Deserialize on top.
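
A minimal sketch of how the json-tag walker and the override validation described above could fit together. The names `jsonPaths` and `validate`, the trimmed `FieldOpts`, and the `Resource` / `Location` stand-ins are hypothetical and do not reproduce the actual code in infer.go or validate.go; the sketch also does not handle embedded-struct flattening or self-referential types.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
	"strings"
)

// FieldOpts is a hypothetical, trimmed-down override entry.
type FieldOpts struct {
	Type     string
	Analyzer string
}

// Location and Resource are stand-ins for the real indexed structs.
type Location struct {
	Longitude *float64 `json:"longitude,omitempty"`
	Latitude  *float64 `json:"latitude,omitempty"`
}

type Resource struct {
	Name     string    `json:"Name"`
	Tags     []string  `json:"Tags"`
	Location *Location `json:"location,omitempty"`
}

// jsonPaths records the dotted json-tag path of every exported, tagged
// field, including intermediate struct fields such as "location".
func jsonPaths(t reflect.Type, prefix string, out map[string]bool) {
	for t.Kind() == reflect.Pointer {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name, _, _ := strings.Cut(f.Tag.Get("json"), ",")
		if !f.IsExported() || name == "" || name == "-" {
			continue
		}
		out[prefix+name] = true
		jsonPaths(f.Type, prefix+name+".", out)
	}
}

// validate returns an error listing override keys that match no path of v.
func validate(v any, overrides map[string]FieldOpts) error {
	known := map[string]bool{}
	jsonPaths(reflect.TypeOf(v), "", known)
	var unknown []string
	for key := range overrides {
		if !known[key] {
			unknown = append(unknown, key)
		}
	}
	if len(unknown) == 0 {
		return nil
	}
	sort.Strings(unknown)
	return fmt.Errorf("override keys without a matching field: %s", strings.Join(unknown, ", "))
}

func main() {
	err := validate(Resource{}, map[string]FieldOpts{
		"Name":     {Analyzer: "lowercaseKeyword"},
		"loaction": {Type: "geopoint"}, // typo on purpose
	})
	fmt.Println(err)
}
```

Running the check once at startup is what turns a mistyped override key into a loud failure instead of a silently ignored entry.
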
Build a bleve DocumentMapping from a Go struct via reflection, using
json tags for field names and FieldOpts overrides for analyzer / type
tweaks. Nested structs recurse into sub-document mappings.

Also extend FieldOpts with IncludeInAll so call sites can reproduce
the existing bleve mapping exactly (Name stays in _all, Tags /
Favorites / Content opt out).

Not yet wired up; Step 7 will replace bleve/index.go NewMapping.
Build the OpenSearch mappings.properties map from a Go struct via
reflection, using the same FieldOpts overrides as the bleve builder.
Type mapping is tailored to OpenSearch primitives (keyword, text,
long/integer/double, date, boolean, wildcard, geo_point), with
nested structs emitted as {"properties": {...}}.

Also add TypeWildcard to the shared type constants so MimeType can be
reproduced exactly in Step 11. Bleve has no wildcard type, so it falls
back to keyword-ish text.

Not yet wired up; Step 11 will replace the static resource_v2.json.
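
As an illustration of the type inference this commit describes, the following sketch derives an OpenSearch-style `properties` map from a struct. It is an assumption-laden simplification: `properties`, `resourceMapping`, and the `Resource` stand-in are hypothetical, overrides are passed as ready-made field definitions rather than FieldOpts, and all integer kinds map to `long`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"strings"
	"time"
)

// Stand-ins for the real indexed structs.
type Location struct {
	Longitude *float64 `json:"longitude,omitempty"`
	Latitude  *float64 `json:"latitude,omitempty"`
}

type Resource struct {
	Name     string    `json:"Name"`
	Size     uint64    `json:"Size"`
	Mtime    time.Time `json:"Mtime"`
	Deleted  bool      `json:"Deleted"`
	Tags     []string  `json:"Tags"`
	Location *Location `json:"location,omitempty"`
}

var timeType = reflect.TypeOf(time.Time{})

// properties builds the "properties" map for struct type t. overrides maps
// a dotted json path to a complete field definition that wins over inference.
func properties(t reflect.Type, prefix string, overrides map[string]map[string]any) map[string]any {
	props := map[string]any{}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name, _, _ := strings.Cut(f.Tag.Get("json"), ",")
		if !f.IsExported() || name == "" || name == "-" {
			continue
		}
		if o, ok := overrides[prefix+name]; ok {
			props[name] = o
			continue
		}
		ft := f.Type
		for ft.Kind() == reflect.Pointer || ft.Kind() == reflect.Slice {
			ft = ft.Elem() // OpenSearch has no separate array type
		}
		switch {
		case ft == timeType:
			props[name] = map[string]any{"type": "date"}
		case ft.Kind() == reflect.Struct:
			props[name] = map[string]any{"properties": properties(ft, prefix+name+".", overrides)}
		case ft.Kind() == reflect.Bool:
			props[name] = map[string]any{"type": "boolean"}
		case ft.Kind() == reflect.Float32 || ft.Kind() == reflect.Float64:
			props[name] = map[string]any{"type": "double"}
		case ft.Kind() >= reflect.Int && ft.Kind() <= reflect.Uint64:
			props[name] = map[string]any{"type": "long"}
		default:
			props[name] = map[string]any{"type": "keyword"}
		}
	}
	return props
}

// resourceMapping applies one illustrative override on top of the inference.
func resourceMapping() map[string]any {
	overrides := map[string]map[string]any{
		"Name": {"type": "text", "analyzer": "lowercaseKeyword"},
	}
	return properties(reflect.TypeOf(Resource{}), "", overrides)
}

func main() {
	body, _ := json.MarshalIndent(map[string]any{"properties": resourceMapping()}, "", "  ")
	fmt.Println(string(body))
}
```
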
Convert a struct to the flat map[string]any form bleve expects, via a
json marshal/unmarshal round-trip. This is equivalent to what bleve
does internally when given the struct directly, but it gives the
mapping package a hook point for splicing in type-specific adaptations
(geopoint being the first planned consumer; the overrides parameter is
already part of the signature for that reason).

Not yet wired up; Step 8 will route bleve/batch.go Upsert through here.
Reconstruct a typed value from bleve's flat hit.Fields map:

- Nested struct pointers recurse under a "parent.child" prefix and stay
  nil when none of their sub-fields were present.
- Slice fields accept bleve's scalar-on-single-element quirk and wrap
  scalars into a single-element slice.
- time.Time and timestamppb.Timestamp are parsed from RFC3339 strings.
- Numeric types are converted via reflect (float64 → uint64, etc).

Replaces the hand-rolled getAudioValue / getImageValue /
getLocationValue / getPhotoValue + unmarshalInterfaceMap pair that
Step 9 will remove in favor of this generic walker.

Not yet wired up.
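
The rules listed above can be sketched roughly as follows. This is not the code in deserialize.go: `deserialize`, `fill`, `setValue` and the struct stand-ins are hypothetical, timestamps and pointer-typed leaf fields are left out, and values that cannot be converted are simply skipped.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

type Audio struct {
	Artist string `json:"artist"`
	Track  int32  `json:"track"`
}

type Photo struct {
	CameraMake string `json:"cameraMake"`
}

type Resource struct {
	Name  string   `json:"Name"`
	Size  uint64   `json:"Size"`
	Tags  []string `json:"Tags"`
	Audio *Audio   `json:"audio"`
	Photo *Photo   `json:"photo"`
}

// setValue writes raw into dst and reports whether it succeeded.
func setValue(dst reflect.Value, raw any) bool {
	if dst.Kind() == reflect.Slice {
		items, ok := raw.([]any)
		if !ok {
			items = []any{raw} // single-element slices come back as a scalar
		}
		out := reflect.MakeSlice(dst.Type(), 0, len(items))
		for _, it := range items {
			el := reflect.New(dst.Type().Elem()).Elem()
			if setValue(el, it) {
				out = reflect.Append(out, el)
			}
		}
		dst.Set(out)
		return true
	}
	rv := reflect.ValueOf(raw)
	if !rv.IsValid() || !rv.Type().ConvertibleTo(dst.Type()) {
		return false
	}
	dst.Set(rv.Convert(dst.Type())) // e.g. float64 -> uint64
	return true
}

// fill populates struct v from the flat map and reports whether any key
// under prefix was present.
func fill(v reflect.Value, fields map[string]any, prefix string) bool {
	matched := false
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name, _, _ := strings.Cut(f.Tag.Get("json"), ",")
		if !f.IsExported() || name == "" || name == "-" {
			continue
		}
		key := prefix + name
		if f.Type.Kind() == reflect.Pointer && f.Type.Elem().Kind() == reflect.Struct {
			child := reflect.New(f.Type.Elem())
			if fill(child.Elem(), fields, key+".") {
				v.Field(i).Set(child) // otherwise the pointer stays nil
				matched = true
			}
			continue
		}
		if raw, ok := fields[key]; ok && setValue(v.Field(i), raw) {
			matched = true
		}
	}
	return matched
}

// deserialize rebuilds a T from a flat "parent.child" keyed map.
func deserialize[T any](fields map[string]any) (*T, error) {
	out := new(T)
	v := reflect.ValueOf(out).Elem()
	if v.Kind() != reflect.Struct {
		return nil, fmt.Errorf("deserialize: %s is not a struct", v.Type())
	}
	fill(v, fields, "")
	return out, nil
}

func main() {
	r, err := deserialize[Resource](map[string]any{
		"Name": "song.mp3", "Tags": "rock", "Size": float64(42), "audio.artist": "Queen",
	})
	fmt.Println(r.Name, r.Tags, r.Size, r.Audio.Artist, r.Photo == nil, err)
}
```
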
Replace the hand-rolled field mappings in bleve/index.go NewMapping
with a call through the new mapping package:

- search.Resource.SearchFieldOverrides() is the single source of truth
  for the per-backend field options (Name, Content, Tags, Favorites).
- mapping.BleveBuildMapping walks the Resource struct and emits the
  DocumentMapping from json tags + overrides.
- mapping.Validate runs at startup so a typo in an override key fails
  loudly instead of silently being ignored.

The custom analyzers (lowercaseKeyword, fulltext) remain registered on
the IndexMapping here — the mapping package only references them by
name. Behavior is preserved: the existing ginkgo suite exercises Upsert
+ Search roundtrips and still passes.
…Index

All four write paths in bleve/batch.go (Upsert, Move, Delete, Restore)
now go through a shared indexResource helper that calls
mapping.PrepareForIndex before handing the document to bleve. That
gives the mapping package a single hook point for type-specific
adaptations (geopoint adapter lands in the showcase commit at the end).

Bleve's own reflection-based field walker behaved the same way the
json roundtrip does here, so the ginkgo suite still passes unchanged.
bleve/bleve.go: matchToResource now goes through the generic
mapping.Deserialize[search.Resource]. Drop getAudioValue,
getImageValue, getLocationValue, getPhotoValue, unmarshalInterfaceMap,
newPointerOfType and getFieldName; the mapping package covers all of
them. The legacy "only expose Audio for MimeType audio/*" safety net
is preserved as a post-processing step.

bleve/backend.go: Search() now uses mapping.DeserializeAt for the
Audio / Image / Location / Photo facets on the protobuf Match. This
relies on the json tags generated on the proto structs.

Also add DeserializeAt for reading sub-trees of the flat fields map
under a dotted prefix, returning nil when nothing matches so the
enclosing pointer stays nil.
Drop internal/indexes/resource_v2.json and build the index body from
search.Resource + mapping.OpenSearchBuildMapping inside index.go. The
V2 IndexManager sentinel now routes MarshalJSON to
buildResourceV2Mapping instead of reading an embedded file; V1 still
reads from the embedded FS for migration compatibility.

The generated mapping is a complete superset of the old static JSON:

- Content: text + term_vector, no explicit analyzer (dropped the
  "fulltext" default for OpenSearch since the analyzer is a bleve-
  only thing; resource_v2.json did not set one either).
- Path: text + path_hierarchy (unchanged).
- MimeType: wildcard (unchanged; doc_values: false is dropped as a
  minor storage tradeoff).
- Name / Tags / Favorites: text + lowercaseKeyword. This unifies
  OpenSearch behavior with bleve's (previously Favorites was plain
  "keyword" on OpenSearch, so a lowercaseKeyword analyzer is now
  registered in the settings so OpenSearch can speak the same name).
- Facet fields (audio, image, location, photo): now explicit nested
  object mappings driven by the proto-embedded json tags, instead of
  relying on OpenSearch's dynamic templating.

Existing OpenSearch indexes keep their stored mapping; the new shape
only applies to new indexes. Queries keep working because our query
generation does not rely on the specific mapping primitives that
changed here.
The four add{Audio,Image,Location,Photo}Metadata functions were
identical modulo the facet type and metadata key prefix. Replace them
with a single generic addFacetMetadata parameterised on the libregraph
MappedNullable type. The nil check moves into the generic body via
reflect, so the doUpsertItem call site reads as four one-liners.
Twin of Deserialize[T] but for string-typed flat maps (CS3's
ArbitraryMetadata is map[string]string, not map[string]any). Parses
raw strings into string / bool / intN / uintN / floatN / time.Time /
timestamppb.Timestamp target fields via strconv and time.Parse, using
the same json-tag walker fillStruct uses. Returns (nil, nil) when no
field under the prefix matched so callers can leave the enclosing
pointer nil.

Prepares a follow-up in services/graph to replace the hand-rolled
unmarshalStringMap + four cs3ResourceToDriveItem*Facet functions.
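
A rough sketch of the string-map variant, under several assumptions: `fromStringMap` and `setFromString` are hypothetical names, the `libre.graph.audio.` key prefix and the `Audio` stand-in are illustrative, only one struct level is walked, and time parsing is omitted. Unlike the hit decoder, this sketch reports parse errors to the caller.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

// Audio mimics a generated facet type with pointer-typed optional fields.
type Audio struct {
	Artist *string `json:"artist,omitempty"`
	Year   *int32  `json:"year,omitempty"`
}

// setFromString parses s according to the kind of dst.
func setFromString(dst reflect.Value, s string) error {
	switch dst.Kind() {
	case reflect.String:
		dst.SetString(s)
	case reflect.Bool:
		b, err := strconv.ParseBool(s)
		if err != nil {
			return err
		}
		dst.SetBool(b)
	case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
		n, err := strconv.ParseInt(s, 10, dst.Type().Bits())
		if err != nil {
			return err
		}
		dst.SetInt(n)
	case reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64:
		n, err := strconv.ParseUint(s, 10, dst.Type().Bits())
		if err != nil {
			return err
		}
		dst.SetUint(n)
	case reflect.Float32, reflect.Float64:
		x, err := strconv.ParseFloat(s, dst.Type().Bits())
		if err != nil {
			return err
		}
		dst.SetFloat(x)
	default:
		return fmt.Errorf("unsupported kind %s", dst.Kind())
	}
	return nil
}

// fromStringMap fills a T from keys of the form prefix+jsonTag. It returns
// (nil, nil) when no key matched, so the caller can leave its pointer nil.
func fromStringMap[T any](m map[string]string, prefix string) (*T, error) {
	out := new(T)
	v := reflect.ValueOf(out).Elem()
	if v.Kind() != reflect.Struct {
		return nil, fmt.Errorf("target must be a struct, got %s", v.Kind())
	}
	matched := false
	for i := 0; i < v.NumField(); i++ {
		f := v.Type().Field(i)
		name, _, _ := strings.Cut(f.Tag.Get("json"), ",")
		if !f.IsExported() || name == "" || name == "-" {
			continue
		}
		s, ok := m[prefix+name]
		if !ok {
			continue
		}
		dst := v.Field(i)
		if dst.Kind() == reflect.Pointer {
			dst.Set(reflect.New(dst.Type().Elem()))
			dst = dst.Elem()
		}
		if err := setFromString(dst, s); err != nil {
			return nil, fmt.Errorf("%s: %w", prefix+name, err)
		}
		matched = true
	}
	if !matched {
		return nil, nil
	}
	return out, nil
}

func main() {
	meta := map[string]string{
		"libre.graph.audio.artist": "Queen",
		"libre.graph.audio.year":   "1975",
	}
	a, err := fromStringMap[Audio](meta, "libre.graph.audio.")
	fmt.Println(*a.Artist, *a.Year, err)
}
```
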
…tringMap

services/graph/pkg/service/v0/driveitems.go had a parallel copy of the
same hand-rolled reflection walker that services/search/pkg/bleve used
to have (unmarshalStringMap + getFieldName), plus four near-identical
cs3ResourceToDriveItem{Audio,Image,Location,Photo}Facet wrappers. The
mapping package already provides this machinery — point graph at it.

A small local setFacet helper swallows parse errors and logs them,
matching the original fail-soft behavior (bad metadata leaves the
facet nil, it does not propagate).

Adding a new facet (motionPhoto, ...) is now one line in the call
site of cs3ResourceInfoToDriveItem.
…helper

OpenSearchHitToMatch had four near-identical inline closures for the
Audio / Image / Location / Photo facets, each calling conversions.To
from the libregraph indexed shape to the protobuf searchMessage
shape. Replace them with a small local generic helper:

  func copyFacet[Dst, Src any](src *Src) *Dst

The Audio MimeType guard stays visible at the call site instead of
hiding inside the closure.

Net -10 lines, and adding a new facet (motionPhoto, ...) to this
conversion is now one line instead of five.
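
One plausible shape for such a helper, assuming the conversion is a json marshal/unmarshal round-trip between two types that share json tags. The body below is a guess, not the code from the PR, and `indexedAudio` / `protoAudio` are hypothetical stand-ins for the libregraph and protobuf facet types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexedAudio and protoAudio stand in for two facet types with equal tags.
type indexedAudio struct {
	Artist *string `json:"artist,omitempty"`
}

type protoAudio struct {
	Artist *string `json:"artist,omitempty"`
}

// copyFacet converts src to a *Dst via json; a nil src or a failed
// conversion yields nil so the facet is simply left out of the match.
func copyFacet[Dst, Src any](src *Src) *Dst {
	if src == nil {
		return nil
	}
	raw, err := json.Marshal(src)
	if err != nil {
		return nil
	}
	dst := new(Dst)
	if err := json.Unmarshal(raw, dst); err != nil {
		return nil
	}
	return dst
}

func main() {
	artist := "Queen"
	out := copyFacet[protoAudio](&indexedAudio{Artist: &artist})
	fmt.Println(*out.Artist)
}
```

Because `Src` is inferred from the argument, each call site only has to name the destination type, which is what keeps every facet to a single line.
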
Showcase commit: after the mapping refactor, turning on geopoint
indexing for the Location facet is essentially one line —
SearchFieldOverrides now includes:

  "location": {Type: mapping.TypeGeopoint}

Everything else falls out of the existing pipeline:

- BleveBuildMapping emits a sub-document at "location" that carries
  both a geopoint field (for geo_distance queries) and explicit
  numeric sub-fields for longitude / latitude / altitude. The
  sub-fields make hit.Fields round-trip the full GeoCoordinates
  instead of leaving altitude at the mercy of Dynamic mapping; this
  preserves Move / Delete / Restore.
- OpenSearchBuildMapping emits {"type": "geo_point"}; altitude stays
  available via the OpenSearch _source round-trip.
- PrepareForIndex adapts the flat "location" map into a geoPoint
  struct bleve recognizes (Lat() / Lon() methods + exported
  longitude / latitude / altitude fields bleve indexes as sub-fields).

Covered by unit tests in bleve/geo_verify_test.go:
- location.longitude / latitude / altitude all land in hit.Fields
- numeric range queries on each of the three sub-fields return the
  expected hit (and miss when the range excludes the indexed value)
- bleve.GeoDistanceQuery matches when inside the radius and misses
  when the search point is outside
dschmidt added 7 commits May 5, 2026 14:08
Reshape TypeGeopoint so the libregraph facet (longitude / latitude /
altitude) stays as a plain object sub-document — queryable the usual
way (e.g. location.latitude:>49) and fully round-trippable via
hit.Fields / _source — and a sibling "<name>_geopoint" field carries
the {lat, lon} form the spatial indices understand.

Same shape on both backends:

  bleve:      sub-document with numeric sub-properties
              + sibling "<name>_geopoint" with NewGeoPointFieldMapping
  OpenSearch: { "properties": { longitude, latitude, altitude } }
              + sibling "<name>_geopoint" with {"type": "geo_point"}

PrepareForIndex now walks overrides and, for each TypeGeopoint entry
at any dotted path (e.g. "location" or "journey.start"), inserts the
{lat, lon} sibling at the same parent level. The original facet map
stays untouched — no struct wrapper, no value replacement — so
multiple geopoints per facet work automatically (journey.start and
journey.end both yield siblings journey.start_geopoint and
journey.end_geopoint).

OpenSearch's batch Upsert now routes through mapping.PrepareForIndex
instead of the bare conversions.To[map[string]any], so a single write
path handles the sibling injection for both backends.

Drive-by: emit "doc_values": false on TypeWildcard fields to match
how OpenSearch normalizes wildcard mappings on read; keeps the Apply
up-to-date check stable.

Verified with an OpenSearch 2 container (full services/search/pkg/
opensearch test suite green).
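
The sibling injection can be sketched as below. `prepareForIndex` and `addGeopointSiblings` are hypothetical simplifications: the struct is converted with a plain json round-trip, the geopoint paths are passed as a string slice instead of being read from the overrides map, and the `Resource` / `Location` types are stand-ins.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type Location struct {
	Longitude *float64 `json:"longitude,omitempty"`
	Latitude  *float64 `json:"latitude,omitempty"`
	Altitude  *float64 `json:"altitude,omitempty"`
}

type Resource struct {
	Name     string    `json:"Name"`
	Location *Location `json:"location,omitempty"`
}

// addGeopointSiblings inserts "<name>_geopoint": {lat, lon} next to every
// object found at one of the dotted geoPaths. The object itself is untouched.
func addGeopointSiblings(doc map[string]any, geoPaths []string) {
	for _, path := range geoPaths {
		parts := strings.Split(path, ".")
		parent, found := doc, true
		for _, p := range parts[:len(parts)-1] {
			next, ok := parent[p].(map[string]any)
			if !ok {
				found = false
				break
			}
			parent = next
		}
		if !found {
			continue
		}
		leaf := parts[len(parts)-1]
		obj, ok := parent[leaf].(map[string]any)
		if !ok {
			continue
		}
		lat, hasLat := obj["latitude"].(float64)
		lon, hasLon := obj["longitude"].(float64)
		if hasLat && hasLon {
			parent[leaf+"_geopoint"] = map[string]any{"lat": lat, "lon": lon}
		}
	}
}

// prepareForIndex converts v to a map via json and adds the siblings.
func prepareForIndex(v any, geoPaths []string) (map[string]any, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var doc map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	addGeopointSiblings(doc, geoPaths)
	return doc, nil
}

func main() {
	lat, lon := 49.45, 11.08
	doc, err := prepareForIndex(
		Resource{Name: "a.jpg", Location: &Location{Latitude: &lat, Longitude: &lon}},
		[]string{"location"},
	)
	fmt.Println(doc, err)
}
```

Documents without coordinates pass through unchanged, so the same write path can serve resources with and without a location facet.
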
The struct → map[string]any step of PrepareForIndex did its own
json.Marshal + json.Unmarshal, which is exactly what pkg/conversions
provides. Route through conversions.To[map[string]any] and keep the
geopoint sibling adapter as the only interesting thing PrepareForIndex
actually contributes on top of the generic converter.

No behavior change; the generic tests and the live OpenSearch suite
still pass.
… fail-soft

Two tightly-coupled fixes against the hit → struct pipeline:

- Deserialize[T] and DeserializeAt[T] had an unused
  map[string]FieldOpts parameter left over from the pre-sibling-field
  geopoint iteration. Every call site passed either nil or the
  SearchFieldOverrides map, and the function body never looked at
  it. Drop it — the signature now lines up with DeserializeStringMap.

- The previous per-field parse errors (type mismatch in setValue,
  malformed RFC3339 in parseTime) bubbled up and caused
  matchToResource to return nil. Callers in bleve/index.go
  (searchResourceById, searchResourcesByPath) dereferenced that
  result directly, so a single corrupt field in a hit could panic
  Move / Delete / Restore. The pre-refactor getFieldValue[T] based
  code was fail-soft (type mismatches → zero value, no error).
  Restore that behavior: per-field decode failures leave the
  affected field at its zero value; Deserialize only returns an
  error when T itself isn't a struct. matchToResource now always
  returns a non-nil Resource.

Covered by a new TestDeserializeIsFailSoft unit test that feeds
intentionally corrupt fields and asserts the surrounding record
still round-trips.
Four separate consumers of the indexed Audio facet each carried the
same "only expose Audio when MimeType starts with audio/" check:
bleve.matchToResource, bleve.Search,
opensearch/convert.OpenSearchHitToMatch and
graph.cs3ResourceInfoToDriveItem. It existed as a defensive
post-process in case audio.* fields somehow leaked into a non-audio
record.

In practice the write path only populates audio.* for MimeType
audio/* (tika.Extract guards it there), so the read-side check is a
pure micro-optimization that never fires on clean data. Drop all
four instances; trust the write path. If a future corruption mode
actually produces out-of-band audio.* data, the right place to
catch that is closer to the ingestion bug, not on every reader.
Three small follow-ups from the review of the series so far:

- Cache the Resource SearchFieldOverrides map via sync.OnceValue so
  hot paths (matchToResource per hit, indexResource per upsert on
  both backends) reuse the same read-only map instead of rebuilding
  it from scratch each call. Callers that mutate (only the
  OpenSearch index builder) keep using maps.Clone.

- Turn the OpenSearch IndexManager MarshalJSON dispatch into a map
  lookup (indexGenerators) instead of a hard-coded equality check
  against IndexIndexManagerResourceV2. Adding a future resource_v3
  variant is now one entry in the map.

- Add a DeserializeStringMap test that covers the embedded-struct
  flattening branch (fillStructFromStrings's fi.Embedded leg) —
  libregraph's generated types happen to not use embedding, but the
  walker supports it and the test guards against regressions.
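
The caching pattern from the first follow-up can be illustrated with the standard library's `sync.OnceValue` and `maps.Clone` (both available since Go 1.21). The override entries, the `builds` counter and `overridesForOpenSearch` are hypothetical and only serve the demonstration.

```go
package main

import (
	"fmt"
	"maps"
	"sync"
)

// FieldOpts is a trimmed-down, hypothetical override entry.
type FieldOpts struct{ Type, Analyzer string }

// builds counts how often the map is actually constructed.
var builds int

// searchFieldOverrides builds the map on first use and then hands out the
// same instance; callers must treat it as read-only.
var searchFieldOverrides = sync.OnceValue(func() map[string]FieldOpts {
	builds++
	return map[string]FieldOpts{
		"Name": {Analyzer: "lowercaseKeyword"},
		"Tags": {Analyzer: "lowercaseKeyword"},
	}
})

// overridesForOpenSearch is an example of a caller that needs to mutate:
// it works on a shallow copy so the shared map stays untouched.
func overridesForOpenSearch() map[string]FieldOpts {
	o := maps.Clone(searchFieldOverrides())
	o["MimeType"] = FieldOpts{Type: "wildcard"}
	return o
}

func main() {
	shared := searchFieldOverrides()
	private := overridesForOpenSearch()
	fmt.Println(len(shared), len(private), builds)
}
```
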
The fail-soft rewrite of setSlice added a reflect.New per element to
accommodate per-element failure skipping, which pushed
DeserializeResource from 68 to ~78 allocs per call. Compact the slice
in place instead: pre-allocate once, slide successful writes into j,
trim to out[:j] at the end. Same fail-soft semantics (failing items
drop out of the result), single MakeSlice regardless of item count.

Benchmark delta vs the fail-soft commit:
  DeserializeResource: 9379 → 8903 ns (-5%), 78 → 70 allocs (-10%)
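
The in-place compaction could look roughly like this. It is a sketch of the technique rather than the PR's setSlice: `setValue` here only accepts directly assignable values, and `decodeStrings` is a hypothetical wrapper added for the demonstration.

```go
package main

import (
	"fmt"
	"reflect"
)

// setValue writes raw into dst when the types line up and reports success.
// On failure dst is left unchanged, which the compaction below relies on.
func setValue(dst reflect.Value, raw any) bool {
	rv := reflect.ValueOf(raw)
	if !rv.IsValid() || !rv.Type().AssignableTo(dst.Type()) {
		return false
	}
	dst.Set(rv)
	return true
}

// setSlice decodes items into dst with a single MakeSlice: each item is
// written into slot j, j only advances on success, and the tail is trimmed.
func setSlice(dst reflect.Value, items []any) {
	out := reflect.MakeSlice(dst.Type(), len(items), len(items))
	j := 0
	for _, it := range items {
		if setValue(out.Index(j), it) {
			j++
		}
	}
	dst.Set(out.Slice(0, j))
}

// decodeStrings is a small typed wrapper for demonstration purposes.
func decodeStrings(items []any) []string {
	var out []string
	setSlice(reflect.ValueOf(&out).Elem(), items)
	return out
}

func main() {
	fmt.Println(decodeStrings([]any{"a", 5.0, "b", nil}))
}
```

A failed item never advances `j`, so the next successful write reuses the same slot and no per-element temporary is needed.
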
The bleve KQL compiler kept a hand-maintained set of field names whose
query values need to be pre-lowercased to match the lowercasing
analyzer at index time. The comment on the set even said "Keep in
sync with services/search/pkg/bleve/index.go NewMapping" — a classic
drift hazard that's now resolved: the same overrides map that drives
the index mapping also drives the compiler.

A field ends up in the set if its override selects the
lowercaseKeyword analyzer or the fulltext type (whose analyzer
lowercases in its token filter). Anything else preserves case both at
index time and query time. Behavior unchanged for the current
overrides (Name, Tags, Favorites, Content); a future override that
picks a lowercasing analyzer will get the matching query-side fold
for free.
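
A minimal sketch of deriving the query-side fold set from the overrides map. The constants, the trimmed `FieldOpts`, and the `lowercasedFields` / `foldValue` helpers are hypothetical names chosen for illustration; only the rule itself (lowercaseKeyword analyzer or fulltext type) is taken from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	TypeFulltext             = "fulltext"
	AnalyzerLowercaseKeyword = "lowercaseKeyword"
)

// FieldOpts is a trimmed-down, hypothetical override entry.
type FieldOpts struct{ Type, Analyzer string }

// lowercasedFields returns the fields whose index-time analysis lowercases,
// so query values for them must be lowercased as well.
func lowercasedFields(overrides map[string]FieldOpts) map[string]bool {
	set := map[string]bool{}
	for field, o := range overrides {
		if o.Analyzer == AnalyzerLowercaseKeyword || o.Type == TypeFulltext {
			set[field] = true
		}
	}
	return set
}

// foldValue lowercases value only for fields in the set.
func foldValue(set map[string]bool, field, value string) string {
	if set[field] {
		return strings.ToLower(value)
	}
	return value
}

func main() {
	set := lowercasedFields(map[string]FieldOpts{
		"Name":     {Analyzer: AnalyzerLowercaseKeyword},
		"Content":  {Type: TypeFulltext},
		"MimeType": {Type: "wildcard"},
	})
	fmt.Println(foldValue(set, "Name", "Report.PDF"), foldValue(set, "MimeType", "Image/PNG"))
}
```
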
@dschmidt dschmidt force-pushed the refactor/search-mapping branch from 87589de to a25f04a Compare May 5, 2026 12:09
@sonarqubecloud

sonarqubecloud Bot commented May 5, 2026
