refactor: reflection-based search mapping + location geopoint by dschmidt · Pull Request #2659 · opencloud-eu/opencloud

dschmidt · 2026-04-22T19:24:37Z

Summary

Merges the bleve and OpenSearch index mappings into one reflection-based package (services/search/pkg/mapping) driven by the search.Resource struct + a small overrides map. Pulls services/graph's duplicated reflection walker onto the same helpers and, as a showcase of what the refactor enables, turns on location_geopoint indexing for spatial queries on both backends.

Adding a new facet (motionPhoto, ...) is now roughly: one struct field on content.Document, one line in service.go, one line in graph, one line per backend hit converter, plus whatever tika extraction logic the facet actually needs. Everything else falls out of reflection.

Behavior changes (deliberate)

Existing indexes keep their stored mapping; the new shape only applies to newly-created indexes.

OpenSearch Tags / Favorites: dynamic keyword → text + lowercaseKeyword (unified with bleve; the analyzer is registered in the new index settings).
OpenSearch facet sub-strings (audio.*, photo.*, image.*): dynamic text + keyword multi-field → keyword-only. The tokenized path was never reachable from KQL anyway (no dot-syntax + pre-fix(search): preserve value case for non-lowercased bleve fields #2633 lowercasing), so no working query regresses; aggregations now produce correct case-preserving buckets on both backends.
location: the libregraph {longitude, latitude, altitude} object is preserved at the location key on both backends (numeric sub-field queries like location.latitude:>49 keep working). A sibling location_geopoint is added for geo-distance / bounding-box / polygon queries.
graph facet parsing: on malformed metadata the whole facet drops (previously only the malformed field did).

Propose a single central Go-struct + overrides map as the source of truth for the search index layout across bleve and OpenSearch — the same definition drives the per-backend index mapping, the write-time adapter, the hit-decoding path, and the KQL compiler's case-folding rules, so the two backends cannot drift silently again. Also records the end-to-end case-handling principle for facet data (indexed as case-preserving keywords so aggregation buckets return correct display values), the sibling-field pattern for geopoint on Location, and the rationale for replacing the two backends' implicit defaults with an explicit, backend-agnostic contract. PR opencloud-eu#2659 is a proof-of-concept implementation the proposal emerged from; scope and APIs there will be revisited once this ADR lands.

dragonchaser · 2026-04-29T07:59:54Z

Why the extensive usage of reflections?

dschmidt · 2026-04-30T06:52:32Z

Short version:
It's there so the search.Resource struct (plus a small overrides map) becomes the single source of truth for everything schema-shaped: the bleve DocumentMapping, the OpenSearch properties JSON, and the hit→struct deserialization on read. One struct, two backends, no parallel schemas to keep in sync.

Keeping all the mappings in sync for the existing facets and having to do a lot of copy paste for adding new facets, is tedious and error prone.
The new code is not completely trivial to read, I agree, but it's basically write once and probably/hopefully never touch again. Basically it pulls together different pieces of reflection usage in a central location and concentrates it in the mapping package, so actually there's less reflection stuff spread around the code base.

Long version

No drift between bleve and OpenSearch. right now there are two hand-maintained mapping definitions plus a third hand-maintained "read fields back out of hit.Fields" path. They have already drifted (e.g. the OpenSearch dynamic-keyword vs bleve text+lowercaseKeyword mismatch the PR description calls out). With reflection both backends walk the same fields with the same json-tag names via walkFields (infer.go), so they can't disagree by accident.
Kills a duplicated walker. services/graph had its own reflection walker doing essentially the same thing; this PR collapses it onto the shared helpers in mapping/infer.go (walkFields, resolveField, inferType, structType).
Adding a facet is now ~5 lines. As the PR description says: one field on content.Document, one line in service.go, one in graph, one per backend hit converter. No new mapping JSON, no new "read this key out of bleve" branch — Deserialize[T] (deserialize.go) handles it from the json tags. That's the whole point of the refactor and it's why location_geopoint could be added as a near-trivial showcase.
json tags are already the contract. Field names on the wire are already driven by struct tags for marshalling, so reusing them via reflection for index field names just removes a second, hand-typed copy of the same names.
Cost is bounded. Reflection runs once at index-mapping build time (startup) and once per hit on read. It's not on a hot inner loop and it's not in the query path — bleve/OpenSearch do the actual searching natively against the materialized mapping.

The alternative would be either codegen (more machinery for the same outcome) or keeping the three parallel hand-written schemas (which is what already doesnt work right now and what motivated the refactor).

dschmidt · 2026-04-30T08:46:57Z

By the way looking at the line counts is a bit misleading. It's not 1.8k lines plus for a simple refactor.

Yes, it's a bit more code for staying consistent and making further additions easier - but a lot of the new lines are also for tests we simply didn't have before and it also adds the geopoint feature (we can of course discuss to split it out of this PR if you prefer, it's basically a demonstrator for the concept)

Use the exact current Go field names as json tags so the bleve index schema and OpenSearch JSON-unmarshal behavior stay identical. No reindex needed. Prepares the ground for a reflection-based mapping builder that uses json tags as the single source of truth for field names.

Introduce services/search/pkg/mapping as the foundation for a reflection-based index mapping builder. This commit contains only the building blocks: - opts.go: FieldOpts struct and Type* constants - infer.go: Go type → mapping type inference, json-tag field resolution, embedded-struct flattening walker - validate.go: checks that override keys reference real json-tag paths No production code uses these yet; followups will wire up BleveBuild- Mapping, OpenSearchBuildMapping, PrepareForIndex, and Deserialize on top.

Build a bleve DocumentMapping from a Go struct via reflection, using json tags for field names and FieldOpts overrides for analyzer / type tweaks. Nested structs recurse into sub-document mappings. Also extend FieldOpts with IncludeInAll so call sites can reproduce the existing bleve mapping exactly (Name stays in _all, Tags / Favorites / Content opt out). Not yet wired up; Step 7 will replace bleve/index.go NewMapping.

Build the OpenSearch mappings.properties map from a Go struct via reflection, using the same FieldOpts overrides as the bleve builder. Type mapping is tailored to OpenSearch primitives (keyword, text, long/integer/double, date, boolean, wildcard, geo_point), with nested structs emitted as {"properties": {...}}. Also add TypeWildcard to the shared type constants so MimeType can be reproduced exactly in Step 11. Bleve has no wildcard type, so it falls back to keyword-ish text. Not yet wired up; Step 11 will replace the static resource_v2.json.

Convert a struct to the flat map[string]any form bleve expects, via a json marshal/unmarshal round-trip. This is equivalent to what bleve does internally when given the struct directly, but it gives the mapping package a hook point for splicing in type-specific adaptations (geopoint being the first planned consumer; the overrides parameter is already part of the signature for that reason). Not yet wired up; Step 8 will route bleve/batch.go Upsert through here.

Reconstruct a typed value from bleve's flat hit.Fields map: - Nested struct pointers recurse under a "parent.child" prefix and stay nil when none of their sub-fields were present. - Slice fields accept bleve's scalar-on-single-element quirk and wrap scalars into a single-element slice. - time.Time and timestamppb.Timestamp are parsed from RFC3339 strings. - Numeric types are converted via reflect (float64 → uint64, etc). Replaces the hand-rolled getAudioValue / getImageValue / getLocation- Value / getPhotoValue + unmarshalInterfaceMap pair that Step 9 will remove in favor of this generic walker. Not yet wired up.

Replace the hand-rolled field mappings in bleve/index.go NewMapping with a call through the new mapping package: - search.Resource.SearchFieldOverrides() is the single source of truth for the per-backend field options (Name, Content, Tags, Favorites). - mapping.BleveBuildMapping walks the Resource struct and emits the DocumentMapping from json tags + overrides. - mapping.Validate runs at startup so a typo in an override key fails loudly instead of silently being ignored. The custom analyzers (lowercaseKeyword, fulltext) remain registered on the IndexMapping here — the mapping package only references them by name. Behavior is preserved: the existing ginkgo suite exercises Upsert + Search roundtrips and still passes.

…Index All four write paths in bleve/batch.go (Upsert, Move, Delete, Restore) now go through a shared indexResource helper that calls mapping.Prepare- ForIndex before handing the document to bleve. That gives the mapping package a single hook point for type-specific adaptations (geopoint adapter lands in the showcase commit at the end). Bleve's own reflection-based field walker behaved the same way the json roundtrip does here, so the ginkgo suite still passes unchanged.

bleve/bleve.go: matchToResource now goes through the generic mapping.Deserialize[search.Resource]. Drop getAudioValue, getImage- Value, getLocationValue, getPhotoValue, unmarshalInterfaceMap, new- PointerOfType and getFieldName — the mapping package covers all of them. The legacy "only expose Audio for MimeType audio/*" safety net is preserved as a post-processing step. bleve/backend.go: Search() now uses mapping.DeserializeAt for the Audio / Image / Location / Photo facets on the protobuf Match. This relies on the json tags generated on the proto structs. Also add DeserializeAt for reading sub-trees of the flat fields map under a dotted prefix, returning nil when nothing matches so the enclosing pointer stays nil.

Drop internal/indexes/resource_v2.json and build the index body from search.Resource + mapping.OpenSearchBuildMapping inside index.go. The V2 IndexManager sentinel now routes MarshalJSON to buildResource- V2Mapping instead of reading an embedded file; V1 still reads from the embedded FS for migration compatibility. The generated mapping is a completed superset of the old static JSON: - Content: text + term_vector, no explicit analyzer (dropped the "fulltext" default for OpenSearch since the analyzer is a bleve- only thing; resource_v2.json did not set one either). - Path: text + path_hierarchy (unchanged). - MimeType: wildcard (unchanged; doc_values: false is dropped as a minor storage tradeoff). - Name / Tags / Favorites: text + lowercaseKeyword. This unifies OpenSearch behavior with bleve's (previously Favorites was plain "keyword" on OpenSearch, so a lowercaseKeyword analyzer is now registered in the settings so OpenSearch can speak the same name). - Facet fields (audio, image, location, photo): now explicit nested object mappings driven by the proto-embedded json tags, instead of relying on OpenSearch's dynamic templating. Existing OpenSearch indexes keep their stored mapping; the new shape only applies to new indexes. Queries keep working because our query generation does not rely on the specific mapping primitives that changed here.

The four add{Audio,Image,Location,Photo}Metadata functions were identical modulo the facet type and metadata key prefix. Replace them with a single generic addFacetMetadata parameterised on the libregraph MappedNullable type. The nil check moves into the generic body via reflect, so the doUpsertItem call site reads as four one-liners.

Twin of Deserialize[T] but for string-typed flat maps (CS3's ArbitraryMetadata is map[string]string, not map[string]any). Parses raw strings into string / bool / intN / uintN / floatN / time.Time / timestamppb.Timestamp target fields via strconv and time.Parse, using the same json-tag walker fillStruct uses. Returns (nil, nil) when no field under the prefix matched so callers can leave the enclosing pointer nil. Prepares a follow-up in services/graph to replace the hand-rolled unmarshalStringMap + four cs3ResourceToDriveItem*Facet functions.

…tringMap services/graph/pkg/service/v0/driveitems.go had a parallel copy of the same hand-rolled reflection walker that services/search/pkg/bleve used to have (unmarshalStringMap + getFieldName), plus four near-identical cs3ResourceToDriveItem{Audio,Image,Location,Photo}Facet wrappers. The mapping package already provides this machinery — point graph at it. A small local setFacet helper swallows parse errors and logs them, matching the original fail-soft behavior (bad metadata leaves the facet nil, it does not propagate). Adding a new facet (motionPhoto, ...) is now one line in the call site of cs3ResourceInfoToDriveItem.

…helper OpenSearchHitToMatch had four near-identical inline closures for the Audio / Image / Location / Photo facets, each calling conversions.To from the libregraph indexed shape to the protobuf searchMessage shape. Replace them with a small local generic helper: func copyFacet[Dst, Src any](src *Src) *Dst The Audio MimeType guard stays visible at the call site instead of hiding inside the closure. Net -10 lines, and adding a new facet (motionPhoto, ...) to this conversion is now one line instead of five.

Showcase commit: after the mapping refactor, turning on geopoint indexing for the Location facet is essentially one line — SearchFieldOverrides now includes: "location": {Type: mapping.TypeGeopoint} Everything else falls out of the existing pipeline: - BleveBuildMapping emits a sub-document at "location" that carries both a geopoint field (for geo_distance queries) and explicit numeric sub-fields for longitude / latitude / altitude. The sub-fields make hit.Fields round-trip the full GeoCoordinates instead of leaving altitude at the mercy of Dynamic mapping; this preserves Move / Delete / Restore. - OpenSearchBuildMapping emits {"type": "geo_point"}; altitude stays available via the OpenSearch _source round-trip. - PrepareForIndex adapts the flat "location" map into a geoPoint struct bleve recognizes (Lat() / Lon() methods + exported longitude / latitude / altitude fields bleve indexes as sub-fields). Covered by unit tests in bleve/geo_verify_test.go: - location.longitude / latitude / altitude all land in hit.Fields - numeric range queries on each of the three sub-fields return the expected hit (and miss when the range excludes the indexed value) - bleve.GeoDistanceQuery matches when inside the radius and misses when the search point is outside

Reshape TypeGeopoint so the libregraph facet (longitude / latitude / altitude) stays as a plain object sub-document — queryable the usual way (e.g. location.latitude:>49) and fully round-trippable via hit.Fields / _source — and a sibling "<name>_geopoint" field carries the {lat, lon} form the spatial indices understand. Same shape on both backends: bleve: sub-document with numeric sub-properties + sibling "<name>_geopoint" with NewGeoPointFieldMapping OpenSearch: { "properties": { longitude, latitude, altitude } } + sibling "<name>_geopoint" with {"type": "geo_point"} PrepareForIndex now walks overrides and, for each TypeGeopoint entry at any dotted path (e.g. "location" or "journey.start"), inserts the {lat, lon} sibling at the same parent level. The original facet map stays untouched — no struct wrapper, no value replacement — so multiple geopoints per facet work automatically (journey.start and journey.end both yield siblings journey.start_geopoint and journey.end_geopoint). OpenSearch's batch Upsert now routes through mapping.PrepareForIndex instead of the bare conversions.To[map[string]any], so a single write path handles the sibling injection for both backends. Drive-by: emit "doc_values": false on TypeWildcard fields to match how OpenSearch normalizes wildcard mappings on read; keeps the Apply up-to-date check stable. Verified with an OpenSearch 2 container (full services/search/pkg/ opensearch test suite green).

The struct → map[string]any step of PrepareForIndex did its own json.Marshal + json.Unmarshal, which is exactly what pkg/conversions provides. Route through conversions.To[map[string]any] and keep the geopoint sibling adapter as the only interesting thing PrepareForIndex actually contributes on top of the generic converter. No behavior change; the generic tests and the live OpenSearch suite still pass.

… fail-soft Two tightly-coupled fixes against the hit → struct pipeline: - Deserialize[T] and DeserializeAt[T] had an unused map[string]FieldOpts parameter left over from the pre-sibling-field geopoint iteration. Every call site passed either nil or the SearchFieldOverrides map, and the function body never looked at it. Drop it — the signature now lines up with DeserializeStringMap. - The previous per-field parse errors (type mismatch in setValue, malformed RFC3339 in parseTime) bubbled up and caused matchTo- Resource to return nil. Callers in bleve/index.go (searchResource- ById, searchResourcesByPath) dereferenced that result directly, so a single corrupt field in a hit could panic Move / Delete / Restore. The pre-refactor getFieldValue[T] based code was fail- soft (type mismatches → zero value, no error). Restore that behavior: per-field decode failures leave the affected field at its zero value; Deserialize only returns an error when T itself isn't a struct. matchToResource now always returns a non-nil Resource. Covered by a new TestDeserializeIsFailSoft unit test that feeds intentionally corrupt fields and asserts the surrounding record still round-trips.

Four separate consumers of the indexed Audio facet each carried the same "only expose Audio when MimeType starts with audio/" check — bleve.matchToResource, bleve.Search, opensearch/convert.Open- SearchHitToMatch, graph.cs3ResourceInfoToDriveItem. It existed as a defensive post-process in case audio.* fields somehow leaked into a non-audio record. In practice the write path only populates audio.* for MimeType audio/* (tika.Extract guards it there), so the read-side check is a pure micro-optimization that never fires on clean data. Drop all four instances; trust the write path. If a future corruption mode actually produces out-of-band audio.* data, the right place to catch that is closer to the ingestion bug, not on every reader.

Three small follow-ups from the review of the series so far: - Cache the Resource SearchFieldOverrides map via sync.OnceValue so hot paths (matchToResource per hit, indexResource per upsert on both backends) reuse the same read-only map instead of rebuilding it from scratch each call. Callers that mutate (only the OpenSearch index builder) keep using maps.Clone. - Turn the OpenSearch IndexManager MarshalJSON dispatch into a map lookup (indexGenerators) instead of a hard-coded equality check against IndexIndexManagerResourceV2. Adding a future resource_v3 variant is now one entry in the map. - Add a DeserializeStringMap test that covers the embedded-struct flattening branch (fillStructFromStrings's fi.Embedded leg) — libregraph's generated types happen to not use embedding, but the walker supports it and the test guards against regressions.

The fail-soft rewrite of setSlice added a reflect.New per element to accommodate per-element failure skipping, which pushed Deserialize- Resource from 68 to ~78 allocs per call. Compact the slice in place instead: pre-allocate once, slide successful writes into j, trim to out[:j] at the end. Same fail-soft semantics (failing items drop out of the result), single MakeSlice regardless of item count. Benchmark delta vs the fail-soft commit: DeserializeResource: 9379 → 8903 ns (-5%), 78 → 70 allocs (-10%)

The bleve KQL compiler kept a hand-maintained set of field names whose query values need to be pre-lowercased to match the lowercasing analyzer at index time. The comment on the set even said "Keep in sync with services/search/pkg/bleve/index.go NewMapping" — a classic drift hazard that's now resolved: the same overrides map that drives the index mapping also drives the compiler. A field ends up in the set if its override selects the lowercase- Keyword analyzer or the fulltext type (whose analyzer lowercases in its token filter). Anything else preserves case both at index time and query time. Behavior unchanged for the current overrides (Name, Tags, Favorites, Content); a future override that picks a lowercase- ing analyzer will get the matching query-side fold for free.

sonarqubecloud · 2026-05-05T12:11:00Z

Quality Gate passed

Issues
6 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

dschmidt force-pushed the refactor/search-mapping branch from ae6414c to 531fd21 Compare April 22, 2026 23:33

dschmidt changed the title ~~refactor(search): build index mappings via reflection + generic helpers~~ refactor: reflection-based search mapping + location geopoint Apr 23, 2026

dschmidt mentioned this pull request Apr 23, 2026

docs(adr): propose unified search index mapping #2662

Open

dschmidt deleted the branch opencloud-eu:main May 5, 2026 10:35

dschmidt closed this May 5, 2026

dschmidt reopened this May 5, 2026

dschmidt changed the base branch from fix/search-preserve-value-case to main May 5, 2026 10:56

dschmidt added 15 commits May 5, 2026 14:00

dschmidt added 7 commits May 5, 2026 14:08

dschmidt force-pushed the refactor/search-mapping branch from 87589de to a25f04a Compare May 5, 2026 12:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: reflection-based search mapping + location geopoint#2659

refactor: reflection-based search mapping + location geopoint#2659
dschmidt wants to merge 22 commits intoopencloud-eu:mainfrom
dschmidt:refactor/search-mapping

dschmidt commented Apr 22, 2026 •

edited

Loading

Uh oh!

dragonchaser commented Apr 29, 2026

Uh oh!

dschmidt commented Apr 30, 2026 •

edited

Loading

Uh oh!

dschmidt commented Apr 30, 2026

Uh oh!

sonarqubecloud Bot commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dschmidt commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Behavior changes (deliberate)

Uh oh!

dragonchaser commented Apr 29, 2026

Uh oh!

dschmidt commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dschmidt commented Apr 30, 2026

Uh oh!

sonarqubecloud Bot commented May 5, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dschmidt commented Apr 22, 2026 •

edited

Loading

dschmidt commented Apr 30, 2026 •

edited

Loading