Write-back escape hatch: persist derived facts as sidecar :CldkFact annotation nodes (Neo4j) #190

Description

@rahlk

Is your feature request related to a problem? Please describe.

The Neo4j backends are strictly read-only — they poll a graph populated out of band and never write. But consumers (triage/agent workflows) increasingly compute derived facts about symbols — a risk score, a "reviewed" flag, a label, a provenance note — and have nowhere to put them. Today the only options are to keep that state in a side store (loses the graph join) or hand-write Cypher (leaks schema, unconstrained, can clobber analyzer data).

We want a general but constrained escape hatch to write facts back, keyed to the symbols they describe — without turning the read client into an unconstrained graph editor and without ever mutating analyzer-emitted nodes/properties.

Describe the solution you'd like

A small, opt-in, namespaced write-back surface on the Neo4j backend that persists facts as sidecar annotation nodes, never touching analyzer-owned data. Symbols are already signature-keyed (:PySymbol/:TSSymbol/Java symbol), so facts attach off the signature.

Schema (sidecar nodes — the chosen shape)

(s {signature})-[:CLDK_FACT]->(:CldkFact { key, value, source, created_at })
  • One :CldkFact per (symbol, key); writes upsert via MERGE so re-writing a key updates in place.
  • value is stored as a string (either with an optional value_type property so non-strings can round-trip, or JSON-encoded); source is free-form provenance (e.g. the agent/run name); created_at is set server-side via Cypher datetime().
  • App-scoped: the fact rides the symbol's existing application scope (matched within _module IN $mods); stamp the owning _module/app on :CldkFact too so facts are isolable and bulk-removable per application.
  • Reserved namespace (:CldkFact label + CLDK_FACT relationship) guarantees the analyzer can re-emit and the SDK can re-write without either clobbering the other.
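
A minimal sketch of the upsert described above, assuming facts are written with parameterized Cypher. The names UPSERT_FACT and upsert_params and the parameter names ($signature, $key, $value, $source, $mods) are hypothetical; only the :CldkFact / CLDK_FACT shape, the MERGE upsert, the _module IN $mods scoping, and datetime() come from this proposal. Setting created_at only on first creation is also an assumption; the design may prefer to refresh it on every write.

```python
from __future__ import annotations

from typing import Any, Optional

# Hypothetical statement: MERGE on the (symbol)-[:CLDK_FACT]->(:CldkFact {key})
# pattern keeps one fact node per (symbol, key), so re-writing a key updates in place.
# A real query would likely add the symbol label (e.g. :PySymbol) to use an index.
UPSERT_FACT = """
MATCH (s {signature: $signature})
WHERE s._module IN $mods
MERGE (s)-[:CLDK_FACT]->(f:CldkFact {key: $key})
ON CREATE SET f.created_at = datetime()
SET f.value = $value,
    f.source = $source,
    f._module = s._module
"""


def upsert_params(
    signature: str,
    key: str,
    value: Any,
    mods: list[str],
    source: Optional[str] = None,
) -> dict[str, Any]:
    """Build the parameter map for UPSERT_FACT; the value is stored as a string."""
    return {
        "signature": signature,
        "key": key,
        "value": str(value),
        "source": source,
        "mods": list(mods),
    }
```

Because the symbol is only matched and never written, the statement cannot alter analyzer-owned properties; if no symbol with that signature exists in scope, nothing is created.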

Write API (Neo4j backend; opt-in)

set_fact(signature, key, value, *, source=None)      # upsert one
set_facts(signature, {k: v, ...}, *, source=None)     # upsert many on one symbol
set_facts_for({signature: {k: v}}, *, source=None)    # bulk, batched in one statement
get_facts(signature) -> dict                          # read back
unset_fact(signature, key)                            # remove one fact from a symbol
unset_facts(signature, keys=None)                     # remove several (or all on the symbol when keys is None)
clear_cldk_facts()                                    # remove ALL CLDK facts across the application (quick reset)

This is the only mutation path; everything else stays read-only. The removal calls touch only the reserved namespace (the :CldkFact label and CLDK_FACT relationship): unset_fact and unset_facts delete the matching :CldkFact nodes for a symbol, and clear_cldk_facts() deletes every :CldkFact reachable within this application's scope (_module IN $mods). They never touch analyzer nodes or another application's facts.
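
One way the writer surface could be laid out, as a sketch rather than the intended implementation. FactWriter, the injected run callable (anything that executes one Cypher statement with a parameter map, for example a thin wrapper around a driver session), and the statement constants are all hypothetical; the method names follow the API listed above. get_facts is omitted for brevity.

```python
from __future__ import annotations

from typing import Any, Callable, Iterable, Mapping, Optional

# Hypothetical: executes one Cypher statement with its parameters.
Run = Callable[[str, dict], Any]

BULK_UPSERT = """
UNWIND $rows AS row
MATCH (s {signature: row.signature})
WHERE s._module IN $mods
MERGE (s)-[:CLDK_FACT]->(f:CldkFact {key: row.key})
ON CREATE SET f.created_at = datetime()
SET f.value = row.value, f.source = $source, f._module = s._module
"""

UNSET = """
MATCH (s {signature: $signature})-[:CLDK_FACT]->(f:CldkFact)
WHERE s._module IN $mods AND ($keys IS NULL OR f.key IN $keys)
DETACH DELETE f
"""

CLEAR = """
MATCH (s)-[:CLDK_FACT]->(f:CldkFact)
WHERE s._module IN $mods
DETACH DELETE f
"""


class FactWriter:
    """Sketch of an explicit writer surface kept apart from the read methods."""

    def __init__(self, run: Run, mods: Iterable[str]) -> None:
        self._run = run
        self._mods = list(mods)

    def set_facts_for(
        self, facts: Mapping[str, Mapping[str, Any]], *, source: Optional[str] = None
    ) -> None:
        # Flatten {signature: {key: value}} into rows so one UNWIND statement covers the batch.
        rows = [
            {"signature": sig, "key": key, "value": str(value)}
            for sig, kv in facts.items()
            for key, value in kv.items()
        ]
        if rows:
            self._run(BULK_UPSERT, {"rows": rows, "mods": self._mods, "source": source})

    def set_facts(
        self, signature: str, facts: Mapping[str, Any], *, source: Optional[str] = None
    ) -> None:
        self.set_facts_for({signature: facts}, source=source)

    def set_fact(
        self, signature: str, key: str, value: Any, *, source: Optional[str] = None
    ) -> None:
        self.set_facts(signature, {key: value}, source=source)

    def unset_facts(self, signature: str, keys: Optional[Iterable[str]] = None) -> None:
        # keys=None removes every fact on the symbol.
        params = {
            "signature": signature,
            "keys": None if keys is None else list(keys),
            "mods": self._mods,
        }
        self._run(UNSET, params)

    def unset_fact(self, signature: str, key: str) -> None:
        self.unset_facts(signature, [key])

    def clear_cldk_facts(self) -> None:
        self._run(CLEAR, {"mods": self._mods})
```

Every delete statement matches only :CldkFact nodes reached through CLDK_FACT from an in-scope symbol, so analyzer nodes and other applications' facts are outside its reach by construction.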

Read side — hydrate, don't pollute the read schema

Add a facts: dict[str, Any] = {} field to the cldk-owned projection models (PyCallableOverview, and the forthcoming TSCallableOverview/JCallableOverview from #189) and populate it from :CldkFact. The upstream analyzer models (PyCallable, etc., owned by codeanalyzer-python) are left untouched — we don't fork their schema.
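
A sketch of what the hydration step might look like. CallableOverviewSketch is a hypothetical stand-in for PyCallableOverview (the real model may be defined differently), and hydrate_facts plus the row shape (signature, key, value) are assumptions. If the projection models are plain dataclasses, the empty default needs field(default_factory=dict), since dataclasses reject a literal {} default.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Iterable, Mapping


@dataclass
class CallableOverviewSketch:
    """Hypothetical stand-in for a cldk-owned projection model."""

    signature: str
    # Each instance gets its own dict; a shared mutable default is not allowed.
    facts: dict[str, Any] = field(default_factory=dict)


def hydrate_facts(
    overviews: Iterable[CallableOverviewSketch],
    rows: Iterable[Mapping[str, Any]],
) -> None:
    """Attach fact rows (signature, key, value) to the matching overviews in place."""
    by_signature = {o.signature: o for o in overviews}
    for row in rows:
        target = by_signature.get(row["signature"])
        if target is not None:  # ignore facts for symbols not in this projection
            target.facts[row["key"]] = row["value"]
```

Only the projection model carries the facts field here; nothing is added to the upstream analyzer models.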

In-process backend

No persistent store, so the write methods raise a clear NotSupportedError ("fact write-back requires the Neo4j backend"). Keeps the ABC honest without pretending to persist.
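
A sketch of that behaviour under assumed names: FactStore and InProcessFactStore are hypothetical, and NotSupportedError is defined locally only so the example is self-contained. Returning an empty dict from get_facts is an assumption; the proposal only says that the write methods raise.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any, Optional

_MESSAGE = "fact write-back requires the Neo4j backend"


class NotSupportedError(RuntimeError):
    """Raised when a backend cannot persist facts (defined here for the sketch)."""


class FactStore(ABC):
    """Hypothetical slice of the backend ABC covering fact write-back."""

    @abstractmethod
    def set_fact(
        self, signature: str, key: str, value: Any, *, source: Optional[str] = None
    ) -> None: ...

    @abstractmethod
    def get_facts(self, signature: str) -> dict[str, Any]: ...


class InProcessFactStore(FactStore):
    """In-process backend: nothing to persist to, so writes fail loudly."""

    def set_fact(
        self, signature: str, key: str, value: Any, *, source: Optional[str] = None
    ) -> None:
        raise NotSupportedError(_MESSAGE)

    def get_facts(self, signature: str) -> dict[str, Any]:
        # Assumption: reads stay harmless and report no facts.
        return {}
```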

Describe alternatives you've considered

  • Reserved property namespace (SET s += {cldk_facts: '<json>'} on the node): simplest and co-located, but it mutates analyzer nodes (so a re-emit can clobber the facts), carries no provenance, and keeps no history. Rejected for the agent-facts use case, where separability and provenance matter.
  • A context/metadata field on the dataclasses (the original idea) — can't be added to the upstream Python models without forking codeanalyzer-python's schema; conflates read-schema with write-payload; and leaves persistence semantics (where/when/how it's stored) undefined. The facts hydration above gives the same ergonomics on the read side without these problems.

Additional context

  • Cross-language by construction: :PySymbol / :TSSymbol / the Java symbol are all signature-keyed, so the same :CldkFact pattern applies to all three Neo4j backends (PyNeo4jBackend, TSNeo4jBackend, and the Java Neo4j backend). Suggested sequencing: prototype on Python first, then mirror.
  • Deliberately breaks the read-only invariant in one clearly-separated, documented place — keep it as an explicit writer surface (e.g. a facts sub-API or a writer mixin), not sprinkled into the read methods.
  • Pairs with #189 ("Bulk/projected accessors: extend across entities (classes/modules/fields/hierarchy/graph) and to TypeScript & Java"), where the projection models gain the facts field.
  • Open sub-questions for the design review: typed values vs string/JSON; whether to allow fact-bearing nodes/edges beyond per-symbol (e.g. facts on call edges); and whether :CldkFact should be uniquely constrained per (symbol,key) via a Neo4j constraint.
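
To make the "typed values vs string/JSON" question concrete, here is one possible encoding, sketched under assumptions: strings are stored as-is and everything else is JSON-encoded, with a value_type tag recording which path was taken. The helper names and the "str"/"json" tag values are hypothetical.

```python
from __future__ import annotations

import json
from typing import Any


def encode_value(value: Any) -> tuple[str, str]:
    """Return (stored_string, value_type) for a fact value."""
    if isinstance(value, str):
        return value, "str"
    # Non-strings must be JSON-serializable; json.dumps raises TypeError otherwise.
    return json.dumps(value), "json"


def decode_value(stored: str, value_type: str) -> Any:
    """Invert encode_value using the stored value_type tag."""
    return stored if value_type == "str" else json.loads(stored)
```

With a tag like this, a risk score of 0.7 or a reviewed flag of True would come back with its original type, while the stored value stays a plain string that Cypher queries can still filter on.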

Metadata

    Labels

    enhancement (New feature or request)
