feat(rewrite): small-model drift tolerance + server-side disposition#175

Draft
lipikaramaswamy wants to merge 2 commits into lipikaramaswamy/feat/small-model-detection-schemas from lipikaramaswamy/feat/small-model-rewrite-schemas

Conversation

@lipikaramaswamy

Collaborator

Summary

Second of two stacked PRs for small-model schema robustness. Stacked on #174 (detection schemas) — review/merge that first; this PR's base is the detection branch and its diff shows only the rewrite-side changes. Supersedes the rewrite portion of #130.

Note: base will be retargeted to main automatically once #174 merges.

Changes

  • shared: loose-list-wrapper helpers so DataDesigner's jsonschema.validate() pre-check accepts a bare top-level list as well as the canonical wrapper (small models routinely emit the bare-list shape).
  • Wire schemas (DomainClassificationSchema, MeaningUnitSchema, QA answer schemas, SimpleDispositionItem/Result): typed as str with before-validators that coerce enum/scalar drift into range.
  • Two-step disposition: SensitivityDispositionWorkflow now emits a loose SimpleDispositionResult then reconstructs the strict EntityDispositionSchema server-side from trusted entity context (disposition_derivation.py), with a pessimistic fallback that prevents whole-record drops.
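
The wire-loose pattern in the list above can be sketched without pydantic. This is a minimal illustration, not the PR's code: `wrap_bare_list`, `coerce_enum_drift`, and the `Sensitivity` enum are hypothetical stand-ins, and falling back to a caller-supplied default for unrecognized values is an assumption.

```python
from enum import Enum
from typing import Any


class Sensitivity(str, Enum):
    """Hypothetical enum; the real value set lives in the PR's schemas."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def wrap_bare_list(payload: Any, key: str) -> Any:
    """Normalize a bare top-level list into the canonical {key: [...]} wrapper."""
    if isinstance(payload, list):
        return {key: payload}
    return payload


def coerce_enum_drift(value: Any, enum_cls: type, default: Enum) -> str:
    """Map casing/whitespace drift onto a valid value; otherwise use the default."""
    if isinstance(value, str):
        candidate = value.strip().lower()
        for member in enum_cls:
            if member.value == candidate:
                return member.value
    # Assumption: unknown or non-string input falls back to the given default.
    return default.value


# A small model emitted a bare list with a mis-cased enum value.
wrapped = wrap_bare_list([{"id": 1, "sensitivity": " High "}], "sensitivity_disposition")
item = wrapped["sensitivity_disposition"][0]
item["sensitivity"] = coerce_enum_drift(item["sensitivity"], Sensitivity, Sensitivity.HIGH)
```

In the PR itself this logic is described as living in pydantic before-validators; the plain functions here only show the normalization each validator would perform.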

Schema-constraint cleanups (per review)

  • Drop min_length on server-reconstructed EntityDisposition.entity_label/entity_value/protection_reason — built from trusted context, not raw model output; the bounds were redundant tripwires.
  • Drop protection_reason max_length; instead cap the passthrough reason in the reconstructor (silent truncate to 500 chars) so a rambling reason can neither drop the record nor flow unbounded into the rewrite prompt/parquet.
  • Drop the now-redundant min/max_length on PrivacyAnswerItem.reason (the _truncate_reason before-validator is the sole, always-applied guard).
  • Keep the two list-level min_length=1 tripwires on the strict disposition containers: they never see raw model output and assert a real pipeline invariant (non-empty disposition when entities were detected).
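
The reason cap described above amounts to a silent slice. A minimal sketch, assuming a hypothetical helper named `cap_reason` and assuming that a non-string reason is treated as absent (the PR does not say how the reconstructor handles that case):

```python
REASON_CAP = 500  # character cap named in the PR description


def cap_reason(reason: object, cap: int = REASON_CAP) -> str:
    """Silently truncate a passthrough reason instead of failing validation."""
    if not isinstance(reason, str):
        # Assumption: a non-string reason is treated as absent.
        return ""
    return reason[:cap]
```

Truncating in the reconstructor rather than enforcing `max_length` on the schema means an overlong reason shortens the text but never rejects the record.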

Description cleanups

Wire-loose str fields (domain, aspect, importance, category, sensitivity, protection_method_suggestion) now enumerate their valid values inline in Field(description=...), derived from the backing enum (drift-proof), instead of referencing an unresolvable Python path — the enum is absent from the JSON schema the model sees, so the description is its only source of truth.
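
One way to derive such a description from the backing enum is sketched below. The `Importance` enum, its members, and the `describe_choices` helper are hypothetical; only the idea of building the description string from the enum comes from the PR text.

```python
from enum import Enum


class Importance(str, Enum):
    """Hypothetical enum used only for illustration."""
    CRITICAL = "critical"
    IMPORTANT = "important"
    MINOR = "minor"


def describe_choices(prefix: str, enum_cls: type) -> str:
    """Append the enum's current values to a field description."""
    values = ", ".join(f"'{member.value}'" for member in enum_cls)
    return f"{prefix} One of: {values}."


IMPORTANCE_DESCRIPTION = describe_choices("How much the unit matters.", Importance)
```

The resulting string would be passed as `Field(description=...)`, so adding or renaming an enum member updates the text the model sees without a separate edit.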

Test plan

  • make test green (899 passing on the stacked branch)
  • make format-check clean
  • New test_disposition_reconstructor, test_entity_label_category_map, rewrite drift classes; test_schemas/test_sensitivity_disposition updated for the two-step pipeline; added a regression test for the reconstructor reason cap
  • Live small-model run (rewrite/disposition/repair paths) before merge

@lipikaramaswamy lipikaramaswamy requested a review from a team as a code owner June 1, 2026 12:12
@lipikaramaswamy lipikaramaswamy force-pushed the lipikaramaswamy/feat/small-model-rewrite-schemas branch from 09d631c to 2ca00d4 June 1, 2026 12:16
@lipikaramaswamy lipikaramaswamy changed the title from "feat(rewrite): tolerate small-model drift on rewrite schemas + server-side disposition" to "feat(rewrite): small-model drift tolerance + server-side disposition" Jun 1, 2026
@greptile-apps

greptile-apps Bot commented Jun 1, 2026

Contributor

Greptile Summary

This PR introduces small-model drift tolerance for the rewrite pipeline by replacing the strict LLM wire schema with a loose SimpleDispositionResult + a deterministic server-side reconstruction step (disposition_derivation.py). It also widens several other schemas (DomainClassificationSchema, MeaningUnitSchema, QA answer schemas) to coerce common small-model output drift instead of raising.

  • Two-step disposition pipeline: LLM now emits SimpleDispositionResult (no enums/required/minLength on the wire); a CustomColumnConfig reconstruction column rebuilds the strict SensitivityDispositionSchema deterministically from entity context, with a pessimistic fallback to prevent whole-record drops.
  • Schema loosening across rewrite: DomainClassificationSchema, MeaningUnitSchema, PrivacyAnswerItemSchema, and the QA compare/quality schemas now coerce drift (casing, bare lists, missing ids, overlong reasons) instead of raising ValidationError.
  • _ENTITY_LABEL_TO_CATEGORY map: New module-level mapping from entity labels to category strings, with a CI guard test ensuring all DEFAULT_ENTITY_LABELS have entries.

Confidence Score: 4/5

Safe to merge after addressing the _coerce_unit container guard; all other changes are well-tested and correctly implement the two-step drift-tolerance pipeline.

The two-step disposition pipeline and the schema-widening changes are solid and well-covered by the new test suite. There is one gap in MeaningUnitSchema._coerce_unit: a small model that emits a list or dict for the unit field would cause a Pydantic ValidationError on that schema item, potentially dropping the meaning-unit record — the same class of failure the entire PR exists to prevent. The fix is a one-liner matching the already-applied guard in SimpleDispositionItem._coerce_scalar_to_str.

src/anonymizer/engine/schemas/rewrite.py — MeaningUnitSchema._coerce_unit

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/anonymizer/engine/schemas/rewrite.py | Large schema update: adds SimpleDispositionItem/Result, widens DomainClassificationSchema and MeaningUnitSchema, loosens EntityDispositionSchema constraints, normalizes QA answer schemas. MeaningUnitSchema._coerce_unit does not handle container inputs, which can still cause a ValidationError when a model emits a list/dict for the unit field. |
| src/anonymizer/engine/rewrite/disposition_derivation.py | New module; implements category/method normalization, risk-level derivation, reason templating, entity-context flattening, and the full disposition reconstructor. Logic is well-documented and covered by tests. |
| src/anonymizer/engine/rewrite/sensitivity_disposition.py | Adds _pessimistic_fallback_disposition and _reconstruct_full_disposition_column; updates SensitivityDispositionWorkflow.columns() to emit the two-step pipeline. Fallback paths are correctly guarded and tested. |
| src/anonymizer/engine/schemas/shared.py | Adds loose_list_wrapper_json_schema and accept_bare_list helpers; graceful fallback when pydantic's inline schema is behind a $ref. |
| tests/engine/test_disposition_reconstructor.py | New test file; comprehensive coverage of all reconstructor helpers, fallback paths, and the round-trip parse contract. |
| tests/engine/test_small_model_drift.py | Extended with rewrite-side drift classes covering DomainClassificationSchema, SimpleDispositionResult, MeaningUnitsSchema, and PrivacyAnswerItemSchema. |
| tests/engine/test_schemas.py | Updated QA coverage tests: raises-on-missing-id tests replaced with padding tests reflecting the new best-effort normalization contract. |
| tests/engine/test_entity_label_category_map.py | CI guard: verifies every DEFAULT_ENTITY_LABELS entry has a category mapping and every map value is a valid EntityCategory. |
| src/anonymizer/engine/constants.py | Adds COL_SIMPLE_DISPOSITION constant for the internal LLM hand-off column. |
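
The CI guard described for test_entity_label_category_map.py could look roughly like the sketch below. All labels, categories, and helper names here are illustrative placeholders, not the project's actual values; only the two checks (every default label is mapped, every mapped value is a valid category) come from the summary above.

```python
from enum import Enum


class EntityCategory(str, Enum):
    """Hypothetical categories; the real enum is defined in the project."""
    DIRECT = "direct_identifier"
    QUASI = "quasi_identifier"


DEFAULT_ENTITY_LABELS = ("email", "phone_number", "city")  # illustrative labels
ENTITY_LABEL_TO_CATEGORY = {
    "email": "direct_identifier",
    "phone_number": "direct_identifier",
    "city": "quasi_identifier",
}


def unmapped_labels(labels, mapping):
    """Labels that have no category entry."""
    return sorted(set(labels) - set(mapping))


def invalid_categories(mapping, enum_cls):
    """Mapped values that are not members of the category enum."""
    valid = {member.value for member in enum_cls}
    return sorted({c for c in mapping.values() if c not in valid})


def test_every_default_label_has_a_valid_category():
    assert unmapped_labels(DEFAULT_ENTITY_LABELS, ENTITY_LABEL_TO_CATEGORY) == []
    assert invalid_categories(ENTITY_LABEL_TO_CATEGORY, EntityCategory) == []
```

Returning the offending names rather than a boolean makes a CI failure say which label or category needs attention.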

Sequence Diagram

sequenceDiagram
    participant DD as DataDesigner
    participant LLM as disposition_analyzer LLM
    participant Recon as _reconstruct_full_disposition_column
    participant Fallback as _pessimistic_fallback_disposition
    participant Down as Downstream consumers

    DD->>LLM: prompt (SimpleDispositionResult wire schema)
    LLM-->>DD: "bare list OR {sensitivity_disposition:[...]} (drifted OK)"
    DD->>DD: jsonschema.validate() — oneOf(wrapper, array) passes
    DD->>DD: pydantic: accept_bare_list + _coerce_scalar_to_str
    DD-->>DD: "COL_SIMPLE_DISPOSITION = SimpleDispositionResult"

    DD->>Recon: row[COL_SIMPLE_DISPOSITION, COL_ENTITIES_BY_VALUE, COL_LATENT_ENTITIES]
    alt simple items present
        Recon->>Recon: reconstruct_full_disposition(simple, ctx)
        alt all items orphan → ValidationError
            Recon->>Fallback: _pessimistic_fallback_disposition(ctx)
            Fallback-->>Recon: SensitivityDispositionSchema (replace/generalize)
        end
    else empty simple output
        Recon->>Fallback: _pessimistic_fallback_disposition(ctx)
        Fallback-->>Recon: SensitivityDispositionSchema (noop if ctx empty)
    end
    Recon-->>DD: "row[COL_SENSITIVITY_DISPOSITION] = full.model_dump()"
    DD->>Down: COL_SENSITIVITY_DISPOSITION (strict SensitivityDispositionSchema)

Reviews (3): Last reviewed commit: "fix(rewrite): close record-drop gaps fla..."

Comment on lines +391 to +406
if not simple.sensitivity_disposition:
    logger.warning(
        "reconstruct: empty SimpleDispositionResult for row; "
        "falling back to pessimistic disposition from entity context"
    )
    full = _pessimistic_fallback_disposition(entities_by_value, latent_entities)
else:
    try:
        full = reconstruct_full_disposition(simple, entities_by_value, latent_entities)
    except ValidationError as exc:
        logger.warning(
            "reconstruct: ValidationError after orphan-skipping (likely all items out of context range); "
            "falling back to pessimistic disposition. detail=%s",
            str(exc)[:200],
        )
        full = _pessimistic_fallback_disposition(entities_by_value, latent_entities)


P1 Uncaught ValidationError from _pessimistic_fallback_disposition can still drop the row

Both call-sites of _pessimistic_fallback_disposition inside this function are unguarded. When context is empty (all slots have empty labels/values — or there are no context rows at all), _pessimistic_fallback_disposition calls SensitivityDispositionSchema(sensitivity_disposition=[]), which raises a ValidationError from the min_length=1 constraint. Because neither call-site is wrapped in its own try/except, that exception propagates out of _reconstruct_full_disposition_column uncaught, and the row still drops — contradicting the PR's stated guarantee of preventing whole-record drops.

The try/except on lines 398–406 guards only the reconstruct_full_disposition call; both _pessimistic_fallback_disposition calls (line 396, in the empty-output branch, and line 406, inside the except handler) fall outside its protection. Wrapping them as well (or catching at the column level) would close the gap.
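
One way to close this gap is to make the fallback itself incapable of raising. The sketch below uses plain dicts, with `ValueError` standing in for pydantic's `ValidationError`; every function name, the `NOOP_ITEM` shape, and the field names are assumptions for illustration, not the PR's implementation.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: a single no-op item keeps the strict container non-empty.
NOOP_ITEM = {"entity_label": "", "entity_value": "", "method": "leave_as_is", "risk": "low"}


def build_strict(items: list) -> dict:
    """Stand-in for the strict schema: enforces the min_length=1 invariant."""
    if not items:
        raise ValueError("sensitivity_disposition must contain at least one item")
    return {"sensitivity_disposition": items}


def pessimistic_fallback(context: list) -> dict:
    """Protect every usable context entity; never raise on empty context."""
    items = [
        {"entity_label": e["label"], "entity_value": e["value"], "method": "replace", "risk": "high"}
        for e in context
        if e.get("label") and e.get("value")
    ]
    if not items:
        logger.warning("fallback: empty entity context; emitting a no-op disposition")
        items = [dict(NOOP_ITEM)]
    return build_strict(items)


def reconstruct_column(simple_items: list, context: list, reconstruct) -> dict:
    """Try the normal reconstruction, then fall back without propagating errors."""
    if simple_items:
        try:
            return reconstruct(simple_items, context)
        except ValueError as exc:
            logger.warning("reconstruct failed, falling back. detail=%s", str(exc)[:200])
    return pessimistic_fallback(context)
```

Because `pessimistic_fallback` always hands `build_strict` at least one item, neither call path can drop the row, and the warning still records that the pipeline invariant was violated.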

Comment on lines +337 to +343
if not items:
    # Genuinely no entities at all in context — the orchestrator should
    # have short-circuited before this step. Still better to raise here
    # than to silently emit garbage; SensitivityDispositionSchema's
    # min_length=1 invariant will surface the bug.
    return SensitivityDispositionSchema(sensitivity_disposition=items)
return SensitivityDispositionSchema(sensitivity_disposition=items)


P2 Duplicate return statements — the if not items: branch and the fallthrough return are identical. The comment says "Still better to raise here" but neither branch explicitly raises; the raise comes implicitly from SensitivityDispositionSchema validation. Since the two paths return the same expression, the if branch is dead code. Either remove the if not items: block (letting validation raise naturally), or replace it with an explicit raise to make the intent clear and avoid the misleading comment.

Suggested change
- if not items:
-     # Genuinely no entities at all in context — the orchestrator should
-     # have short-circuited before this step. Still better to raise here
-     # than to silently emit garbage; SensitivityDispositionSchema's
-     # min_length=1 invariant will surface the bug.
-     return SensitivityDispositionSchema(sensitivity_disposition=items)
- return SensitivityDispositionSchema(sensitivity_disposition=items)
+ # Genuinely no entities at all in context — the orchestrator should
+ # have short-circuited before this step. SensitivityDispositionSchema's
+ # min_length=1 invariant will surface the bug via ValidationError.
+ return SensitivityDispositionSchema(sensitivity_disposition=items)

Comment on lines +464 to +470
@classmethod
def _coerce_scalar_to_str(cls, v: object) -> str:
    if v is None:
        return ""
    if isinstance(v, (int, float, bool)):
        return str(v)
    return v


P2 _coerce_scalar_to_str falls through for any type other than None, int, float, or bool, including list and dict. If a small model emits a list value for entity_label or category, the validator returns the list unchanged, and Pydantic v2 will raise a ValidationError for the entire SimpleDispositionItem rather than gracefully coercing to "". Adding a final fallback that returns "" for any non-string value would match the validator's intent of making all scalar (and unexpectedly typed) values survive.

Suggested change
- @classmethod
- def _coerce_scalar_to_str(cls, v: object) -> str:
-     if v is None:
-         return ""
-     if isinstance(v, (int, float, bool)):
-         return str(v)
-     return v
+ @classmethod
+ def _coerce_scalar_to_str(cls, v: object) -> str:
+     if v is None:
+         return ""
+     if isinstance(v, str):
+         return v
+     if isinstance(v, (int, float, bool)):
+         return str(v)
+     # Unexpected type (list, dict, etc.) from a drifted model response —
+     # coerce to empty string rather than letting pydantic raise on str validation.
+     return ""


@lipikaramaswamy lipikaramaswamy force-pushed the lipikaramaswamy/feat/small-model-rewrite-schemas branch 2 times, most recently from 34004d2 to 5d6dd38 June 1, 2026 12:26
…-side disposition

Stacked on the detection-schema PR. Loosen the LLM-facing rewrite wire
schemas and move disposition to a two-step (loose-wire → server-reconstruct)
pipeline so small models no longer drop records:

- shared: loose-list-wrapper helpers so DD's jsonschema pre-check accepts a
  bare top-level list as well as the canonical wrapper.
- domain/meaning-unit/QA/disposition wire schemas typed as str with
  before-validators that coerce enum/scalar drift into range.
- SimpleDispositionResult/Item: loose wire contract; reconstruct_full_disposition
  pairs it with trusted entity context to build the strict EntityDispositionSchema,
  with a pessimistic fallback that prevents whole-record drops.

Schema-constraint cleanups (per review):
- Drop min_length on server-reconstructed EntityDisposition.entity_label /
  entity_value / protection_reason (these are built from trusted context, not
  raw model output; the bounds were redundant tripwires).
- Drop protection_reason max_length and instead cap a passthrough reason in the
  reconstructor (silent truncate to 500 chars) so a rambling reason can neither
  drop the record nor flow unbounded into the rewrite prompt/parquet.
- Drop the now-redundant min/max_length on PrivacyAnswerItem.reason (the
  _truncate_reason before-validator is the sole, always-applied guard).
- Keep the two list-level min_length=1 tripwires on the strict disposition
  containers: they never see raw model output and assert a real pipeline
  invariant (non-empty disposition when entities were detected).

Description cleanups: wire-loose str fields (domain, aspect, importance,
category, sensitivity, protection_method_suggestion) now enumerate their valid
values inline in Field(description=...), derived from the backing enum, instead
of referencing an unresolvable Python path — the enum is absent from the JSON
schema the model sees, so the description is its only source of truth.
Two small-model robustness gaps surfaced by Greptile on the disposition path:

- _pessimistic_fallback_disposition could still drop a row: with empty entity
  context it built an empty SensitivityDispositionSchema, which raises on the
  min_length=1 tripwire (and again in the downstream parser), and both
  call-sites in _reconstruct_full_disposition_column were unguarded. It now
  emits a single no-op (leave_as_is/low) disposition in that pipeline-invariant
  case, logging loudly — the row survives and the no-op never reaches the
  rewrite (excluded from protected_entities). Also removes the dead duplicate
  return branch.
- SimpleDispositionItem._coerce_scalar_to_str now coerces unexpected container
  types (list/dict) to "" instead of returning them unchanged, so one drifted
  field no longer fails the whole item (which would discard every disposition
  for the row and force a pessimistic fallback).

Adds regression tests for the empty-context no-op guarantee (function + column
level) and container-value coercion.
@lipikaramaswamy lipikaramaswamy force-pushed the lipikaramaswamy/feat/small-model-rewrite-schemas branch from 5d6dd38 to 1953d43 June 1, 2026 13:37
Comment on lines +591 to +598
@field_validator("unit", mode="before")
@classmethod
def _coerce_unit(cls, v: object) -> str:
    if v is None:
        return ""
    if isinstance(v, (int, float, bool)):
        return str(v)
    return v


P1 _coerce_unit returns container values (list/dict) unchanged, so Pydantic raises a ValidationError for the whole MeaningUnitSchema item when a small model emits a list or dict for unit. The analogous validator on SimpleDispositionItem._coerce_scalar_to_str was fixed to return "" for containers — the same guard is needed here to keep a single drifted unit from dropping the entire meaning-units record.

Suggested change
- @field_validator("unit", mode="before")
- @classmethod
- def _coerce_unit(cls, v: object) -> str:
-     if v is None:
-         return ""
-     if isinstance(v, (int, float, bool)):
-         return str(v)
-     return v
+ @field_validator("unit", mode="before")
+ @classmethod
+ def _coerce_unit(cls, v: object) -> str:
+     if v is None:
+         return ""
+     if isinstance(v, (int, float, bool)):
+         return str(v)
+     if not isinstance(v, str):
+         # Unexpected container (list/dict) from a drifted response: coerce
+         # to "" rather than letting pydantic raise on the whole MeaningUnitSchema.
+         return ""
+     return v
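
The guarded behavior can be checked in isolation. The function below is a standalone restatement of the suggested validator body, written without pydantic so it runs on its own; the name `coerce_unit` and the sample inputs are illustrative.

```python
def coerce_unit(value: object) -> str:
    """Standalone version of the guarded coercion: every input maps to a str."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value
    if isinstance(value, (int, float, bool)):
        return str(value)
    # Containers and any other unexpected type collapse to an empty string.
    return ""


# Drifted inputs a small model might emit for the unit field.
samples = [None, "a claim", 3, True, ["a", "b"], {"text": "a claim"}]
coerced = [coerce_unit(s) for s in samples]
```

A regression test along these lines would assert that no sample produces a non-string, which is the property that keeps one drifted field from failing the whole item.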

@lipikaramaswamy lipikaramaswamy marked this pull request as draft June 2, 2026 17:44