feat(rewrite): small-model drift tolerance + server-side disposition by lipikaramaswamy · Pull Request #175 · NVIDIA-NeMo/Anonymizer

lipikaramaswamy · 2026-06-01T12:12:42Z

Summary

Second of two stacked PRs for small-model schema robustness. Stacked on #174 (detection schemas) — review/merge that first; this PR's base is the detection branch and its diff shows only the rewrite-side changes. Supersedes the rewrite portion of #130.

Note: base will be retargeted to main automatically once #174 merges.

Changes

shared: loose-list-wrapper helpers so DataDesigner's jsonschema.validate() pre-check accepts a bare top-level list as well as the canonical wrapper (small models routinely emit the bare-list shape).
Wire schemas (DomainClassificationSchema, MeaningUnitSchema, QA answer schemas, SimpleDispositionItem/Result): typed as str with before-validators that coerce enum/scalar drift into range.
Two-step disposition: SensitivityDispositionWorkflow now emits a loose SimpleDispositionResult then reconstructs the strict EntityDispositionSchema server-side from trusted entity context (disposition_derivation.py), with a pessimistic fallback that prevents whole-record drops.
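
The bare-list tolerance in the first item above can be sketched with the standard library alone: a JSON Schema fragment that accepts either shape, plus a normalizer that wraps a bare list before strict parsing. This is an illustrative sketch, not the PR's code; `loose_wrapper_schema` and `wrap_bare_list` are hypothetical names, and the real helpers in shared.py may be shaped differently.

```python
from typing import Any


def loose_wrapper_schema(key: str, item_schema: dict[str, Any]) -> dict[str, Any]:
    """Build a JSON Schema that accepts {key: [...]} or a bare top-level list."""
    array = {"type": "array", "items": item_schema}
    wrapper = {"type": "object", "properties": {key: array}, "required": [key]}
    # An object and an array can never both match, so oneOf is unambiguous.
    return {"oneOf": [wrapper, array]}


def wrap_bare_list(value: Any, key: str) -> Any:
    """Normalize a bare list into the canonical wrapper; leave other shapes alone."""
    if isinstance(value, list):
        return {key: value}
    return value
```

In this sketch the schema keeps the pre-check from rejecting the bare list, and the normalizer lets the strict model see only the wrapper shape.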

Schema-constraint cleanups (per review)

Drop min_length on server-reconstructed EntityDisposition.entity_label/entity_value/protection_reason — built from trusted context, not raw model output; the bounds were redundant tripwires.
Drop protection_reason max_length; instead cap the passthrough reason in the reconstructor (silent truncate to 500 chars) so a rambling reason can neither drop the record nor flow unbounded into the rewrite prompt/parquet.
Drop the now-redundant min/max_length on PrivacyAnswerItem.reason (the _truncate_reason before-validator is the sole, always-applied guard).
Keep the two list-level min_length=1 tripwires on the strict disposition containers: they never see raw model output and assert a real pipeline invariant (non-empty disposition when entities were detected).
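
As a rough illustration of the reason cap described above, a silent truncation can replace a max_length constraint so that an overlong reason is shortened instead of rejected. The helper name `cap_reason` is hypothetical, and mapping non-string input to an empty string is an assumption of this sketch, not something the PR states.

```python
MAX_REASON_CHARS = 500  # the cap named in the PR description


def cap_reason(reason: object, limit: int = MAX_REASON_CHARS) -> str:
    """Truncate a passthrough reason silently so it can never fail validation."""
    # Assumption: anything that is not a string is treated as "no reason given".
    text = reason if isinstance(reason, str) else ""
    return text[:limit]
```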

Description cleanups

Wire-loose str fields (domain, aspect, importance, category, sensitivity, protection_method_suggestion) now enumerate their valid values inline in Field(description=...), derived from the backing enum (drift-proof), instead of referencing an unresolvable Python path — the enum is absent from the JSON schema the model sees, so the description is its only source of truth.
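
One way to keep such a description in step with its enum is to render the enum's values into the text at import time. The sketch below assumes a stand-in `Sensitivity` enum with made-up members and a hypothetical `describe_choices` helper; the project's actual enums and wording are not shown in this PR page.

```python
from enum import Enum


class Sensitivity(str, Enum):  # hypothetical stand-in for the backing enum
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def describe_choices(prefix: str, enum_cls: type[Enum]) -> str:
    """Render the enum's values into a description so the two cannot diverge."""
    values = ", ".join(f"'{member.value}'" for member in enum_cls)
    return f"{prefix} One of: {values}."
```

The resulting string would be passed as the field's description, so adding a member to the enum updates the text the model sees without a second edit.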

Test plan

make test green (899 passing on the stacked branch)
make format-check clean
New test_disposition_reconstructor, test_entity_label_category_map, rewrite drift classes; test_schemas/test_sensitivity_disposition updated for the two-step pipeline; added a regression test for the reconstructor reason cap
Live small-model run (rewrite/disposition/repair paths) before merge

greptile-apps · 2026-06-01T12:21:29Z

Greptile Summary

This PR introduces small-model drift tolerance for the rewrite pipeline by replacing the strict LLM wire schema with a loose SimpleDispositionResult + a deterministic server-side reconstruction step (disposition_derivation.py). It also widens several other schemas (DomainClassificationSchema, MeaningUnitSchema, QA answer schemas) to coerce common small-model output drift instead of raising.

Two-step disposition pipeline: LLM now emits SimpleDispositionResult (no enums/required/minLength on the wire); a CustomColumnConfig reconstruction column rebuilds the strict SensitivityDispositionSchema deterministically from entity context, with a pessimistic fallback to prevent whole-record drops.
Schema loosening across rewrite: DomainClassificationSchema, MeaningUnitSchema, PrivacyAnswerItemSchema, and the QA compare/quality schemas now coerce drift (casing, bare lists, missing ids, overlong reasons) instead of raising ValidationError.
_ENTITY_LABEL_TO_CATEGORY map: New module-level mapping from entity labels to category strings, with a CI guard test ensuring all DEFAULT_ENTITY_LABELS have entries.
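
A guard of the kind described for the label-to-category map might look like the following. The labels, categories, and the `check_label_map` helper are all hypothetical placeholders; only the two checks (every default label is mapped, every mapped value is a valid category) come from the summary above.

```python
DEFAULT_LABELS = ("first_name", "email", "city")  # hypothetical labels
VALID_CATEGORIES = {"direct_identifier", "quasi_identifier"}  # hypothetical categories
LABEL_TO_CATEGORY = {
    "first_name": "direct_identifier",
    "email": "direct_identifier",
    "city": "quasi_identifier",
}


def check_label_map(labels, mapping, valid_categories) -> list[str]:
    """Return human-readable problems; an empty list means the map is complete."""
    problems = [f"missing category for label: {label}" for label in labels if label not in mapping]
    problems += [
        f"invalid category for {label}: {category}"
        for label, category in mapping.items()
        if category not in valid_categories
    ]
    return problems
```

A test would then assert that the returned list is empty, so a newly added label without a mapping fails CI rather than surfacing at runtime.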

Confidence Score: 4/5

Safe to merge after addressing the _coerce_unit container guard; all other changes are well-tested and correctly implement the two-step drift-tolerance pipeline.

The two-step disposition pipeline and the schema-widening changes are solid and well-covered by the new test suite. There is one gap in MeaningUnitSchema._coerce_unit: a small model that emits a list or dict for the unit field would cause a Pydantic ValidationError on that schema item, potentially dropping the meaning-unit record — the same class of failure the entire PR exists to prevent. The fix is a one-liner matching the already-applied guard in SimpleDispositionItem._coerce_scalar_to_str.

src/anonymizer/engine/schemas/rewrite.py — MeaningUnitSchema._coerce_unit

Important Files Changed

Filename	Overview
src/anonymizer/engine/schemas/rewrite.py	Large schema update: adds SimpleDispositionItem/Result, widens DomainClassificationSchema and MeaningUnitSchema, loosens EntityDispositionSchema constraints, normalizes QA answer schemas. MeaningUnitSchema._coerce_unit does not handle container inputs, which can still cause a ValidationError when a model emits a list/dict for the unit field.
src/anonymizer/engine/rewrite/disposition_derivation.py	New module; implements category/method normalization, risk-level derivation, reason templating, entity-context flattening, and the full disposition reconstructor. Logic is well-documented and covered by tests.
src/anonymizer/engine/rewrite/sensitivity_disposition.py	Adds _pessimistic_fallback_disposition and _reconstruct_full_disposition_column; updates SensitivityDispositionWorkflow.columns() to emit the two-step pipeline. Fallback paths are correctly guarded and tested.
src/anonymizer/engine/schemas/shared.py	Adds loose_list_wrapper_json_schema and accept_bare_list helpers; graceful fallback when pydantic's inline schema is behind a $ref.
tests/engine/test_disposition_reconstructor.py	New test file; comprehensive coverage of all reconstructor helpers, fallback paths, and the round-trip parse contract.
tests/engine/test_small_model_drift.py	Extended with rewrite-side drift classes covering DomainClassificationSchema, SimpleDispositionResult, MeaningUnitsSchema, and PrivacyAnswerItemSchema.
tests/engine/test_schemas.py	Updated QA coverage tests: raises-on-missing-id tests replaced with padding tests reflecting the new best-effort normalization contract.
tests/engine/test_entity_label_category_map.py	CI guard: verifies every DEFAULT_ENTITY_LABELS entry has a category mapping and every map value is a valid EntityCategory.
src/anonymizer/engine/constants.py	Adds COL_SIMPLE_DISPOSITION constant for the internal LLM hand-off column.

Sequence Diagram

sequenceDiagram
    participant DD as DataDesigner
    participant LLM as disposition_analyzer LLM
    participant Recon as _reconstruct_full_disposition_column
    participant Fallback as _pessimistic_fallback_disposition
    participant Down as Downstream consumers

    DD->>LLM: prompt (SimpleDispositionResult wire schema)
    LLM-->>DD: "bare list OR {sensitivity_disposition:[...]} (drifted OK)"
    DD->>DD: jsonschema.validate() — oneOf(wrapper, array) passes
    DD->>DD: pydantic: accept_bare_list + _coerce_scalar_to_str
    DD-->>DD: "COL_SIMPLE_DISPOSITION = SimpleDispositionResult"

    DD->>Recon: row[COL_SIMPLE_DISPOSITION, COL_ENTITIES_BY_VALUE, COL_LATENT_ENTITIES]
    alt simple items present
        Recon->>Recon: reconstruct_full_disposition(simple, ctx)
        alt all items orphan → ValidationError
            Recon->>Fallback: _pessimistic_fallback_disposition(ctx)
            Fallback-->>Recon: SensitivityDispositionSchema (replace/generalize)
        end
    else empty simple output
        Recon->>Fallback: _pessimistic_fallback_disposition(ctx)
        Fallback-->>Recon: SensitivityDispositionSchema (noop if ctx empty)
    end
    Recon-->>DD: "row[COL_SENSITIVITY_DISPOSITION] = full.model_dump()"
    DD->>Down: COL_SENSITIVITY_DISPOSITION (strict SensitivityDispositionSchema)

Reviews (3): Last reviewed commit: "fix(rewrite): close record-drop gaps fla..."

greptile-apps · 2026-06-01T12:21:33Z

+    if not simple.sensitivity_disposition:
+        logger.warning(
+            "reconstruct: empty SimpleDispositionResult for row; "
+            "falling back to pessimistic disposition from entity context"
+        )
+        full = _pessimistic_fallback_disposition(entities_by_value, latent_entities)
+    else:
+        try:
+            full = reconstruct_full_disposition(simple, entities_by_value, latent_entities)
+        except ValidationError as exc:
+            logger.warning(
+                "reconstruct: ValidationError after orphan-skipping (likely all items out of context range); "
+                "falling back to pessimistic disposition. detail=%s",
+                str(exc)[:200],
+            )
+            full = _pessimistic_fallback_disposition(entities_by_value, latent_entities)


Uncaught ValidationError from _pessimistic_fallback_disposition can still drop the row

Both call-sites of _pessimistic_fallback_disposition inside this function are unguarded. When context is empty (all slots have empty labels/values — or there are no context rows at all), _pessimistic_fallback_disposition calls SensitivityDispositionSchema(sensitivity_disposition=[]), which raises a ValidationError from the min_length=1 constraint. Because neither call-site is wrapped in its own try/except, that exception propagates out of _reconstruct_full_disposition_column uncaught, and the row still drops — contradicting the PR's stated guarantee of preventing whole-record drops.

The try/except on lines 398–406 only covers reconstruct_full_disposition; both _pessimistic_fallback_disposition calls (lines 396 and 406) are outside its scope. Wrapping them as well (or catching at the column level) would close the gap.
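
The "catch at the column level" option can be sketched as a chain of steps that each may fail, ending in a no-op that cannot. This is only an illustration of the control flow: `build_with_fallbacks` and `noop_disposition` are hypothetical names, the no-op's fields are assumed, and `ValueError` stands in for the schema's validation error.

```python
import logging

logger = logging.getLogger(__name__)


def noop_disposition() -> list[dict]:
    # Hypothetical no-op shape; the real schema fields may differ.
    return [{"protection_method": "leave_as_is", "risk_level": "low"}]


def build_with_fallbacks(reconstruct, fallback, simple_items, context):
    """Try reconstruct, then the pessimistic fallback, then a no-op; never raise."""
    steps = []
    if simple_items:
        steps.append(lambda: reconstruct(simple_items, context))
    steps.append(lambda: fallback(context))
    for step in steps:
        try:
            return step()
        except ValueError as exc:  # stand-in for the schema's validation error
            logger.warning("disposition step failed; trying next: %s", str(exc)[:200])
    return noop_disposition()
```

Because every failing step falls through to the next one, an exception from the fallback no longer escapes the column function, and the row survives with a no-op disposition.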

greptile-apps · 2026-06-01T12:21:33Z

+    if not items:
+        # Genuinely no entities at all in context — the orchestrator should
+        # have short-circuited before this step. Still better to raise here
+        # than to silently emit garbage; SensitivityDispositionSchema's
+        # min_length=1 invariant will surface the bug.
+        return SensitivityDispositionSchema(sensitivity_disposition=items)
+    return SensitivityDispositionSchema(sensitivity_disposition=items)


Duplicate return statements — the if not items: branch and the fallthrough return are identical. The comment says "Still better to raise here" but neither branch explicitly raises; the raise comes implicitly from SensitivityDispositionSchema validation. Since the two paths return the same expression, the if branch is dead code. Either remove the if not items: block (letting validation raise naturally), or replace it with an explicit raise to make the intent clear and avoid the misleading comment.

Suggested change

-    if not items:
-        # Genuinely no entities at all in context — the orchestrator should
-        # have short-circuited before this step. Still better to raise here
-        # than to silently emit garbage; SensitivityDispositionSchema's
-        # min_length=1 invariant will surface the bug.
-        return SensitivityDispositionSchema(sensitivity_disposition=items)
-    return SensitivityDispositionSchema(sensitivity_disposition=items)
+    # Genuinely no entities at all in context — the orchestrator should
+    # have short-circuited before this step. SensitivityDispositionSchema's
+    # min_length=1 invariant will surface the bug via ValidationError.
+    return SensitivityDispositionSchema(sensitivity_disposition=items)

greptile-apps · 2026-06-01T12:21:34Z

+    @classmethod
+    def _coerce_scalar_to_str(cls, v: object) -> str:
+        if v is None:
+            return ""
+        if isinstance(v, (int, float, bool)):
+            return str(v)
+        return v


_coerce_scalar_to_str falls through for any type other than None, int, float, or bool — including list and dict. If a small model emits a list value for entity_label or category, the validator returns the list unchanged, and Pydantic v2 will raise a ValidationError for the entire SimpleDispositionItem rather than gracefully coercing to "". Adding a final string-cast fallback would match the validator's intent of making all scalar (and unexpectedly typed) values survive.

Suggested change

-    @classmethod
-    def _coerce_scalar_to_str(cls, v: object) -> str:
-        if v is None:
-            return ""
-        if isinstance(v, (int, float, bool)):
-            return str(v)
-        return v
+    @classmethod
+    def _coerce_scalar_to_str(cls, v: object) -> str:
+        if v is None:
+            return ""
+        if isinstance(v, str):
+            return v
+        if isinstance(v, (int, float, bool)):
+            return str(v)
+        # Unexpected type (list, dict, etc.) from a drifted model response —
+        # coerce to empty string rather than letting pydantic raise on str validation.
+        return ""


…-side disposition

Stacked on the detection-schema PR. Loosen the LLM-facing rewrite wire schemas and move disposition to a two-step (loose-wire → server-reconstruct) pipeline so small models no longer drop records:

- shared: loose-list-wrapper helpers so DD's jsonschema pre-check accepts a bare top-level list as well as the canonical wrapper.
- domain/meaning-unit/QA/disposition wire schemas typed as str with before-validators that coerce enum/scalar drift into range.
- SimpleDispositionResult/Item: loose wire contract; reconstruct_full_disposition pairs it with trusted entity context to build the strict EntityDispositionSchema, with a pessimistic fallback that prevents whole-record drops.

Schema-constraint cleanups (per review):

- Drop min_length on server-reconstructed EntityDisposition.entity_label / entity_value / protection_reason (these are built from trusted context, not raw model output; the bounds were redundant tripwires).
- Drop protection_reason max_length and instead cap a passthrough reason in the reconstructor (silent truncate to 500 chars) so a rambling reason can neither drop the record nor flow unbounded into the rewrite prompt/parquet.
- Drop the now-redundant min/max_length on PrivacyAnswerItem.reason (the _truncate_reason before-validator is the sole, always-applied guard).
- Keep the two list-level min_length=1 tripwires on the strict disposition containers: they never see raw model output and assert a real pipeline invariant (non-empty disposition when entities were detected).

Description cleanups: wire-loose str fields (domain, aspect, importance, category, sensitivity, protection_method_suggestion) now enumerate their valid values inline in Field(description=...), derived from the backing enum, instead of referencing an unresolvable Python path — the enum is absent from the JSON schema the model sees, so the description is its only source of truth.

Two small-model robustness gaps surfaced by Greptile on the disposition path:

- _pessimistic_fallback_disposition could still drop a row: with empty entity context it built an empty SensitivityDispositionSchema, which raises on the min_length=1 tripwire (and again in the downstream parser), and both call-sites in _reconstruct_full_disposition_column were unguarded. It now emits a single no-op (leave_as_is/low) disposition in that pipeline-invariant case, logging loudly — the row survives and the no-op never reaches the rewrite (excluded from protected_entities). Also removes the dead duplicate return branch.
- SimpleDispositionItem._coerce_scalar_to_str now coerces unexpected container types (list/dict) to "" instead of returning them unchanged, so one drifted field no longer fails the whole item (which would discard every disposition for the row and force a pessimistic fallback).

Adds regression tests for the empty-context no-op guarantee (function + column level) and container-value coercion.

greptile-apps · 2026-06-01T13:42:58Z

+    @field_validator("unit", mode="before")
+    @classmethod
+    def _coerce_unit(cls, v: object) -> str:
+        if v is None:
+            return ""
+        if isinstance(v, (int, float, bool)):
+            return str(v)
+        return v


_coerce_unit returns container values (list/dict) unchanged, so Pydantic raises a ValidationError for the whole MeaningUnitSchema item when a small model emits a list or dict for unit. The analogous validator on SimpleDispositionItem._coerce_scalar_to_str was fixed to return "" for containers — the same guard is needed here to keep a single drifted unit from dropping the entire meaning-units record.

Suggested change

-    @field_validator("unit", mode="before")
-    @classmethod
-    def _coerce_unit(cls, v: object) -> str:
-        if v is None:
-            return ""
-        if isinstance(v, (int, float, bool)):
-            return str(v)
-        return v
+    @field_validator("unit", mode="before")
+    @classmethod
+    def _coerce_unit(cls, v: object) -> str:
+        if v is None:
+            return ""
+        if isinstance(v, (int, float, bool)):
+            return str(v)
+        if not isinstance(v, str):
+            # Unexpected container (list/dict) from a drifted response: coerce
+            # to "" rather than letting pydantic raise on the whole MeaningUnitSchema.
+            return ""
+        return v
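
Since the same guard is needed in both validators, one option is a shared module-level helper that each before-validator delegates to. The function below is a sketch under that assumption; `coerce_scalar_to_str` as a free function is hypothetical and is not part of the PR as shown.

```python
def coerce_scalar_to_str(value: object) -> str:
    """Map drifted input to str: scalars are stringified, everything else becomes ""."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bool, int, float)):
        return str(value)
    # None, list, dict and any other unexpected type collapse to an empty string.
    return ""
```

Each validator body would then reduce to a single call, which keeps the two coercion paths from diverging again.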

lipikaramaswamy requested a review from a team as a code owner June 1, 2026 12:12

lipikaramaswamy force-pushed the lipikaramaswamy/feat/small-model-rewrite-schemas branch from 09d631c to 2ca00d4 Compare June 1, 2026 12:16

lipikaramaswamy mentioned this pull request Jun 1, 2026

feat: tolerate small-model drift on detection + replace schemas #174

Open

4 tasks

lipikaramaswamy changed the title ~~feat(rewrite): tolerate small-model drift on rewrite schemas + server-side disposition~~ feat(rewrite): small-model drift tolerance + server-side disposition Jun 1, 2026

greptile-apps Bot reviewed Jun 1, 2026


lipikaramaswamy force-pushed the lipikaramaswamy/feat/small-model-rewrite-schemas branch 2 times, most recently from 34004d2 to 5d6dd38 Compare June 1, 2026 12:26

lipikaramaswamy added 2 commits June 1, 2026 14:35

lipikaramaswamy force-pushed the lipikaramaswamy/feat/small-model-rewrite-schemas branch from 5d6dd38 to 1953d43 Compare June 1, 2026 13:37

greptile-apps Bot reviewed Jun 1, 2026


lipikaramaswamy marked this pull request as draft June 2, 2026 17:44

lipikaramaswamy wants to merge 2 commits into lipikaramaswamy/feat/small-model-detection-schemas from lipikaramaswamy/feat/small-model-rewrite-schemas
