Close the firewall redaction fail-open at the max-depth boundary #149

@dgenio

Summary

Strengthen firewall/redaction.py so that data nested below the configured
max_depth is not returned to the LLM unredacted. Today recursion stops at the depth
cap and returns the remaining subtree verbatim.

Why this matters

The Firewall is the I-01 boundary: raw tool output must never reach the LLM without
passing a redaction step. A fail-open at depth means PII, secrets, or connection
strings nested four-or-more levels deep in a tool result flow straight through. For a
security kernel this is the highest-severity class of bug because it is silent and
contradicts the documented guarantee in docs/security.md.

Current evidence

  • firewall/redaction.py: redact() begins with "if depth >= max_depth: return data, warnings", so the subtree is returned without scanning strings or sensitive field names.
  • firewall/budgets.py: Budgets.max_depth defaults to 3.
  • firewall/transform.py calls redact(data, max_depth=self._budgets.max_depth) for every response mode.
  • docs/security.md lists "PII / PCI leakage → Redaction + allowed_fields enforcement in the firewall" as a mitigation with no depth caveat.
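
The fail-open described above can be reproduced with a small stand-in. This is a sketch, not the project's code: redact_fail_open, EMAIL_RE, and the payload are hypothetical, and only the early return at the depth cap mirrors what the evidence describes.

```python
import re

# Simplified email pattern, for illustration only (assumption).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_fail_open(data, max_depth=3, depth=0, warnings=None):
    """Toy stand-in for the reported behavior; not the real redact()."""
    warnings = [] if warnings is None else warnings
    if depth >= max_depth:
        # Fail-open: the remaining subtree is returned without any scanning.
        return data, warnings
    if isinstance(data, dict):
        scanned = {
            key: redact_fail_open(value, max_depth, depth + 1, warnings)[0]
            for key, value in data.items()
        }
        return scanned, warnings
    if isinstance(data, list):
        scanned = [
            redact_fail_open(value, max_depth, depth + 1, warnings)[0]
            for value in data
        ]
        return scanned, warnings
    if isinstance(data, str):
        return EMAIL_RE.sub("[REDACTED]", data), warnings
    return data, warnings


# The email sits inside the subtree reached at depth 3, so it is never scanned.
payload = {"a": {"b": {"c": {"email": "alice@example.com"}}}}
out, _ = redact_fail_open(payload)
```

With the default cap of 3, the same email at the top level is redacted, while the nested copy comes back verbatim and no warning is recorded.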

External context

Indirect prompt-injection and data-leakage research treats the tool-output boundary as
the control point; a redaction layer that stops at fixed depth is a known evasion path.

Proposed implementation

  1. Decide the safe-by-default behavior at depth limit: rather than returning the raw
    subtree, either (a) drop it and append a warning, or (b) replace it with a
    "[TRUNCATED: depth limit]" sentinel. Option (b) preserves shape; both are safe.
  2. Keep string-pattern redaction running at the boundary even when structural recursion
    stops, so leaf strings are still scanned.
  3. Make the choice explicit in Budgets (e.g., on_max_depth: "drop" | "truncate").
  4. Update docs/security.md and docs/context_firewall.md to state the behavior.
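
One way steps 1 to 3 could fit together is sketched below. Everything here is an assumption made for illustration: the Budgets dataclass shape, the _walk helper, the SENSITIVE_KEYS set, the simplified email pattern, and the warning text are hypothetical, and the real redact() takes max_depth directly rather than a Budgets object. Leaf strings are scanned before the depth check, so they are still redacted at the boundary; containers at the cap are never descended into, which keeps the cost of a wide dict at the boundary proportional to its width.

```python
import re
from dataclasses import dataclass
from typing import Any, List, Literal, Tuple

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # simplified, illustrative
SENSITIVE_KEYS = {"password", "secret", "token"}    # hypothetical field list
TRUNCATED = "[TRUNCATED: depth limit]"
_DROP = object()  # internal marker telling the parent to omit this child


@dataclass(frozen=True)
class Budgets:
    max_depth: int = 3
    on_max_depth: Literal["drop", "truncate"] = "truncate"


def _walk(data: Any, budgets: Budgets, depth: int, warnings: List[str]) -> Any:
    # Step 2: strings are scanned at every depth, including the boundary.
    if isinstance(data, str):
        return EMAIL_RE.sub("[REDACTED]", data)
    # Other scalars (int, float, bool, None) pass through; an assumption.
    if not isinstance(data, (dict, list)):
        return data
    # Step 1: at the cap, never return the raw subtree.
    if depth >= budgets.max_depth:
        warnings.append(
            f"max_depth={budgets.max_depth} reached; action={budgets.on_max_depth}"
        )
        return TRUNCATED if budgets.on_max_depth == "truncate" else _DROP
    if isinstance(data, dict):
        out = {}
        for key, value in data.items():
            if str(key).lower() in SENSITIVE_KEYS:
                out[key] = "[REDACTED]"
                continue
            child = _walk(value, budgets, depth + 1, warnings)
            if child is not _DROP:
                out[key] = child
        return out
    children = (_walk(value, budgets, depth + 1, warnings) for value in data)
    return [child for child in children if child is not _DROP]


def redact(data: Any, budgets: Budgets = Budgets()) -> Tuple[Any, List[str]]:
    warnings: List[str] = []
    result = _walk(data, budgets, 0, warnings)
    # A container at depth 0 with max_depth=0 in "drop" mode becomes None.
    return (None if result is _DROP else result), warnings
```

In this sketch "truncate" keeps the key and replaces the subtree with the sentinel, while "drop" removes the key entirely; both record one warning per subtree cut at the cap.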

AI-agent execution notes

  • Inspect first: firewall/redaction.py, firewall/budgets.py, firewall/transform.py, tests/test_redaction.py, tests/test_firewall_boundary.py.
  • This file is security-grade; preserve all existing redaction paths, only close the gap.
  • Edge cases: cyclic structures are not expected (JSON), but very wide dicts at the boundary should not explode cost; recursion must still terminate.
  • Do not change max_depth's default value without separate discussion.

Acceptance criteria

  • A nested payload with an email or secret below max_depth comes back redacted ([REDACTED]), truncated, or dropped; the raw value is never returned.
  • A warning is emitted describing the depth-limit action.
  • tests/test_firewall_boundary.py gains a negative assertion that a deep secret string never appears in any Frame.
  • Existing redaction tests still pass.

Test plan

Add deep-nesting cases to tests/test_redaction.py and tests/test_firewall_boundary.py
(fake secrets at depth 4–6). Run full make ci.
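
The deep-nesting cases could be built with helpers along these lines. This is a sketch under assumptions: nest, leaks, check_no_deep_leak, and FAKE_SECRET are hypothetical names, and the redact argument is any callable that returns a (data, warnings) pair, matching the return shape quoted under Current evidence. The real tests would pass the project's own redact() and, for the boundary test, inspect each Frame instead.

```python
import json

# Obviously fake value; it only needs to be recognizable in the output.
FAKE_SECRET = "sk-test-0000-not-a-real-key"


def nest(leaf, depth):
    """Wrap leaf in `depth` dict levels, outermost key first."""
    for level in range(depth, 0, -1):
        leaf = {f"level_{level}": leaf}
    return leaf


def leaks(result, secret=FAKE_SECRET):
    """True if the secret appears anywhere in the serialized result."""
    return secret in json.dumps(result, default=str)


def check_no_deep_leak(redact, depths=(4, 5, 6)):
    """Negative assertion: a deep secret must never survive redaction."""
    for depth in depths:
        payload = nest({"api_key": FAKE_SECRET}, depth)
        result, warnings = redact(payload)
        assert not leaks(result), f"secret leaked at depth {depth}"
        assert warnings, f"no depth-limit warning at depth {depth}"
```

Serializing the whole result and searching for the secret keeps the assertion independent of whether the fix drops, truncates, or redacts the subtree.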

Documentation plan

Update docs/security.md redaction row and docs/context_firewall.md depth section;
CHANGELOG Fixed/Security entry.

Migration and compatibility notes

Behavior change for deeply nested outputs (previously leaked, now redacted/truncated).
Document under Security; not expected to require caller code changes.

Risks and tradeoffs

Truncating deep structure may drop data some callers relied on seeing; the handle
still references the full dataset for authorized expansion. Safety outweighs the
convenience cost.

Suggested labels

security, reliability, testing
