fix(storage): reject path-traversal artifact ids at the path chokepoint#301
fix(storage): reject path-traversal artifact ids at the path chokepoint#301minion1227 wants to merge 2 commits into
Conversation
`Source.id` is locked to a hex sha256, but `Claim`, `Page`, `Entity`, `Relation`, `Evidence`, `Session`, and `Proposal` all carry a bare `id: str` with no validator. `_yaml` / `_page_path` turned those ids straight into filesystem paths with no containment check, so a poisoned id — e.g. `"../../tmp/x"` smuggled through a bundle body that `import_apply` lands under a safe filename but whose in-memory id is traversal — could escape `kb_dir` on the next put / update / lifecycle write, or read an arbitrary file on get. add `_reject_unsafe_id`, mirroring `bundle._unsafe_name_reason`: an id must be non-empty and free of nul bytes, path separators, and `..` components. route it through `_yaml` and `_page_path` — the single chokepoint every `_*_path` helper funnels through — so all artifact read/write paths are guarded at once. a complementary model-layer id validator (mirroring `Source._id_is_hex_sha256`) is left as a follow-up; the storage guard already closes every reach path, including raw-id reads a model validator cannot see. closes vouchdev#149
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds a storage-layer unsafe-ID validator, wires it into YAML, page, and source path construction, expands regression coverage for rejected and valid IDs, and adds a changelog note describing the path-traversal hardening. ChangesArtifact ID Path Traversal Guard
Estimated code review effort: 2 (Simple) | ~12 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/vouch/storage.py`:
- Around line 257-265: The storage path validation is incomplete because source
paths are still built from untrusted source IDs without rejection. Update the
source lookup flow in Storage, especially get_source(), read_source_content(),
and _source_dir(), to apply _reject_unsafe_id() to source_id before constructing
paths so traversal inputs cannot escape kb_dir. Keep the fix aligned with the
existing _yaml(), _page_path(), and _claim_path() validation pattern.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 868a3b3e-60b5-42f6-bee6-d17f55afff2f
📒 Files selected for processing (3)
CHANGELOG.mdsrc/vouch/storage.pytests/test_storage.py
address review on vouchdev#301: `_yaml` / `_page_path` were guarded, but `get_source` / `read_source_content` (and the evidence-ref existence checks) route raw `source_id` strings through `_source_dir` without it, so `get_source("../../outside")` could still escape kb_dir to `<outside>/meta.yaml` / `content`. `Source.id` is hex-locked on the model but these read paths take unvalidated strings. apply `_reject_unsafe_id` at the `_source_dir` chokepoint too, closing the last storage traversal path. add get_source / read_source_content rejection tests.
|
good catch — fixed in a5db33c. added |
What changed
Artifact ids are now rejected at the storage layer when they contain a path
separator,
.., a nul byte, or are empty. A new_reject_unsafe_idguardruns at the
_yaml/_page_pathchokepoint that every_*_pathhelperfunnels through, so all artifact read/write paths are covered at once.
Why
Source.idis locked to a hex sha256, butClaim,Page,Entity,Relation,Evidence,Session, andProposalall declare a bareid: strwith no validator, and_yamlturned that straight intokb_dir/<sub>/<id>.yamlwith no containment check. A poisoned id could thenescape
kb_dir:import_applyuses_safe_member_pathto keep each tar member's filename insidekb_dir,but does not check the
idfield inside the YAML body. A claim landedunder
claims/innocent.yamlwhoseidis"../../../tmp/pwned"readsback fine (safe filename), but the next
update_claim/supersede/contradictresolves_claim_path(claim.id)to a path outsidekb_dir.get_claim("../../../etc/hostname")resolvedstraight through
_yamlbefore this change.This is the same class of hole the tar-member fix (
bundle._safe_member_path,the CVE-2007-4559 fix) already closed on the import path —
_reject_unsafe_idmirrors its
_unsafe_name_reasonchecks at the storage chokepoint. closes #149.What might break
nothing legitimate. artifact ids are flat slugs / hex / uuids — none contain
/,\,.., or nul — so every real read/write is unaffected (the fullsuite, including bundle/sync/session round-trips, is green). the only calls
that now raise
ValueErrorare ones that were already unsafe. no on-diskformat, kb.* method, or audit-log shape changes.
a complementary model-layer id validator (mirroring
Source._id_is_hex_sha256) is deliberately left as a follow-up: the storageguard already closes every reach path, including raw-id reads a model
validator cannot see, and keeping it storage-only avoids overlapping the
model classes.
VEP
not a surface change — a storage-layer containment guard in the same shape
already merged for tar members. no VEP needed.
Tests
make check— ruff clean; mypy clean on touched files; pytest green(only pre-existing, environment-only failures remain:
fastapi/[web]not installed and embeddings installed, both of which also fail on
main)tests/test_storage.py:_yaml/_page_pathrejecttraversal / separator / absolute / empty ids (parametrised);
put_*,get_*, andput_pagereject traversal ids; the disk-smuggled-id →update_claimexploit chain is blocked before any write; and aregression guard that legitimate slug / hex ids still resolve
CHANGELOG.mdupdated under## [Unreleased]Summary by CodeRabbit
/or\), path traversal segments (./..), or NUL bytes are now blocked across claim and page operations, including updates.