Skip to content

feat(node): peer partial-mirrors for repos with private subtrees#35

Open
beardthelion wants to merge 13 commits into
Gitlawb:mainfrom
beardthelion:feat/peer-partial-mirrors
Open

feat(node): peer partial-mirrors for repos with private subtrees#35
beardthelion wants to merge 13 commits into
Gitlawb:mainfrom
beardthelion:feat/peer-partial-mirrors

Conversation

@beardthelion

@beardthelion beardthelion commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Restores replication redundancy for repos that have a private subtree (mode B). After Phase 2 made the sync worker fail closed on any repo with withheld content, those repos only ever lived on their origin node. This lets a peer mirror the public part of such a repo without ever receiving the private blobs.

The sync worker now asks the origin (anonymously) which paths are withheld via the existing withheld-paths endpoint, then mirrors accordingly:

  • private subtree present -> clone as a git promisor (--filter=blob:limit=10g), which keeps every public object and tolerates the blobs the origin omits for an anonymous caller
  • fully public -> plain git clone --mirror / full fetch, exactly as before
  • 404 or lookup error -> plain mirror attempt; a fully-private repo also 404s the git read endpoint, so the clone fails and nothing is mirrored

The classify step is just an optimization for picking promisor mode. The git layer is the real enforcement: a plain mirror of a mode-B repo rejects the origin's intentionally-incomplete pack, so even if the lookup fails transiently the worst case is a failed sync that retries, never a leak or a corrupt mirror. The promisor mirror physically lacks the private blobs, so re-serving it can't leak them even though the mirror node doesn't hold the origin's visibility rules.

Fetches now go through the configured origin remote (refreshing its URL first) so stored promisor settings are honored on update.

Depends on the withheld-paths endpoint from #33, so this is branched off that work; the diff will show #33's files until it merges, after which I'll rebase onto main. Independent of #28/#34.

Test plan

  • Unit: classify_mirror over 200+globs, 200+empty, and 404/error.
  • Integration (against a local file:// bare origin): promisor clone ends up promisor=true, mirror=true, and retains all objects; plain clone is not a promisor; promisor fetch pulls new commits into an existing mirror.
  • Full suite: 96 tests pass, clippy and fmt clean.

Summary by CodeRabbit

  • New Features
    • Added gl clone command to produce partial/promisor clones that exclude withheld paths and re-include permitted subpaths.
    • Added GET /api/v1/repos/{owner}/{repo}/withheld-paths to report withheld and reincluded path globs for the caller.
    • Sync worker now auto-detects promisor mirror mode and adjusts clone/fetch behavior for efficient partial syncs.
  • Tests
    • Added unit tests for withheld/reincluded glob logic and promisor vs plain clone/fetch behaviors.

Three fixes from the PR Gitlawb#33 review:

- withheld_paths now applies the whole-repo "/" read gate (returns
  repo-not-found when the caller cannot read the root), matching the git read
  endpoints. Without it the endpoint disclosed a private repo's existence and
  path layout to unauthorized callers. The withheld_globs doc already assumed
  this gate existed; now it does.

- A nested allow under a denied parent (e.g. "/secret/public/**" allowed,
  "/secret/**" denied) was over-withheld: the client sparse-excluded the whole
  parent and hid paths the caller may read. The endpoint now also returns a
  "reinclude" list (allowed globs strictly under a denied one) and gl clone
  re-includes them in the sparse spec after the excludes.

- Wildcard-free globs like "/docs/private" match both the exact path and a
  subtree (per glob_matches), but the client only emitted the subtree exclude.
  sparse_patterns now emits both "/docs/private" and "/docs/private/".

Verified the exclude-then-reinclude sparse ordering checks out cleanly with
real git, plus unit tests for reincluded_globs, the nested re-include, the
exact-path exclude, and sparse_patterns.
…eld-path errors

split_once('/') accepted owner/name/extra, smuggling a path segment into
the repo name that then flowed into the API path and clone URL; reject it.

fetch_withheld swallowed every network/auth/5xx/JSON error into an empty
result, dropping to a stock clone that the node refuses once blobs are
withheld. Now only 404/501 (endpoint unsupported) fall back to empty; the
rest propagate so the real cause surfaces.
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e987f22e-5709-4762-8ed4-f2ef13db41ed

📥 Commits

Reviewing files that changed from the base of the PR and between fc05ae0 and 25390bf.

📒 Files selected for processing (2)
  • crates/gitlawb-node/src/sync.rs
  • crates/gl/src/clone.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/gitlawb-node/src/sync.rs
  • crates/gl/src/clone.rs

📝 Walkthrough

Walkthrough

Implements end-to-end partial clone support: visibility helpers compute withheld/reinclude sparse-checkout globs, a new API exposes them, the sync worker classifies promisor mirrors and applies git partial-clone config, and a new gl clone command performs sparse client-side clones honoring node-held paths.

Changes

Partial Clone and Sparse Checkout Support

Layer / File(s) Summary
Visibility helpers for sparse checkout patterns
crates/gitlawb-node/src/visibility.rs
withheld_globs and reincluded_globs functions compute denied and nested-allowed path glob patterns by probing VisibilityRule representatives with visibility_check, with unit tests validating behavior for owners and non-owners.
Withheld-paths API endpoint
crates/gitlawb-node/src/api/visibility.rs, crates/gitlawb-node/src/server.rs
New GET /api/v1/repos/{owner}/{repo}/withheld-paths endpoint returns withheld and reinclude glob arrays, performs whole-repo read authorization to prevent repo existence disclosure, and delegates to the visibility helpers.
Sync worker promisor mirror support
crates/gitlawb-node/src/sync.rs
MirrorMode enum and classify_mirror() determine mode from withheld globs; sync loop creates a shared HTTP client, queries withheld-paths per repo, and passes mode to clone_repo/fetch_repo which apply promisor-specific git configuration (blob filters, partial-clone flags) and include tests for promisor mirror creation/updates and transitions.
GL clone subcommand structure
crates/gl/src/main.rs, crates/gl/src/clone.rs
New gl clone subcommand with CloneArgs struct, module wiring, and command dispatch routing to clone::run.
GL clone implementation
crates/gl/src/clone.rs
git/git_global command helpers, sparse_patterns glob-to-pattern conversion, setup_partial_clone core cloning logic with conditional promisor mode and non-cone sparse-checkout setup, parse_repo URL normalization, fetch_withheld API querying with optional signing, and run entrypoint orchestrating the flow.
GL clone tests
crates/gl/src/clone.rs
Test utilities and comprehensive coverage: withheld path exclusion, nested allowed glob re-inclusion, alternating specificity, exact-path glob precision, sparse-checkout pattern mapping, response schema validation, and parse_repo tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • Gitlawb/node#25: The new /withheld-paths endpoint and glob helpers build on the path-scoped visibility_check and VisibilityRule model introduced earlier.

Suggested reviewers

  • kevincodex1

Poem

🐰 I fetched the globs the node would hide,
Wove sparse patterns where secrets abide,
Promisor mirrors gently skip the rest,
While clean clones wear their trimmed-down vest—
Hooray for tidy checkouts, simple and spry!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: adding support for peer partial-mirrors (promisor clones) for repositories with private subtrees, which is the primary feature introduced across all modified files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/gitlawb-node/src/sync.rs`:
- Around line 66-72: The shared reqwest client is created with
reqwest::Client::new() and can hang the sync worker; create the Client with a
default timeout (e.g. via Client::builder().timeout(...).build()) and also wrap
the call in fetch_withheld() with tokio::time::timeout(...) to bound individual
/withheld-paths requests; additionally fix Promisor→Plain rehydration in
classify_mirror()/fetch_repo(): when classify_mirror() yields MirrorMode::Plain
ensure fetch_repo clears remote.origin.promisor and
remote.origin.partialclonefilter (remove or set to None/empty) so an existing
promisor mirror is converted to a full/plain mirror, and add unit/integration
tests that transition a repo from Promisor to Plain to assert those fields are
cleared and rehydration completes.
- Around line 235-257: fetch_repo currently only sets promisor config when mode
== MirrorMode::Promisor but never clears it when classify_mirror determines the
repo moved from Promisor → Plain; update fetch_repo to detect that transition
(use the computed mode from classify_mirror) and when switching to Plain
explicitly unset remote.origin.promisor and remove
remote.origin.partialclonefilter (run git config --unset or set to false/empty
for remote.origin.promisor and git config --unset for
remote.origin.partialclonefilter), then ensure the mirror is rehydrated (e.g.,
trigger a reclone or run the rehydrate logic you use elsewhere) so that stored
promisor state does not leave the mirror incomplete; reference functions/keys:
fetch_repo, classify_mirror, and git config keys remote.origin.promisor and
remote.origin.partialclonefilter.

In `@crates/gitlawb-node/src/visibility.rs`:
- Around line 122-169: reincluded_globs currently collects two unordered buckets
(denied then allowed) which loses nesting order; change its contract to produce
an ordered sequence that preserves rule specificity (e.g., return a Vec<(String,
bool)> or Vec<PatternOp> where bool=true means include and false means exclude)
instead of only allowed globs, implement the logic by iterating over the
original rules in order and for each non-root rule computing Decision via
visibility_check(glob_prefix(&r.path_glob)) and emitting the corresponding
(r.path_glob.clone(), Decision::Allow) or (r.path_glob.clone(), Decision::Deny)
entry so the final sparse-checkout operations maintain the same ordering as the
rules, and add a regression test that applies the three-level scenario
("/secret/**" deny, "/secret/public/**" allow, "/secret/public/admin/**" deny)
to assert the emitted sequence/materialized patterns do not re-include the
deepest denied path.

In `@crates/gl/src/clone.rs`:
- Around line 133-145: In the branch==None clone path, stop calling and parsing
`git remote show origin`; instead read the local ref file
`refs/remotes/origin/HEAD` under the cloned repo (use `dest`), resolve it to the
target branch name and pass that to `git(dest, &["checkout", "-q", &head])` so
default-branch detection is local and robust (replace the Command::new("git")...
block). In `fetch_withheld` (the JSON parsing path that currently uses
`unwrap_or_default()` on `withheld`/`reinclude`), validate that
`Value::get("withheld")` and `Value::get("reinclude")` exist and are arrays (or
return a clear error) and parse their string entries explicitly instead of
defaulting to empty arrays; fail-fast on schema/type mismatches to preserve the
contract described in the doc comment.
- Around line 192-206: fetch_withheld currently tolerates missing/wrong-typed
"withheld" or "reinclude" by falling back to empty Vec, which can silently force
a full clone; update fetch_withheld to validate the response schema and fail on
mismatch: instead of the permissive globs closure, extract both fields from the
parsed JSON and attempt to deserialize each into Vec<String> (e.g. via
serde_json::from_value or .as_array() plus string-checks) and return an error
(with context) if a field is missing or not an array of strings; keep references
to the same symbols (fetch_withheld, the "withheld"/"reinclude" fields and the
behavior consumed by setup_partial_clone) so callers get an Err on schema
mismatch rather than treated as empty lists.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: d37c05af-5217-4263-a8a3-93d66a13fe1b

📥 Commits

Reviewing files that changed from the base of the PR and between 6abaf1d and 516e3dd.

📒 Files selected for processing (6)
  • crates/gitlawb-node/src/api/visibility.rs
  • crates/gitlawb-node/src/server.rs
  • crates/gitlawb-node/src/sync.rs
  • crates/gitlawb-node/src/visibility.rs
  • crates/gl/src/clone.rs
  • crates/gl/src/main.rs

Comment thread crates/gitlawb-node/src/sync.rs Outdated
Comment thread crates/gitlawb-node/src/sync.rs Outdated
Comment thread crates/gitlawb-node/src/visibility.rs
Comment thread crates/gl/src/clone.rs
Comment thread crates/gl/src/clone.rs Outdated
Add a regression test for deny /secret, allow /secret/public, deny
/secret/public/admin and clarify the sparse-checkout comment. git does
not re-traverse an explicitly excluded directory, so emitting all
excludes before re-includes keeps the deepest deny in force; the broader
parent re-include does not resurrect it.
…hema

Read the default branch from the local origin/HEAD symref clone already
set, instead of parsing the localized, network-dependent output of
git remote show origin. Deserialize the withheld-paths body into a typed
struct so a missing or mistyped withheld/reinclude field is a hard error
rather than silently becoming an empty list, which would mask a server
regression behind a later clone failure.
Give the sync worker's HTTP client a 30s timeout so a stalled peer
withheld-paths lookup cannot hang the worker loop. When a repo that was
mirrored as a promisor (mode B) becomes fully public, fetch_repo now
clears remote.origin.promisor and partialclonefilter and refetches, so
the once-withheld blobs are backfilled instead of the mirror staying
permanently partial.
@beardthelion beardthelion force-pushed the feat/peer-partial-mirrors branch from fc05ae0 to 25390bf Compare June 10, 2026 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant