feat(service): add layer cache for cross-mount dedup#40
Conversation
📊 Code Coverage Report
📦 Per-package breakdown |
There was a problem hiding this comment.
Code Review
This pull request upgrades the Go environment to version 1.25, updates several core dependencies, and introduces a new LayerCache system to optimize model pulls. The cache implements layer-level singleflighting and cross-volume deduplication using hardlinks, with state persisted via layers.json files. Review feedback identifies a critical path traversal vulnerability when resolving layer filepaths from OCI annotations. Additionally, there are concerns regarding the performance impact of frequent synchronous disk I/O for cache persistence, potential thundering herd issues in the concurrency model during context cancellation, and the functional limitations of hardlinking across different filesystems in CSI environments.
| if rel == "" { | ||
| return "" | ||
| } | ||
| return filepath.Join(h.targetDir, rel) |
| import ( | ||
| "context" | ||
| "path/filepath" | ||
| "sync" |
| return errors.Wrap(err, "remove stale hardlink target") | ||
| } | ||
| if err := os.Link(src, dst); err != nil { | ||
| return errors.Wrap(err, "create hardlink") |
There was a problem hiding this comment.
os.Link creates a hard link, which is restricted to the same filesystem. In CSI environments, volumes (PVCs) are often backed by different devices or filesystems. Hard linking across them will fail with EXDEV, causing the cache to fall back to a full pull. This significantly limits the effectiveness of cross-volume deduplication unless all volumes share the same underlying host filesystem.
| } | ||
| c.mu.Unlock() | ||
|
|
||
| if err := writeLayersFile(volDir, snapshot); err != nil { |
There was a problem hiding this comment.
writeLayersFile is called synchronously for every layer published (including cache hits). This results in frequent and redundant disk I/O for layers.json during a model pull. Consider batching these updates or persisting the mapping once after the entire pull operation is complete to improve performance.
| select { | ||
| case <-ctx.Done(): | ||
| cond.L.Lock() | ||
| cond.Broadcast() |
There was a problem hiding this comment.
Calling cond.Broadcast() on context cancellation wakes up all pullers waiting for this digest, creating a thundering herd problem. Each puller will contend for the lock only to potentially go back to sleep if their own context is still active. A more efficient approach would be to use a per-waiter channel for targeted notification.
506c56c to
d9fcc54
Compare
Signed-off-by: chlins <chlins.zhang@gmail.com>
This pull request introduces a new layer cache hook system to improve layer deduplication and persistence in the model CSI driver, alongside several dependency and tooling updates. The most significant changes are the addition of the
layercache_hookimplementation and its comprehensive tests, as well as updates to Go versioning and dependencies to ensure compatibility and improved performance.Layer cache hook implementation and testing:
pkg/service/layercache_hook.go, which implements acombinedHookandlayerCacheHookto manage layer deduplication, cache hits, and persistence across volumes. This ensures that all volumes maintain accurate cache state, even after owner deletion or restarts.pkg/service/layercache_hook_test.gowith thorough unit tests for the new hook logic, covering cache hits, ownership transfer, deduplication persistence, and edge cases.Go version and dependency upgrades:
go.mod,build/Dockerfile, and GitHub Actions workflows for both coverage and lint jobs, ensuring the project uses the latest language features and security updates. [1] [2] [3]go.mod, includinggo-git,modctl,go.opentelemetry.io/otel, and others, to newer versions for improved stability, performance, and compatibility. [1] [2] [3] [4]CI/CD and tooling improvements:
golangci-lint-actionversion in.github/workflows/lint.ymlfor better linting performance and reliability.Test and interface adjustments:
pkg/server/server_test.goto support the new layer cache hook interface and its additional parameters. [1] [2]These changes collectively enhance the reliability, maintainability, and efficiency of layer handling in the driver, while keeping the codebase up-to-date with the latest Go ecosystem advancements.