perf(store): save metadata async #3298
Conversation
📝 Walkthrough
CachedStore gains asynchronous metadata persistence via a buffered write channel and a background writer goroutine. New SetMetadata/DeleteMetadata methods enqueue ops, with a synchronous fallback when the channel is full or the store is stopped. Close() now stops queueing, waits for the writer to drain, then clears caches and closes the underlying store.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant CachedStore
    participant Channel as Async Channel
    participant Writer as Writer Goroutine
    participant BaseStore
    Client->>CachedStore: SetMetadata(key, value)
    alt Channel has room & not stopped
        CachedStore->>Channel: Enqueue asyncWriteOp
        Channel-->>Writer: Receive op
        Writer->>BaseStore: BatchMetadata(ctx.Background(), puts/deletes)
        BaseStore-->>Writer: Success/Error
    else Channel full or stopped
        CachedStore->>BaseStore: SetMetadata(sync fallback)
        BaseStore-->>CachedStore: Result
    end
    CachedStore-->>Client: Return
```
```mermaid
sequenceDiagram
    participant Client
    participant CachedStore
    participant Channel as Write Channel
    participant Writer as Writer Goroutine
    participant BaseStore
    Client->>CachedStore: Close()
    CachedStore->>CachedStore: mark stopped, stop enqueueing
    CachedStore->>Channel: close channel
    CachedStore->>Writer: wait for writer to exit
    Writer->>Writer: drain queued operations
    loop For each remaining op
        Writer->>BaseStore: BatchMetadata(ctx.Background(), puts/deletes)
        BaseStore-->>Writer: Result
    end
    Writer-->>CachedStore: exit
    CachedStore->>CachedStore: clear caches
    CachedStore->>BaseStore: Close()
    BaseStore-->>CachedStore: closed
    CachedStore-->>Client: Return
```
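The walkthrough and diagrams above boil down to a small enqueue-with-fallback pattern. Below is a minimal Go sketch of that shape; the identifiers (asyncWriteOp, writeCh, stopped, metadataWriter) are illustrative assumptions, not the actual code in pkg/store/cached_store.go.

```go
package store

import (
	"context"
	"sync"
)

// metadataWriter stands in for the underlying store; assumed shape only.
type metadataWriter interface {
	SetMetadata(ctx context.Context, key string, value []byte) error
}

// asyncWriteOp is a queued metadata mutation (hypothetical shape).
type asyncWriteOp struct {
	key    string
	value  []byte // nil for a delete
	delete bool
}

type cachedStoreSketch struct {
	base    metadataWriter
	mu      sync.RWMutex
	stopped bool
	writeCh chan asyncWriteOp
}

// SetMetadata enqueues the write for the background writer goroutine and
// falls back to a synchronous write when the queue is full or the store
// has been stopped.
func (cs *cachedStoreSketch) SetMetadata(ctx context.Context, key string, value []byte) error {
	cs.mu.RLock()
	stopped := cs.stopped
	cs.mu.RUnlock()

	if !stopped {
		select {
		case cs.writeCh <- asyncWriteOp{key: key, value: value}:
			return nil // persisted later by the writer goroutine
		default:
			// queue full: fall through to the synchronous path below
		}
	}
	return cs.base.SetMetadata(ctx, key, value)
}
```

The non-blocking send means callers never wait on a full queue; they just pay the cost of one synchronous write instead.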
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Claude finished @julienrbrt's task in 2m 47s. Code Review:
Codecov Report: ❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff           @@
##             main    #3298   +/-   ##
=======================================
  Coverage   62.48%   62.48%
=======================================
  Files         122      122
  Lines       13047    13133    +86
=======================================
+ Hits         8152     8206    +54
- Misses       4009     4034    +25
- Partials      886      893     +7
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/store/cached_store.go (1)
91-98: ⚠️ Potential issue | 🟠 Major
Surface async metadata persistence failures via `Close()` or another observable path.
Enqueued metadata writes (`SetMetadata`, `DeleteMetadata`) return `nil` immediately. The background worker only logs failures, and the default logger is `zerolog.Nop()`, which silently discards errors. This means metadata persistence can fail undetected: no returned error, no emitted logs. Surface at least the first async writer error via `Close()` and make the logger configurable via a `CachedStoreOption`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/store/cached_store.go` around lines 91 - 98, The background writer currently drops async metadata errors (writes via writeCh from SetMetadata/DeleteMetadata) and uses zerolog.Nop() by default; modify CachedStore to accept a CachedStoreOption to set logger (replace zerolog.Nop()) and add an internal field to capture the first async write error (e.g., firstWriteErr protected by a mutex or a buffered error channel). Update the background worker that processes asyncWriteOp to record the first non-nil error into that field instead of only logging, close the worker and signal completion as before, and change Close() to wait for the worker to finish and return the captured error (if any). Ensure SetMetadata/DeleteMetadata still enqueue immediately but that callers can observe persistence failures via Close() (or a future exported accessor) and that the new option allows injecting a real logger for immediate visibility.
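For reference, a hedged sketch of the shape this comment asks for: record the first async write error, return it from Close(), and make the logger injectable. Names such as firstWriteErr, recordWriteErr, and WithLogger are hypothetical; only zerolog.Nop() and CachedStoreOption come from the comment itself.

```go
package store

import (
	"sync"

	"github.com/rs/zerolog"
)

// cachedStoreErrSketch shows only the error-surfacing pieces (hypothetical names).
type cachedStoreErrSketch struct {
	logger        zerolog.Logger
	errMu         sync.Mutex
	firstWriteErr error
	done          chan struct{} // closed when the writer goroutine has drained and exited
}

// CachedStoreOption configures optional behaviour, e.g. the logger.
type CachedStoreOption func(*cachedStoreErrSketch)

// WithLogger replaces the default zerolog.Nop() logger.
func WithLogger(l zerolog.Logger) CachedStoreOption {
	return func(cs *cachedStoreErrSketch) { cs.logger = l }
}

// recordWriteErr is called by the writer goroutine whenever a batch fails;
// it logs the failure and keeps the first error for Close() to return.
func (cs *cachedStoreErrSketch) recordWriteErr(err error) {
	if err == nil {
		return
	}
	cs.logger.Error().Err(err).Msg("async metadata write failed")
	cs.errMu.Lock()
	if cs.firstWriteErr == nil {
		cs.firstWriteErr = err
	}
	cs.errMu.Unlock()
}

// Close waits for the writer goroutine to finish, then surfaces the first
// async persistence failure (nil if everything was written).
func (cs *cachedStoreErrSketch) Close() error {
	<-cs.done
	cs.errMu.Lock()
	defer cs.errMu.Unlock()
	return cs.firstWriteErr
}
```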
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/store/cached_store_test.go`:
- Around line 332-338: The test currently waits for the last write via
require.Eventually before calling cs.Close(), which prevents exercising
Close()'s flush path; change the flow to enqueue all writes (the loop that
writes keys to cs), call cs.Close() immediately (without waiting), then assert
directly against the underlying base store (e.g., the test's base store variable
or method such as base.GetMetadata / base data structure) that the queued writes
were flushed after Close(); replace the require.Eventually check against
cs.GetMetadata with a post-Close assertion on base to verify Close() drained the
queue.
In `@pkg/store/cached_store.go`:
- Around line 222-235: SetMetadata and DeleteMetadata currently enqueue by
reference and fall back to synchronous writes when writeCh is full, causing
ordering, aliasing and silent-failure issues; fix by making the enqueue blocking
(remove the non-blocking default path so older queued ops always run first),
copy the value slice into the asyncWriteOp before sending (avoid slice
aliasing), and extend asyncWriteOp with a response channel (e.g., resp chan
error) so callers block on the resp and return any error produced by the
background processor; apply the same changes to DeleteMetadata and update the
background goroutine to send the operation result on resp after executing
Store.SetMetadata/Store.DeleteMetadata to preserve ordering and surface async
errors.
- Around line 222-235: CachedStore.SetMetadata enqueues the caller's []byte
directly which can be mutated after return; make a defensive copy of value
before sending on cs.writeCh (e.g. newSlice := append([]byte(nil), value...))
and set asyncWriteOp.value to that copy so the writer goroutine sees an
immutable snapshot; update the enqueue path in SetMetadata to copy value before
creating/sending asyncWriteOp (retain the direct sync path unchanged).
---
Outside diff comments:
In `@pkg/store/cached_store.go`:
- Around line 91-98: The background writer currently drops async metadata errors
(writes via writeCh from SetMetadata/DeleteMetadata) and uses zerolog.Nop() by
default; modify CachedStore to accept a CachedStoreOption to set logger (replace
zerolog.Nop()) and add an internal field to capture the first async write error
(e.g., firstWriteErr protected by a mutex or a buffered error channel). Update
the background worker that processes asyncWriteOp to record the first non-nil
error into that field instead of only logging, close the worker and signal
completion as before, and change Close() to wait for the worker to finish and
return the captured error (if any). Ensure SetMetadata/DeleteMetadata still
enqueue immediately but that callers can observe persistence failures via
Close() (or a future exported accessor) and that the new option allows injecting
a real logger for immediate visibility.
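Taken together, the two inline comments above about lines 222-235 could look roughly like the sketch below: copy the value defensively, enqueue with a blocking send so ordering is preserved, and wait on a per-op response channel so the writer's error reaches the caller. Every identifier here is an illustrative assumption, not the PR's code.

```go
package store

import "context"

// asyncWriteOpWithResp adds the response channel the review suggests (hypothetical shape).
type asyncWriteOpWithResp struct {
	key    string
	value  []byte
	delete bool
	resp   chan error // the writer goroutine reports the outcome here
}

type orderedWriterSketch struct {
	writeCh chan asyncWriteOpWithResp
}

// SetMetadata copies the value, enqueues it with a blocking send so older
// ops run first, then waits for the writer goroutine's result.
func (cs *orderedWriterSketch) SetMetadata(ctx context.Context, key string, value []byte) error {
	// Defensive copy: the caller may reuse or mutate its slice after we return.
	valCopy := append([]byte(nil), value...)

	op := asyncWriteOpWithResp{
		key:   key,
		value: valCopy,
		resp:  make(chan error, 1),
	}

	// Blocking send: no non-blocking default case, so queued ops keep their order.
	select {
	case cs.writeCh <- op:
	case <-ctx.Done():
		return ctx.Err()
	}

	// Surface the async error to the caller instead of returning nil immediately.
	select {
	case err := <-op.resp:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```

Blocking on resp makes the call synchronous from the caller's perspective, so this trades away some of the PR's latency win for ordering and error visibility; that trade-off is the maintainer's call.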
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 00d3f79c-48ff-4ac8-bc17-c89e521974e8
📒 Files selected for processing (2)
- pkg/store/cached_store.go
- pkg/store/cached_store_test.go
```go
// Wait for the last key to land via pass-through read
require.Eventually(t, func() bool {
	v, err := cs.GetMetadata(ctx, string([]byte{byte(n - 1)}))
	return err == nil && len(v) == 1 && v[0] == byte(n-1)
}, 2*time.Second, 10*time.Millisecond)

require.NoError(t, cs.Close())
```
This no longer exercises the Close() flush path.
require.Eventually(...) waits until the last write is already visible before Close() runs, so the test passes even if Close() stops draining queued ops. To verify the new contract, enqueue the writes, call Close() immediately, then assert on base after close.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/store/cached_store_test.go` around lines 332 - 338, The test currently
waits for the last write via require.Eventually before calling cs.Close(), which
prevents exercising Close()'s flush path; change the flow to enqueue all writes
(the loop that writes keys to cs), call cs.Close() immediately (without
waiting), then assert directly against the underlying base store (e.g., the
test's base store variable or method such as base.GetMetadata / base data
structure) that the queued writes were flushed after Close(); replace the
require.Eventually check against cs.GetMetadata with a post-Close assertion on
base to verify Close() drained the queue.
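A rough sketch of the reordered test flow this prompt describes; cs, base, ctx, and n are assumed from the existing test, and base.GetMetadata is only the illustrative accessor the prompt itself mentions:

```go
// Enqueue all writes without waiting for them to become visible.
for i := 0; i < n; i++ {
	require.NoError(t, cs.SetMetadata(ctx, string([]byte{byte(i)}), []byte{byte(i)}))
}

// Close immediately: the contract under test is that Close() drains the queue.
require.NoError(t, cs.Close())

// Assert against the underlying base store after Close(), not via the cache.
for i := 0; i < n; i++ {
	v, err := base.GetMetadata(ctx, string([]byte{byte(i)}))
	require.NoError(t, err)
	require.Equal(t, []byte{byte(i)}, v)
}
```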
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/store/cached_store.go`:
- Around line 126-136: The conversion from ops to puts/deletes drops per-key
ordering so a Delete then Set for the same key can end up applied as delete; fix
by collapsing ops per key to the last mutation before calling BatchMetadata:
iterate the ops slice (from first to last), maintain a map[string]opInfo (or
simple marker) keyed by o.key that records the final operation and value, then
after the pass produce the puts []MetadataKV and deletes []string from that map
(for a final Set produce a put, for a final Delete produce a delete) and call
cs.Store.BatchMetadata with those; update the code around ops, MetadataKV, and
cs.Store.BatchMetadata to use this coalescing logic.
- Around line 117-123: The drain loop that reads from cs.writeCh should detect a
closed channel using the comma-ok form instead of a plain receive to avoid
receiving zero-value asyncWriteOp forever; update the select case in the loop
(the one labeled drain that currently does `case op := <-cs.writeCh:`) to `case
op, ok := <-cs.writeCh:` and if !ok break drain, otherwise append op to ops;
apply the same change to the other symmetric drain at the other location so the
goroutine can exit and the done channel can be closed on shutdown.
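Both of the comments above can be illustrated in one sketch: a comma-ok receive so the drain loop notices channel closure, and per-key coalescing so only the last mutation for each key reaches BatchMetadata. The metaOp type, the MetadataKV field names, and the batchStore interface are assumptions; in the real code the receive sits inside a select, where the same `op, ok := <-cs.writeCh` form applies.

```go
package store

import "context"

// Hypothetical shapes mirroring the review comments, not the PR's actual types.
type metaOp struct {
	key    string
	value  []byte
	delete bool
}

type MetadataKV struct {
	Key   string
	Value []byte
}

type batchStore interface {
	BatchMetadata(ctx context.Context, puts []MetadataKV, deletes []string) error
}

// drainAndFlush reads until writeCh is closed, collapses ops to the last
// mutation per key, and applies them in one batch.
func drainAndFlush(base batchStore, writeCh <-chan metaOp) error {
	var ops []metaOp
	for {
		op, ok := <-writeCh // comma-ok: ok is false once the channel is closed and empty
		if !ok {
			break
		}
		ops = append(ops, op)
	}

	// Coalesce per key: a later Set must win over an earlier Delete (and vice versa).
	last := make(map[string]metaOp, len(ops))
	order := make([]string, 0, len(ops))
	for _, op := range ops {
		if _, seen := last[op.key]; !seen {
			order = append(order, op.key)
		}
		last[op.key] = op
	}

	var puts []MetadataKV
	var deletes []string
	for _, k := range order {
		if op := last[k]; op.delete {
			deletes = append(deletes, k)
		} else {
			puts = append(puts, MetadataKV{Key: k, Value: op.value})
		}
	}
	return base.BatchMetadata(context.Background(), puts, deletes)
}
```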
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 59e07316-b8a3-4aed-bf6e-5de11a84ea60
📒 Files selected for processing (7)
- CHANGELOG.md
- pkg/store/cached_store.go
- pkg/store/store.go
- pkg/store/tracing.go
- pkg/store/tracing_test.go
- pkg/store/types.go
- test/mocks/store.go
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.md
Overview
Improve the store by letting SetMetadata happen asynchronously (used in the submitter and cache).
Metadata is still crucial, as we save the last included height, some cache backup and pruning info.
However, we can queue the operation when a component is limited by heavy writes (the fast submitter competes with the executor for DB locks).
`pkill -9` could leave the metadata in a weird state, but the system should recover on its own with just a little extra work (double submission is fine as the syncer can handle it; pruning nonexistent blocks is fine as the pruner can handle it).