
perf(store): save metadata async #3298

Open
julienrbrt wants to merge 6 commits into main from julien/async-metadata

Conversation

@julienrbrt julienrbrt commented Apr 28, 2026

Overview

Improve the store by letting SetMetadata happen asynchronously (used in the submitter and cache).
Metadata is still crucial, as we save the last included height, some cache backup, and pruning info.
However, we can queue the operation when a component is limited by heavy writes (the fast submitter competes with the executor for db locks).

A pkill -9 could leave the metadata in a weird state, but the system should auto-recover with just a little extra work (double submission is fine, as the syncer can handle it; pruning nonexistent blocks is fine, as the pruner can handle it).


AI summary:

Here's a summary of the change:
Problem: All components under block/ share the same Badger-backed store. When the submitter or pruner writes metadata (SetMetadata/DeleteMetadata), they block Badger's write pipeline, slowing down the executor/syncer's critical batch commits (SaveBlockData + SetHeight + UpdateState + Commit).
Solution: Async metadata writes in CachedStore (pkg/store/cached_store.go):
- SetMetadata and DeleteMetadata now enqueue operations to a buffered channel (8192 slots) instead of writing synchronously
- A background goroutine (auto-started in NewCachedStore) drains the channel and persists writes to the underlying store
- If the buffer is full, writes fall back to synchronous (backpressure)
- Close() drains all pending writes before closing the underlying store
- All other operations (reads, batches, pruning, rollback) remain synchronous pass-throughs
Why this works:
- Executor/Syncer use NewBatch/Commit for block writes — these remain synchronous and are never blocked by metadata writes
- Submitter's processDAInclusionLoop does 3-4 SetMetadata calls per height (DA inclusion height, height-to-DA mappings) — now async, non-blocking
- Cache's setLastSubmittedHeight writes after every DA submission — now async
- Pruner's metadata writes — now async. Heavy operations (PruneBlocks, DeleteStateAtHeight) remain sync
- No changes to components.go — CachedStore is already the wrapper used everywhere

Summary by CodeRabbit

  • New Features
    • Improved metadata write performance through optimized processing
    • Support for atomic batch metadata updates ensuring transactional consistency
    • Enhanced visibility for metadata operations via distributed tracing


coderabbitai Bot commented Apr 28, 2026

Warning

Rate limit exceeded

@julienrbrt has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 23 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 081a8508-cb46-4128-8b5b-58bf4de6364f

📥 Commits

Reviewing files that changed from the base of the PR and between 6c53bc4 and 2d3dfd7.

📒 Files selected for processing (2)
  • pkg/store/cached_store.go
  • pkg/store/cached_store_test.go
📝 Walkthrough

Walkthrough

CachedStore gains asynchronous metadata persistence via a buffered write channel and background writer goroutine. New SetMetadata/DeleteMetadata methods enqueue ops with synchronous fallback when full or stopped. Close() now stops queueing, waits for the writer to drain, then clears caches and closes the underlying store.

Changes

  • Cached store async logic (pkg/store/cached_store.go): Adds buffered writeCh, a startWriteLoop writer goroutine, asyncWriteOp enqueueing, public SetMetadata() and DeleteMetadata() with synchronous fallback when full/stopped, and Close() updated to stop, close the channel, wait for the writer, then clear caches and close the underlying store.
  • Store metadata batch API and tracing (pkg/store/store.go, pkg/store/tracing.go, pkg/store/types.go): Introduces the MetadataKV type and the Metadata.BatchMetadata(ctx, puts, deletes) error interface method; implements DefaultStore.BatchMetadata; adds the traced wrapper tracedStore.BatchMetadata with an OpenTelemetry span and attributes.
  • Tests and mocks (pkg/store/cached_store_test.go, pkg/store/tracing_test.go, test/mocks/store.go): Adds tests exercising async metadata persistence, shutdown draining, and post-close behavior; extends the mock store with BatchMetadata support and a configurable callback for tests.
  • Changelog (CHANGELOG.md): Adds an Unreleased bullet documenting the async metadata write optimization (PR reference).
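The walkthrough says BatchMetadata applies puts and deletes as one transactional unit. A minimal in-memory sketch of that shape follows; MetadataKV comes from the PR, while the store and batch types here are stand-ins, not the real datastore-backed implementation:

```go
package main

import "fmt"

// MetadataKV mirrors the type the walkthrough introduces.
type MetadataKV struct {
	Key   string
	Value []byte
}

type store struct{ data map[string][]byte }

// batch stages mutations so they can be applied together; a real datastore
// batch would make Commit a single transaction against the backend.
type batch struct {
	s       *store
	puts    []MetadataKV
	deletes []string
}

func (b *batch) Commit() {
	for _, kv := range b.puts {
		b.s.data[kv.Key] = kv.Value
	}
	for _, k := range b.deletes {
		delete(b.s.data, k)
	}
}

// BatchMetadata stages puts and deletes, then commits them as one unit.
func (s *store) BatchMetadata(puts []MetadataKV, deletes []string) error {
	b := &batch{s: s, puts: puts, deletes: deletes}
	b.Commit()
	return nil
}

func main() {
	s := &store{data: map[string][]byte{"old": {1}}}
	_ = s.BatchMetadata([]MetadataKV{{Key: "h/da", Value: []byte{9}}}, []string{"old"})
	fmt.Println(len(s.data), s.data["h/da"][0]) // 1 9
}
```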

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant CachedStore
    participant Channel as Async Channel
    participant Writer as Writer Goroutine
    participant BaseStore

    Client->>CachedStore: SetMetadata(key, value)
    alt Channel has room & not stopped
        CachedStore->>Channel: Enqueue asyncWriteOp
        Channel-->>Writer: Receive op
        Writer->>BaseStore: BatchMetadata(ctx.Background(), puts/deletes)
        BaseStore-->>Writer: Success/Error
    else Channel full or stopped
        CachedStore->>BaseStore: SetMetadata(sync fallback)
        BaseStore-->>CachedStore: Result
    end
    CachedStore-->>Client: Return
sequenceDiagram
    participant Client
    participant CachedStore
    participant Channel as Write Channel
    participant Writer as Writer Goroutine
    participant BaseStore

    Client->>CachedStore: Close()
    CachedStore->>CachedStore: mark stopped, stop enqueueing
    CachedStore->>Channel: close channel
    CachedStore->>Writer: wait for writer to exit
    Writer->>Writer: drain queued operations
    loop For each remaining op
        Writer->>BaseStore: BatchMetadata(ctx.Background(), puts/deletes)
        BaseStore-->>Writer: Result
    end
    Writer-->>CachedStore: exit
    CachedStore->>CachedStore: clear caches
    CachedStore->>BaseStore: Close()
    BaseStore-->>CachedStore: closed
    CachedStore-->>Client: Return

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hop along a buffered line,
enqueue a key, enqueue a time;
a goroutine hums, then drains with grace,
closing softly—no writes misplaced. 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 16.67%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Title check (✅ Passed): The title 'perf(store): save metadata async' accurately summarizes the main change: asynchronous metadata write handling in the store to improve performance.
  • Description check (✅ Passed): The description provides detailed context, a problem statement, the solution approach, and justification for the change, including an AI summary explaining the implementation and rationale.
  • Linked Issues check (✅ Passed): Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



github-actions Bot commented Apr 28, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Apr 28, 2026, 6:00 PM

@julienrbrt julienrbrt marked this pull request as ready for review April 28, 2026 15:23

claude Bot commented Apr 28, 2026

Claude finished @julienrbrt's task in 2m 47s.


Code Review: perf(store): save metadata async

  • Read CLAUDE.md / AGENTS.md for repo guidelines
  • Review changed files
  • Analyze concurrency correctness
  • Assess error handling and resilience
  • Review test coverage
  • Post final review

Summary

The design is solid — moving metadata writes to an async, coalesced batch path removes them from Badger's critical write pipeline and is the right trade-off for this workload. The coalescing logic, the Close() drain guarantee, and the value copy for slice safety are all well thought out. A few issues need addressing before merge.


Critical: TOCTOU race → panic on closed channel

pkg/store/cached_store.go:257-265 and 271-280

SetMetadata and DeleteMetadata release stopMu.RUnlock() before they send to writeCh. If Close() runs between the unlock and the send, it will close(writeCh) while the sender is mid-flight → panic: send on closed channel.

Reproduction sequence:

goroutine A (SetMetadata):  acquires RLock → checks stopped=false → releases RLock
goroutine B (Close):        acquires WLock → stopped=true → close(writeCh) → releases WLock
goroutine A (SetMetadata):  cs.writeCh <- ...  ← PANIC
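That interleaving can be reproduced deterministically in a single-goroutine simulation of the two-goroutine schedule (this is an illustrative model of the race, not code from the PR):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	ch := make(chan int, 1)
	var mu sync.RWMutex
	stopped := false

	// Goroutine A (SetMetadata): checks stopped under RLock, then releases
	// the lock before sending.
	mu.RLock()
	ok := !stopped
	mu.RUnlock()

	// Goroutine B (Close) runs in the gap: marks stopped and closes the channel.
	mu.Lock()
	stopped = true
	close(ch)
	mu.Unlock()

	// Goroutine A resumes: the earlier check said it was safe, but the send
	// now hits a closed channel and panics.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic:", r) // panic: send on closed channel
		}
	}()
	if ok {
		ch <- 1
	}
}
```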

Fix: hold the read lock through the channel send (same pattern as the stopped fallback path):

func (cs *CachedStore) SetMetadata(ctx context.Context, key string, value []byte) error {
    cs.stopMu.RLock()
    defer cs.stopMu.RUnlock()

    if cs.stopped {
        return cs.Store.SetMetadata(ctx, key, value)
    }
    valueCopy := append([]byte(nil), value...)
    cs.writeCh <- asyncWriteOp{key: key, value: valueCopy}
    return nil
}

Same pattern for DeleteMetadata. Holding the read lock while blocking on a full channel is safe because the write loop keeps draining it, so the send will eventually succeed and the lock will be released before Close() can acquire the write lock.



Moderate: Silent async write failures

pkg/store/cached_store.go:97-103

The logger defaults to zerolog.Nop() with no option to inject a real one. Any BatchMetadata failure in the write loop is completely invisible — no returned error, no observable log. This makes diagnosing metadata corruption (the scenario described in the PR overview) impossible in production.

WithLogger should be a first-class option:

func WithLogger(l zerolog.Logger) CachedStoreOption {
    return func(cs *CachedStore) error {
        cs.logger = l
        return nil
    }
}

And the callers in components.go / wherever NewCachedStore is used should pass the node's logger.


Minor: Read-your-writes semantics not documented

pkg/store/cached_store.go:253-266

GetMetadata goes directly to the underlying store and bypasses the async queue. Code that calls SetMetadata("k", v) and then GetMetadata("k") within the same synchronous flow may read a stale value. The PR description mentions this for crash recovery, but callers of CachedStore (e.g. the pruner reading LastPrunedBlockHeightKey right after writing it) could be silently affected.

The SetMetadata docstring should call this out explicitly — something like: "The write is applied asynchronously; a subsequent GetMetadata may return the previous value until the background writer commits."


Minor: TestCachedStore_Close_FlushesPendingWrites bypasses Close()

pkg/store/cached_store_test.go:346-350

The test manually closes writeCh instead of calling cs.Close(). This tests the writer goroutine drain behaviour but does not test the actual Close() contract (which also calls ClearCache() and cs.Store.Close()). A follow-up assertion that the underlying base store is closed after the manual drain would strengthen this. Not blocking, but worth improving.


Nit: Missing t.Parallel() in two tests

pkg/store/cached_store_test.go:320 and 378

TestCachedStore_Close_FlushesPendingWrites and TestCachedStore_CoalescesSameKeyOps don't call t.Parallel(), unlike the other tests in the file.


Positive observations

  • The coalescing map (last[o.key] = o) correctly handles interleaved set/delete for the same key in a single drain window. The test TestCachedStore_CoalescesSameKeyOps confirms the intended semantics.
  • Defensive valueCopy := append([]byte(nil), value...) prevents slice aliasing between the caller and the writer goroutine. Good catch.
  • The batchWindow = 100µs is a sensible conservative default — small enough to stay sub-millisecond, large enough to coalesce the 3-4 per-height submitter calls that the PR motivation describes.
  • BatchMetadata on DefaultStore uses a proper ds.Batch transaction, so puts and deletes within one drain cycle are atomic.
  • Close() correctly waits for <-cs.done before calling cs.Store.Close(), ensuring all queued writes land before the database is shut down.
  • tracedStore.BatchMetadata wiring is clean and consistent with the other traced methods.

Verdict

Request changes on the critical TOCTOU race; the logger visibility is important but not blocking if there's a follow-up planned. The rest are minor.


codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 62.79070% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.48%. Comparing base (389e904) to head (2d3dfd7).

Files with missing lines | Patch % | Lines
pkg/store/tracing.go | 0.00% | 14 missing ⚠️
pkg/store/store.go | 33.33% | 5 missing, 5 partials ⚠️
pkg/store/cached_store.go | 85.96% | 6 missing, 2 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3298   +/-   ##
=======================================
  Coverage   62.48%   62.48%           
=======================================
  Files         122      122           
  Lines       13047    13133   +86     
=======================================
+ Hits         8152     8206   +54     
- Misses       4009     4034   +25     
- Partials      886      893    +7     
Flag Coverage Δ
combined 62.48% <62.79%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/store/cached_store.go (1)

91-98: ⚠️ Potential issue | 🟠 Major

Surface async metadata persistence failures via Close() or another observable path.

Enqueued metadata writes (SetMetadata, DeleteMetadata) return nil immediately. The background worker only logs failures, and the default logger is zerolog.Nop(), which silently discards errors. This means metadata persistence can fail undetected—no returned error, no emitted logs. Surface at least the first async writer error via Close() and make the logger configurable via a CachedStoreOption.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/store/cached_store.go` around lines 91 - 98, The background writer
currently drops async metadata errors (writes via writeCh from
SetMetadata/DeleteMetadata) and uses zerolog.Nop() by default; modify
CachedStore to accept a CachedStoreOption to set logger (replace zerolog.Nop())
and add an internal field to capture the first async write error (e.g.,
firstWriteErr protected by a mutex or a buffered error channel). Update the
background worker that processes asyncWriteOp to record the first non-nil error
into that field instead of only logging, close the worker and signal completion
as before, and change Close() to wait for the worker to finish and return the
captured error (if any). Ensure SetMetadata/DeleteMetadata still enqueue
immediately but that callers can observe persistence failures via Close() (or a
future exported accessor) and that the new option allows injecting a real logger
for immediate visibility.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/store/cached_store_test.go`:
- Around line 332-338: The test currently waits for the last write via
require.Eventually before calling cs.Close(), which prevents exercising
Close()'s flush path; change the flow to enqueue all writes (the loop that
writes keys to cs), call cs.Close() immediately (without waiting), then assert
directly against the underlying base store (e.g., the test's base store variable
or method such as base.GetMetadata / base data structure) that the queued writes
were flushed after Close(); replace the require.Eventually check against
cs.GetMetadata with a post-Close assertion on base to verify Close() drained the
queue.

In `@pkg/store/cached_store.go`:
- Around line 222-235: SetMetadata and DeleteMetadata currently enqueue by
reference and fall back to synchronous writes when writeCh is full, causing
ordering, aliasing and silent-failure issues; fix by making the enqueue blocking
(remove the non-blocking default path so older queued ops always run first),
copy the value slice into the asyncWriteOp before sending (avoid slice
aliasing), and extend asyncWriteOp with a response channel (e.g., resp chan
error) so callers block on the resp and return any error produced by the
background processor; apply the same changes to DeleteMetadata and update the
background goroutine to send the operation result on resp after executing
Store.SetMetadata/Store.DeleteMetadata to preserve ordering and surface async
errors.
- Around line 222-235: CachedStore.SetMetadata enqueues the caller's []byte
directly which can be mutated after return; make a defensive copy of value
before sending on cs.writeCh (e.g. newSlice := append([]byte(nil), value...))
and set asyncWriteOp.value to that copy so the writer goroutine sees an
immutable snapshot; update the enqueue path in SetMetadata to copy value before
creating/sending asyncWriteOp (retain the direct sync path unchanged).


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 00d3f79c-48ff-4ac8-bc17-c89e521974e8

📥 Commits

Reviewing files that changed from the base of the PR and between 389e904 and f7ea4ad.

📒 Files selected for processing (2)
  • pkg/store/cached_store.go
  • pkg/store/cached_store_test.go

Comment thread pkg/store/cached_store_test.go Outdated
Comment on lines +332 to +338
// Wait for the last key to land via pass-through read
require.Eventually(t, func() bool {
v, err := cs.GetMetadata(ctx, string([]byte{byte(n - 1)}))
return err == nil && len(v) == 1 && v[0] == byte(n-1)
}, 2*time.Second, 10*time.Millisecond)

require.NoError(t, cs.Close())

⚠️ Potential issue | 🟡 Minor

This no longer exercises the Close() flush path.

require.Eventually(...) waits until the last write is already visible before Close() runs, so the test passes even if Close() stops draining queued ops. To verify the new contract, enqueue the writes, call Close() immediately, then assert on base after close.


Comment thread pkg/store/cached_store.go Outdated
@julienrbrt julienrbrt marked this pull request as draft April 28, 2026 15:44
@julienrbrt julienrbrt marked this pull request as ready for review April 28, 2026 17:27

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/store/cached_store.go`:
- Around line 126-136: The conversion from ops to puts/deletes drops per-key
ordering so a Delete then Set for the same key can end up applied as delete; fix
by collapsing ops per key to the last mutation before calling BatchMetadata:
iterate the ops slice (from first to last), maintain a map[string]opInfo (or
simple marker) keyed by o.key that records the final operation and value, then
after the pass produce the puts []MetadataKV and deletes []string from that map
(for a final Set produce a put, for a final Delete produce a delete) and call
cs.Store.BatchMetadata with those; update the code around ops, MetadataKV, and
cs.Store.BatchMetadata to use this coalescing logic.
- Around line 117-123: The drain loop that reads from cs.writeCh should detect a
closed channel using the comma-ok form instead of a plain receive to avoid
receiving zero-value asyncWriteOp forever; update the select case in the loop
(the one labeled drain that currently does `case op := <-cs.writeCh:`) to `case
op, ok := <-cs.writeCh:` and if !ok break drain, otherwise append op to ops;
apply the same change to the other symmetric drain at the other location so the
goroutine can exit and the done channel can be closed on shutdown.
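The coalescing fix above (keep only the final mutation per key in a drain window) can be sketched standalone. The op struct and coalesce function here are hypothetical names for illustration, not the PR's actual types:

```go
package main

import "fmt"

type op struct {
	key    string
	value  []byte
	delete bool
}

// coalesce keeps only the last mutation per key, preserving first-seen key
// order for determinism. This is the property the review asks for: a Delete
// followed by a Set on the same key must end up applied as a Set.
func coalesce(ops []op) (puts map[string][]byte, deletes []string) {
	last := make(map[string]op, len(ops))
	order := make([]string, 0, len(ops))
	for _, o := range ops {
		if _, seen := last[o.key]; !seen {
			order = append(order, o.key)
		}
		last[o.key] = o // later ops overwrite earlier ones for the same key
	}
	puts = make(map[string][]byte)
	for _, k := range order {
		if o := last[k]; o.delete {
			deletes = append(deletes, k)
		} else {
			puts[k] = o.value
		}
	}
	return puts, deletes
}

func main() {
	ops := []op{
		{key: "a", delete: true},
		{key: "a", value: []byte{1}}, // later Set wins over earlier Delete
		{key: "b", delete: true},
	}
	puts, deletes := coalesce(ops)
	fmt.Println(len(puts), puts["a"][0], deletes[0]) // 1 1 b
}
```

The comma-ok fix is orthogonal: receiving with `op, ok := <-cs.writeCh` lets the drain loop distinguish a closed channel (ok == false, exit) from a real op, so shutdown can complete.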

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59e07316-b8a3-4aed-bf6e-5de11a84ea60

📥 Commits

Reviewing files that changed from the base of the PR and between f7ea4ad and 6c53bc4.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • pkg/store/cached_store.go
  • pkg/store/store.go
  • pkg/store/tracing.go
  • pkg/store/tracing_test.go
  • pkg/store/types.go
  • test/mocks/store.go
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md

Comment thread pkg/store/cached_store.go
Comment thread pkg/store/cached_store.go Outdated
@julienrbrt julienrbrt requested a review from tac0turtle April 28, 2026 18:01