Skip to content

[fix][broker] Avoid blocking the metrics thread on the pending-ack managed ledger#26062

Open
merlimat wants to merge 1 commit into
apache:masterfrom
merlimat:mmerli/fix-txn-aggregator-metrics-blocking
Open

[fix][broker] Avoid blocking the metrics thread on the pending-ack managed ledger#26062
merlimat wants to merge 1 commit into
apache:masterfrom
merlimat:mmerli/fix-txn-aggregator-metrics-blocking

Conversation

@merlimat

Copy link
Copy Markdown
Contributor

Motivation

TransactionAggregator.generate (per-subscription transaction metrics) read the pending-ack store's managed ledger with ((PersistentSubscription) subscription).getPendingAckManageLedger().get() — an un-timed CompletableFuture.get() running on the prometheus-stats executor (a bounded, 4-thread pool).

The guard checkIfPendingAckStoreInit() only verifies pendingAckStoreFuture.isDone(); getStoreManageLedger() then composes pendingAckStoreFuture.thenCompose(store -> store.getManagedLedger()), so the managed-ledger future can still be resolving. A stuck future blocks that metrics thread, and enough of them exhaust the pool — hanging all metric scraping.

Modifications

Use getNow(null) instead of get(): if the managed ledger is already resolved it is used exactly as before; if not, this subscription's pending-ack-store stats are skipped for the current scrape and picked up on the next one. A future that completed exceptionally still throws into the existing catch (unchanged).

Verifying this change

Covered by existing tests, which pass: TransactionMetricsTest.testManagedLedgerMetrics (ledger ready → stats present) and testManagedLedgerMetricsWhenPendingAckNotInit (store not initialized → skipped).

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

…naged ledger

TransactionAggregator.generate read the pending-ack store's managed ledger with
((PersistentSubscription) subscription).getPendingAckManageLedger().get() -- an
un-timed CompletableFuture.get() running on the prometheus-stats executor (a
bounded, 4-thread pool). The guard checkIfPendingAckStoreInit() only verifies
pendingAckStoreFuture.isDone(); getStoreManageLedger() then composes
pendingAckStoreFuture.thenCompose(store -> store.getManagedLedger()), so the
managed-ledger future can still be resolving. A stuck future blocks that metrics
thread, and enough of them exhaust the pool, hanging all metric scraping.

Use getNow(null) instead of get(): if the managed ledger is already resolved it
is used as before; if not, this subscription's pending-ack-store stats are
skipped for the current scrape and picked up on the next one. A future that
completed exceptionally still throws into the existing catch (unchanged).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants