Optimize decrefcount by ebm · Pull Request #45821 · envoyproxy/envoy

ebm · 2026-06-24T03:17:39Z

Commit Message: Adds a lock-free fast path to decRefCount

Additional Description: The original decRefCount holds the global allocator lock for every decrement. The lock is needed to prevent a stat with the same name from being created while another thread decrements ref_count_ to 0. When ref_count_ > 1, the decrement cannot reach 0, so it can take a lock-free fast path using an atomic CAS (compare-and-swap) operation. The CAS decRefCount has two advantages over the original implementation:

Multiple threads can decrement and allocate new stats at the same time. Holding the global allocator lock prevents this, and a contended lock also requires syscalls.
Fewer atomic operations: 1 CAS for the optimized decRefCount, versus 1 CAS to lock plus 2 atomic operations to unlock and decrement ref_count_ for the original decRefCount.

There is a small theoretical cost for the optimized decRefCount. When ref_count_ == 1, the CAS decRefCount needs an extra atomic relaxed load (~1 ns difference during benchmarking).
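The fast path and slow path described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `RefCountedStat`, the `refCount()` accessor, the constructor signature, and the memory orderings are all assumptions chosen for the example. The key property is that the CAS loop never performs the 1 -> 0 transition, so a decrement to zero only ever happens while the allocator mutex is held.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a ref-counted stat; not Envoy's actual class.
class RefCountedStat {
public:
  RefCountedStat(std::mutex& allocator_mutex, uint32_t initial_refs)
      : allocator_mutex_(allocator_mutex), ref_count_(initial_refs) {}

  // Returns true if this call released the last reference.
  bool decRefCount() {
    // The relaxed load is the extra operation paid when ref_count_ == 1.
    uint32_t count = ref_count_.load(std::memory_order_relaxed);

    // Fast path: while at least one other reference remains, decrement
    // with a CAS and never touch the allocator lock.
    while (count > 1) {
      if (ref_count_.compare_exchange_weak(count, count - 1,
                                           std::memory_order_acq_rel,
                                           std::memory_order_relaxed)) {
        return false;
      }
      // On failure, `count` now holds the freshly observed value; retry.
    }

    // Slow path: this may be the last reference, so take the lock to keep
    // a same-named stat from being created during the decrement to 0.
    std::lock_guard<std::mutex> lock(allocator_mutex_);
    assert(ref_count_.load(std::memory_order_relaxed) > 0);
    return ref_count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
  }

  uint32_t refCount() const { return ref_count_.load(std::memory_order_relaxed); }

private:
  std::mutex& allocator_mutex_;
  std::atomic<uint32_t> ref_count_;
};
```

In this sketch, a caller that gets `true` back would be responsible for removing and freeing the stat; how that is coordinated with the allocator is outside the scope of the example.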

Benchmarks summary:

2.1x speedup for single-threaded decrements of a single ref_count_.
20–245x speedup for multithreaded decrements of distinct ref_count_ values.
1.7–4.6x speedup for multithreaded decrements of a single ref_count_.
1 ns regression (304 -> 305 ns) for a single-threaded decrement when ref_count_ == 1 (because of the extra atomic relaxed load).
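As a cross-check, the summary ratios can be reproduced from the "Time" column of the full benchmark tables. The snippet below is only a worked calculation; the helper names `speedup` and `between` are made up for illustration, and the checks run at compile time.

```cpp
// Speedup = original time / optimized time, both in ns ("Time" column).
constexpr double speedup(double original_ns, double optimized_ns) {
  return original_ns / optimized_ns;
}

// True if value lies in the closed interval [low, high].
constexpr bool between(double value, double low, double high) {
  return value >= low && value <= high;
}

// bmDecRefCountFastPathSingleThread: 5.30 ns -> 2.51 ns, about 2.1x.
static_assert(between(speedup(5.30, 2.51), 2.1, 2.2), "single-threaded fast path");

// MultiThreadDistinctStat, threads:2: 52.3 ns -> 2.61 ns, about 20x.
static_assert(between(speedup(52.3, 2.61), 20.0, 20.1), "distinct stats, 2 threads");

// MultiThreadDistinctStat, threads:16: 1125 ns -> 4.59 ns, about 245x.
static_assert(between(speedup(1125.0, 4.59), 245.0, 245.2), "distinct stats, 16 threads");

// MultiThreadSameStat, threads:16: 1164 ns -> 665 ns, about 1.7x.
static_assert(between(speedup(1164.0, 665.0), 1.7, 1.8), "same stat, 16 threads");

// MultiThreadSameStat, threads:8: 631 ns -> 136 ns, about 4.6x.
static_assert(between(speedup(631.0, 136.0), 4.6, 4.7), "same stat, 8 threads");
```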

Full Benchmark

### Original:
------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------
bmDecRefCountFastPathSingleThread                                       5.30 ns         5.30 ns    129486456
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:1            5.26 ns         5.25 ns    132120435
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:2             124 ns          123 ns      5683960
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:4             231 ns          212 ns      3057072
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:8             631 ns          394 ns       800000
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:16           1164 ns          438 ns       649312
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:1        5.13 ns         5.13 ns    136056691
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:2        52.3 ns         52.3 ns     11502592
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:4         184 ns          175 ns      4299432
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:8         462 ns          281 ns      1381520
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:16       1125 ns          425 ns       707456
bmDecRefCountSlowPathSingleThread/iterations:262144                      304 ns          304 ns       262144

### Optimized:
------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------
bmDecRefCountFastPathSingleThread                                       2.51 ns         2.51 ns    270946766
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:1            2.60 ns         2.60 ns    277691125
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:2            32.1 ns         32.1 ns     21338224
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:4            63.2 ns         63.2 ns     11335356
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:8             136 ns          136 ns      4619312
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:16            665 ns          537 ns      1321792
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:1        2.60 ns         2.60 ns    274380078
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:2        2.61 ns         2.61 ns    270253628
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:4        2.65 ns         2.65 ns    263411788
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:8        3.10 ns         3.10 ns    220235360
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:16       4.59 ns         3.71 ns    160133760
bmDecRefCountSlowPathSingleThread/iterations:262144                      305 ns          305 ns       262144

Risk Level: Medium to high. Affects the allocation and freeing of all stats (changes when the global allocator mutex is acquired).

Testing: Passes all existing allocator tests (allocator_test.cc and thread_local_store_test.cc).

Docs Changes: N/A

Release Notes: N/A

Platform Specific Features: Should affect ARM more than x86 architectures (atomics and locking syscalls cost more on ARM).

Signed-off-by: Ethan Marantz <ebmarantz@gmail.com>

repokitteh-read-only · 2026-06-24T03:17:45Z

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #45821 was opened by ebm.

ebm added 3 commits June 19, 2026 21:41

refactored decRefCount to conditionally hold the global allocator lock

eba8eac

Signed-off-by: Ethan Marantz <ebmarantz@gmail.com>

added benchmark to measure the fast path decRefCount

0f87919

Signed-off-by: Ethan Marantz <ebmarantz@gmail.com>

fixed comments

1b68b93

Signed-off-by: Ethan Marantz <ebmarantz@gmail.com>

fixed formatting

a23dd1b

Signed-off-by: Ethan Marantz <ebmarantz@gmail.com>

ebm marked this pull request as ready for review June 25, 2026 16:41

ebm wants to merge 4 commits into envoyproxy:main from ebm:optimize-decrefcount
