Napi #629
📝 Walkthrough
This PR introduces NAPI (New API) support, an adaptive interrupt-driven receive mode that complements the existing polling approach.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modules/infra/datapath/main_loop.c`:
- Around line 224-227: The code unconditionally calls vec_add(*registered, *qm)
after attempting to register the queue with
rte_eth_dev_rx_intr_ctl_q(qm->port_id, qm->queue_id, RTE_EPOLL_PER_THREAD,
RTE_INTR_EVENT_ADD, NULL); so a failing registration still marks the queue as
registered and later gets skipped in napi_wait(). Change this to capture the
return value of rte_eth_dev_rx_intr_ctl_q, check for success (e.g., ret == 0),
only call vec_add(*registered, *qm) on success, and handle/log the failure path
(using process logging or similar) and do not mark the queue as registered when
the call fails; reference symbols: rte_eth_dev_rx_intr_ctl_q, vec_add,
registered, qm, napi_wait.
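The fix requested above can be sketched as follows. This is a minimal illustration, not the code in main_loop.c: `intr_ctl_stub()` stands in for `rte_eth_dev_rx_intr_ctl_q()` so the example builds without DPDK, and `struct rxq`, `struct rxq_list` and `register_queue()` are hypothetical names invented for the sketch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical queue descriptor; grout's real queue map type differs.
struct rxq {
	uint16_t port_id;
	uint16_t queue_id;
};

// Hypothetical fixed-size stand-in for the vector of registered queues.
struct rxq_list {
	struct rxq items[16];
	unsigned len;
};

// Stub standing in for rte_eth_dev_rx_intr_ctl_q(): returns 0 on success,
// a negative errno otherwise. Odd queue ids pretend to lack interrupt support.
int intr_ctl_stub(uint16_t port_id, uint16_t queue_id) {
	(void)port_id;
	return (queue_id & 1) ? -ENOTSUP : 0;
}

// Record the queue as registered only when the control call succeeded.
int register_queue(struct rxq_list *registered, const struct rxq *qm) {
	int ret = intr_ctl_stub(qm->port_id, qm->queue_id);
	if (ret != 0) {
		fprintf(stderr, "port %u rxq %u: interrupt registration failed: %d\n",
			(unsigned)qm->port_id, (unsigned)qm->queue_id, ret);
		return ret; // not marked registered, so a later pass can retry
	}
	if (registered->len == 16)
		return -ENOSPC;
	registered->items[registered->len++] = *qm;
	return 0;
}
```

With this shape, a queue whose registration failed never enters the registered list, so a later wait does not skip it on the wrong assumption that it is already in the epoll set.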
ℹ️ Review info
📒 Files selected for processing (4)
- main/config.h
- main/main.c
- modules/infra/control/port.c
- modules/infra/datapath/main_loop.c
vjardin left a comment
sched_setattr() shall be used from glibc.
// utilization and downclocks the core even at line rate. Pin uclamp_min to the
// max capacity: the governor runs the core at full speed while the worker is
// runnable and lets it drop only when it actually sleeps on the interrupt.
// glibc exposes neither struct sched_attr nor a sched_setattr() wrapper.
Disclaimer: I haven't looked at the patch in detail, so my feedback is high-level only.
Have you measured the RX interrupt wakeup latency? In other words: Are user-space interrupts fast enough for general use, or only for low-traffic hours?
Another detail:
Add an opt-in --napi mode where an idle worker arms the interrupts on its rx queues and blocks on them through the generic rte_eth_dev_rx_intr_* / rte_epoll_wait API instead of busy-polling. A packet wakes the worker, which disarms and resumes polling: the usual poll/interrupt hybrid, with the interrupt acting only as a doorbell since frames are still pulled by the graph walk.

A worker blocks only after staying idle for NAPI_EMPTY_WINDOWS housekeeping windows with all of its queues empty, so a single busy queue keeps it polling.

--napi implies poll-mode and replaces the micro-sleep ramp with the interrupt block. As that block can last up to a second it is measured explicitly and the timestamp advanced past it, keeping the sleep in total_cycles but out of busy_cycles.

napi_wait() tracks the queues it actually armed and disarms them through a single exit path, so a queue without interrupt support does not leave its predecessors armed, and only marks a queue epoll-registered once. A PMD without rx queue interrupt support keeps polling.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
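The idle-detection rule in this commit message can be sketched as a small counter. This is an illustration only: `struct idle_tracker` and `idle_window_end()` are hypothetical names, and the value 2 for `NAPI_EMPTY_WINDOWS` is the one quoted in the PR summary, not read from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

// Value quoted in the PR summary; treat it as an assumption.
#define NAPI_EMPTY_WINDOWS 2

// Hypothetical per-worker state for this sketch.
struct idle_tracker {
	unsigned empty_windows; // consecutive windows with no packets
};

// Call once per housekeeping window with the number of packets dequeued
// from all of the worker's rx queues during that window. Returns true when
// the worker has been idle long enough to arm interrupts and block.
bool idle_window_end(struct idle_tracker *t, uint64_t pkts_in_window) {
	if (pkts_in_window > 0) {
		t->empty_windows = 0; // a single busy queue keeps the worker polling
		return false;
	}
	if (t->empty_windows < NAPI_EMPTY_WINDOWS)
		t->empty_windows++;
	return t->empty_windows >= NAPI_EMPTY_WINDOWS;
}
```

Because the count is taken over all queues of the worker, any traffic resets it and the worker keeps polling.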
In --napi mode an idle worker blocks on the rxq interrupt, so the schedutil governor sees a low utilization and downclocks the core even when it later runs at line rate. Pin the worker's uclamp_min to the max capacity through sched_setattr(): the governor keeps the core at full speed while the worker is runnable and lets it drop only when it actually sleeps on the interrupt.

glibc exposes neither struct sched_attr nor a sched_setattr() wrapper, so both are declared locally.

The syscall fails on kernels without uclamp support or without the privilege to set it, which would otherwise warn on every worker. Report the expected EOPNOTSUPP/ENOSYS/EPERM/EINVAL cases at NOTICE and keep WARNING for anything unexpected.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
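A minimal sketch of the raw-syscall approach this commit message describes, for Linux only. It is not the PR's `worker_perf_floor()`: the struct is a local mirror of the kernel's `struct sched_attr` under a different name (`struct uclamp_sched_attr`) so it cannot clash with a libc that already defines the type, the `SK_*` macros restate kernel UAPI values under sketch-local names, and the use of the keep-policy/keep-params flags is an assumption about how one avoids touching the thread's scheduling policy.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

// Local mirror of the kernel's struct sched_attr (uclamp-capable layout).
struct uclamp_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

// Kernel UAPI values (SCHED_FLAG_KEEP_POLICY, SCHED_FLAG_KEEP_PARAMS,
// SCHED_FLAG_UTIL_CLAMP_MIN, SCHED_CAPACITY_SCALE) under sketch-local names.
#define SK_FLAG_KEEP_POLICY 0x08
#define SK_FLAG_KEEP_PARAMS 0x10
#define SK_FLAG_UTIL_CLAMP_MIN 0x20
#define SK_CAPACITY_SCALE 1024

// Pin the calling thread's uclamp_min to full capacity.
// Returns 0 on success or a positive errno value.
int perf_floor_set(void) {
	struct uclamp_sched_attr attr = {
		.size = sizeof(attr),
		.sched_flags = SK_FLAG_KEEP_POLICY | SK_FLAG_KEEP_PARAMS
			| SK_FLAG_UTIL_CLAMP_MIN,
		.sched_util_min = SK_CAPACITY_SCALE,
	};
	// pid 0 targets the calling thread; the trailing flags argument must be 0.
	if (syscall(SYS_sched_setattr, 0, &attr, 0) < 0)
		return errno;
	return 0;
}

// True for the failures the commit message treats as expected (NOTICE level).
bool perf_floor_errno_expected(int err) {
	return err == EOPNOTSUPP || err == ENOSYS || err == EPERM || err == EINVAL;
}
```

A caller would log at NOTICE when `perf_floor_errno_expected()` returns true for the value returned by `perf_floor_set()`, and at WARNING otherwise.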
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modules/infra/control/worker.c`:
- Around line 50-54: The issue is that attr (used with pthread_attr_destroy) may
be uninitialized if eventfd() fails and the function jumps to end; to fix,
ensure pthread_attr_t attr is initialized before any early goto that can skip
pthread_attr_init or rearrange the control flow so pthread_attr_destroy is only
called when pthread_attr_init succeeded: either move the
pthread_attr_init(&attr) before the eventfd() call (so attr is always
initialized) or add a boolean/flag (e.g., attr_initialized) set after
pthread_attr_init and check it before calling pthread_attr_destroy(&attr);
update references in this function where worker->wakeup_fd, eventfd,
pthread_attr_init, pthread_attr_destroy, and attr are used.
ℹ️ Review info
📒 Files selected for processing (7)
- main/config.h
- main/main.c
- meson.build
- modules/infra/control/port.c
- modules/infra/control/worker.c
- modules/infra/control/worker.h
- modules/infra/datapath/main_loop.c
✅ Files skipped from review due to trivial changes (1)
- main/config.h
🚧 Files skipped from review as they are similar to previous changes (3)
- main/main.c
- modules/infra/control/port.c
- modules/infra/datapath/main_loop.c
worker->wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
if (worker->wakeup_fd < 0) {
	ret = errno;
	goto end;
}
Undefined behavior: pthread_attr_destroy called on uninitialized attr when eventfd() fails.
If eventfd() fails, the goto end at line 53 jumps past pthread_attr_init() at line 58. Then pthread_attr_destroy(&attr) at line 77 is called on an uninitialized stack variable.
Proposed fix: initialize attr before eventfd or restructure gotos
+ pthread_attr_init(&attr);
+
worker->wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
if (worker->wakeup_fd < 0) {
ret = errno;
goto end;
}
CPU_ZERO(&cpuset);
CPU_SET(cpu_id, &cpuset);
- pthread_attr_init(&attr);
if (!!(ret = pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset)))
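The other option mentioned in the comment, a flag set after `pthread_attr_init()`, could look like the sketch below. `wakeup_fd_and_attr_setup()` is a hypothetical helper, not the function in worker.c; it keeps only the eventfd/attr ordering and the guarded cleanup, and leaves out affinity setup and thread creation.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Hypothetical helper: shows only the ordering and the guarded cleanup.
// Returns 0 on success or a positive errno value; on success the caller
// owns *wakeup_fd.
int wakeup_fd_and_attr_setup(int *wakeup_fd) {
	pthread_attr_t attr;
	bool attr_initialized = false;
	int ret = 0;

	*wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
	if (*wakeup_fd < 0) {
		ret = errno;
		goto end; // attr is still uninitialized on this path
	}

	if ((ret = pthread_attr_init(&attr)) != 0)
		goto end;
	attr_initialized = true;

	// Affinity setup and pthread_create(..., &attr, ...) would go here.

end:
	// Destroy only what was actually initialized.
	if (attr_initialized)
		pthread_attr_destroy(&attr);
	if (ret != 0 && *wakeup_fd >= 0) {
		close(*wakeup_fd);
		*wakeup_fd = -1;
	}
	return ret;
}
```

Moving `pthread_attr_init()` ahead of `eventfd()`, as in the proposed diff, is the smaller change; the flag variant is mainly useful when more early exits are expected to appear before the attribute is initialized.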
Thanks @MortenBroerup, both points are spot on. Let me split them.
Wakeup latency / timerslack
The timerslack concern doesn't apply to the napi path: an idle worker doesn't
So the dominant latency term isn't the scheduler, it's the PMD's interrupt
To answer the underlying question directly: this is opt-in (
Other wakeup sources
Agreed this is the important design point. Two cases:
Control-plane events already work: the worker's own wakeup eventfd is in the
Datapath timers: today grout's datapath is purely packet-driven, there's no
Opt-in --napi mode: an idle worker stops busy-polling and blocks on its rx queue interrupts via rte_eth_dev_rx_intr_* / rte_epoll_wait, resuming polling when a packet wakes it. A second commit pins the worker's uclamp_min to max so schedutil keeps the core at full clock while runnable, dropping only when it actually sleeps.
NAPI Idle Mode with RX-Queue Interrupt Blocking
Introduces opt-in `--napi` mode that replaces busy-polling with a poll/interrupt hybrid approach for idle workers. When NAPI is enabled, the mode implies poll-mode and replaces the adaptive usleep() ramp with RX-queue interrupt blocking.
Idle Detection and Blocking Strategy
The main loop tracks consecutive "empty" housekeeping windows (intervals with no packets dequeued). After `NAPI_EMPTY_WINDOWS` (2) consecutive idle windows, `napi_wait()` is invoked, which:
- Calls `rte_eth_dev_rx_intr_enable()` on all RX queues
- Registers each queue through `rte_eth_dev_rx_intr_ctl_q()` if not already present (PMDs without RX-interrupt support continue polling)
- Calls `rte_rcu_qsbr_thread_offline()`
- Waits up to `NAPI_SETTLE_TRIES` (3) times with a `NAPI_SETTLE_MS` (100 ms) timeout, then blocks indefinitely with a `-1` timeout on `rte_epoll_wait()`
Blocked duration is attributed to sleep_cycles (not busy_cycles) and the n_sleeps counter.
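The accounting rule in the last sentence above can be sketched as follows. The counter names come from the PR text, but `struct worker_stats` and `account_block()` are hypothetical and the way grout actually updates its statistics may differ; timestamps are plain cycle counts supplied by the caller.

```c
#include <stdint.h>

// Hypothetical counters named after the fields mentioned in the PR text.
struct worker_stats {
	uint64_t total_cycles;
	uint64_t busy_cycles;
	uint64_t sleep_cycles;
	uint64_t n_sleeps;
};

// Account for a blocking wait that spanned [before, after].
// `*last` is the timestamp up to which total_cycles was already accounted.
void account_block(struct worker_stats *s, uint64_t *last,
		   uint64_t before, uint64_t after) {
	s->sleep_cycles += after - before; // time spent blocked on the interrupt
	s->n_sleeps++;
	s->total_cycles += after - *last; // the sleep still counts as elapsed time
	*last = after; // advance past the block so it never reaches busy_cycles
}
```

Advancing the timestamp past the block is what keeps a long wait from being charged to the next busy measurement.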
CPU Clock Management
When NAPI is enabled, `worker_perf_floor()` applies `sched_setattr()` with `SCHED_FLAG_UTIL_CLAMP_MIN` and the minimum clamp set to the maximum (1024, SCHED_CAPACITY_SCALE) to pin the CPU to full clock while the worker is runnable, ensuring responsive RX-interrupt wakeup latency. The kernel scheduler drops frequency only when the worker actually sleeps. Kernels without uclamp support and callers without sufficient privileges are handled gracefully: `EOPNOTSUPP`, `ENOSYS`, `EPERM`, or `EINVAL` errors are logged at NOTICE level; other errors are logged as WARNING.
Worker Wakeup Mechanism
Added eventfd-based signaling for cross-worker interrupts (e.g., reconfig/shutdown kicks): `worker_wakeup()` writes a uint64_t value to the worker's wakeup_fd under the existing wakeup mutex, allowing workers blocked in `rte_epoll_wait()` to be awakened. Write errors are logged unless the error is `EAGAIN` (treated as an already-pending, undrained kick).
Configuration Changes
- `--napi` command-line flag enables the mode and forces poll_mode
- `napi` added to `struct gr_config`
- `gr_config.napi` is set
- `sched_setattr` support via `HAVE_SCHED_SETATTR` flag
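The eventfd kick described under Worker Wakeup Mechanism can be sketched with the plain Linux API. `wakeup_kick()` and `wakeup_drain()` are hypothetical names, the mutex mentioned in the summary is omitted for brevity, and the file descriptor is assumed to have been created with `EFD_NONBLOCK`.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Kick a worker blocked on its wakeup eventfd. Returns 0 on success or when
// a kick is already pending, a positive errno value otherwise.
int wakeup_kick(int wakeup_fd) {
	uint64_t one = 1;
	if (write(wakeup_fd, &one, sizeof(one)) < 0) {
		int err = errno;
		if (err == EAGAIN)
			return 0; // counter saturated: an undrained kick is pending
		fprintf(stderr, "wakeup write failed: %s\n", strerror(err));
		return err;
	}
	return 0;
}

// Worker side: drain pending kicks. Returns how many were pending, 0 if none.
uint64_t wakeup_drain(int wakeup_fd) {
	uint64_t count = 0;
	if (read(wakeup_fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

An eventfd accumulates writes into a single counter, so several kicks sent before the worker wakes up collapse into one readable event, which fits a doorbell-style wakeup.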