Napi by maxime-leroy · Pull Request #629 · DPDK/grout

maxime-leroy · 2026-06-09T17:50:24Z

Opt-in --napi mode: an idle worker stops busy-polling and blocks on its rx queue interrupts via rte_eth_dev_rx_intr_* / rte_epoll_wait, resuming polling when a packet wakes it. A second commit pins the worker's uclamp_min to max so schedutil keeps the core at full clock while runnable, dropping only when it actually sleeps.

NAPI Idle Mode with RX-Queue Interrupt Blocking

Introduces an opt-in --napi mode that replaces busy-polling with a poll/interrupt hybrid for idle workers. Enabling --napi implies poll-mode and replaces the adaptive usleep() ramp with RX-queue interrupt blocking.

Idle Detection and Blocking Strategy

The main loop tracks consecutive "empty" housekeeping windows (intervals with no packets dequeued). After NAPI_EMPTY_WINDOWS (2) consecutive idle windows, napi_wait() is invoked, which:

Arms RX-queue interrupts via rte_eth_dev_rx_intr_enable() on all RX queues
Registers the worker's wakeup eventfd with the per-thread epoll instance (once per thread)
Registers armed queues with epoll via rte_eth_dev_rx_intr_ctl_q() if not already present (PMDs without RX-interrupt support continue polling)
Performs a recheck by polling queue counts to catch packets arriving between the idle decision and interrupt arming
Temporarily goes offline from RCU QSBR via rte_rcu_qsbr_thread_offline()
Waits up to NAPI_SETTLE_TRIES (3) times with a NAPI_SETTLE_MS (100 ms) timeout, then blocks indefinitely on rte_epoll_wait() with a timeout of -1
Re-enters RCU QSBR after waking
Drains any pending wakeup eventfd writes (used for reconfig/shutdown kicks)
Disarms only the queues it actually armed

Blocked duration is attributed to sleep_cycles (not busy_cycles) and the n_sleeps counter.
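
The idle-window bookkeeping described above can be sketched as follows. This is a minimal illustration, not the PR's code: the struct `idle_stats` and the helpers `window_end()` and `account_block()` are hypothetical names, and resetting the window count after a wakeup is an assumption. Only `NAPI_EMPTY_WINDOWS`, `busy_cycles`, `sleep_cycles`, and `n_sleeps` come from the description.

```c
#include <stdbool.h>
#include <stdint.h>

#define NAPI_EMPTY_WINDOWS 2 /* value quoted in the PR description */

/* Hypothetical per-worker counters; field names follow the description. */
struct idle_stats {
	unsigned empty_windows; /* consecutive housekeeping windows with no packets */
	uint64_t busy_cycles;
	uint64_t sleep_cycles;
	uint64_t n_sleeps;
};

/* Call at the end of each housekeeping window.
 * Returns true when the worker has been idle long enough to block. */
static bool window_end(struct idle_stats *s, uint64_t pkts_in_window)
{
	if (pkts_in_window > 0) {
		s->empty_windows = 0; /* any traffic restarts the count */
		return false;
	}
	s->empty_windows++;
	return s->empty_windows >= NAPI_EMPTY_WINDOWS;
}

/* Charge a blocked interval to the sleep counters, never to busy_cycles. */
static void account_block(struct idle_stats *s, uint64_t start, uint64_t end)
{
	s->sleep_cycles += end - start;
	s->n_sleeps++;
	s->empty_windows = 0; /* assumption: the count restarts after a wakeup */
}
```

A caller would invoke `window_end()` once per housekeeping window and, when it returns true, wrap the blocking call between two timestamp reads passed to `account_block()`.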

CPU Clock Management

When NAPI is enabled, worker_perf_floor() calls sched_setattr() with SCHED_FLAG_UTIL_CLAMP_MIN and sets the minimum utilization clamp to its maximum (SCHED_CAPACITY_SCALE, 1024). This pins the CPU to full clock while the worker is runnable, so RX-interrupt wakeups stay responsive; the schedutil governor lowers the frequency only when the worker actually sleeps. Kernels without uclamp support and callers with insufficient privileges are handled gracefully: EOPNOTSUPP, ENOSYS, EPERM, and EINVAL are logged at NOTICE level; any other error is logged as WARNING.
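
A minimal sketch of such a clamp request through the raw syscall, so it builds regardless of the libc version. The names `ex_sched_attr`, `EX_*`, `set_uclamp_min()`, and `uclamp_errno_expected()` are hypothetical, and combining the clamp with the KEEP_POLICY/KEEP_PARAMS flags is an assumption about how a caller would avoid touching the scheduling policy; the PR's worker_perf_floor() may differ.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the kernel's struct sched_attr (uclamp layout, 56 bytes).
 * Named differently so it cannot clash with a libc that declares its own. */
struct ex_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Values from the kernel UAPI header linux/sched.h. */
#define EX_SCHED_FLAG_KEEP_POLICY 0x08
#define EX_SCHED_FLAG_KEEP_PARAMS 0x10
#define EX_SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define EX_SCHED_CAPACITY_SCALE 1024

/* Raise uclamp_min of the calling thread; returns 0 or a positive errno. */
static int set_uclamp_min(uint32_t util_min)
{
	struct ex_sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	/* Keep the current policy and parameters, change only the clamp. */
	attr.sched_flags = EX_SCHED_FLAG_KEEP_POLICY | EX_SCHED_FLAG_KEEP_PARAMS
		| EX_SCHED_FLAG_UTIL_CLAMP_MIN;
	attr.sched_util_min = util_min;

	/* pid 0 means the calling thread; the last argument (flags) must be 0. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0) < 0)
		return errno;
	return 0;
}

/* Errors the description treats as expected (logged at NOTICE). */
static bool uclamp_errno_expected(int err)
{
	return err == EOPNOTSUPP || err == ENOSYS || err == EPERM || err == EINVAL;
}
```

A worker thread would call `set_uclamp_min(EX_SCHED_CAPACITY_SCALE)` once at startup and pick the log level from `uclamp_errno_expected()`.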

Worker Wakeup Mechanism

Added eventfd-based signaling for cross-worker interrupts (e.g., reconfig/shutdown kicks): worker_wakeup() writes a uint64_t value to the worker's wakeup_fd under the existing wakeup mutex, allowing workers blocked in rte_epoll_wait() to be awakened. Write errors are logged unless EAGAIN (treated as an already-pending, undrained kick).
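
The kick and drain pair can be sketched with plain Linux eventfd calls. This is an illustration under assumptions, not the PR's worker_wakeup(): `struct kick_target`, `kick_worker()`, and `drain_kicks()` are hypothetical names, and logging is replaced by a return code.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical worker handle holding only what the sketch needs. */
struct kick_target {
	int wakeup_fd; /* from eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC) */
	pthread_mutex_t lock;
};

/* Wake a worker blocked in epoll: returns 0, or a positive errno on failure.
 * EAGAIN means the counter is saturated, so a kick is already pending. */
static int kick_worker(struct kick_target *t)
{
	const uint64_t one = 1;
	int ret = 0;

	pthread_mutex_lock(&t->lock);
	if (write(t->wakeup_fd, &one, sizeof(one)) < 0 && errno != EAGAIN)
		ret = errno;
	pthread_mutex_unlock(&t->lock);
	return ret;
}

/* Worker side: consume pending kicks; returns how many were coalesced. */
static uint64_t drain_kicks(int wakeup_fd)
{
	uint64_t count = 0;

	/* A non-semaphore eventfd returns the whole counter and resets it. */
	if (read(wakeup_fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0; /* EAGAIN: nothing pending */
	return count;
}
```

Because the eventfd accumulates writes into one counter, several kicks sent while the worker is busy collapse into a single wakeup and a single drain.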

Configuration Changes

New --napi command-line flag enables the mode and forces poll_mode
New boolean field napi added to struct gr_config
Port configuration conditionally enables RX-queue interrupts when gr_config.napi is set
Build-time feature detection for sched_setattr support via HAVE_SCHED_SETATTR flag

coderabbitai · 2026-06-09T17:55:11Z

📝 Walkthrough

This PR introduces NAPI (New API) support, an adaptive interrupt-driven receive mode that complements the existing polling approach. A new -n/--napi command-line flag enables the feature and automatically activates poll mode. The datapath worker loop gains interrupt-aware idle logic: after consecutive idle windows, it arms RX queue interrupts, blocks on per-thread epoll waiting for events while taking QSBR threads offline, then disables interrupts and returns. The port configuration conditionally enables RX queue interrupts, and a CPU utilization floor is optionally set at worker startup.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/infra/datapath/main_loop.c`:
- Around line 224-227: The code unconditionally calls vec_add(*registered, *qm)
after attempting to register the queue with
rte_eth_dev_rx_intr_ctl_q(qm->port_id, qm->queue_id, RTE_EPOLL_PER_THREAD,
RTE_INTR_EVENT_ADD, NULL); so a failing registration still marks the queue as
registered and later gets skipped in napi_wait(). Change this to capture the
return value of rte_eth_dev_rx_intr_ctl_q, check for success (e.g., ret == 0),
only call vec_add(*registered, *qm) on success, and handle/log the failure path
(using process logging or similar) and do not mark the queue as registered when
the call fails; reference symbols: rte_eth_dev_rx_intr_ctl_q, vec_add,
registered, qm, napi_wait.
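
The record-only-on-success pattern this finding asks for can be sketched without DPDK. Everything here is hypothetical scaffolding: `struct rxq_set` is a fixed-size stand-in for the `registered` vector, and `ret` stands for the value returned by rte_eth_dev_rx_intr_ctl_q().

```c
#include <stdbool.h>
#include <stddef.h>

#define RXQ_SET_MAX 8 /* arbitrary capacity for the sketch */

struct rxq {
	unsigned port_id;
	unsigned queue_id;
};

/* Hypothetical fixed-size stand-in for the `registered` vector. */
struct rxq_set {
	struct rxq q[RXQ_SET_MAX];
	size_t len;
};

static bool rxq_set_contains(const struct rxq_set *set, const struct rxq *q)
{
	for (size_t i = 0; i < set->len; i++) {
		if (set->q[i].port_id == q->port_id && set->q[i].queue_id == q->queue_id)
			return true;
	}
	return false;
}

/* `ret` is the value returned by the registration call (0 on success).
 * Returns true only if the queue is recorded as registered afterwards. */
static bool record_if_registered(struct rxq_set *set, const struct rxq *q, int ret)
{
	if (ret != 0)
		return false; /* failed registration: leave the set untouched */
	if (rxq_set_contains(set, q))
		return true; /* already recorded, nothing to add */
	if (set->len == RXQ_SET_MAX)
		return false;
	set->q[set->len++] = *q;
	return true;
}
```

With this shape a queue whose registration failed is retried on the next wait instead of being skipped as already registered.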

📥 Commits

Reviewing files that changed from the base of the PR and between e850e7a and 5d8ba94.

📒 Files selected for processing (4)

main/config.h
main/main.c
modules/infra/control/port.c
modules/infra/datapath/main_loop.c

vjardin

sched_setattr() shall be used from glibc.

vjardin · 2026-06-09T18:56:09Z

+// utilization and downclocks the core even at line rate. Pin uclamp_min to the
+// max capacity: the governor runs the core at full speed while the worker is
+// runnable and lets it drop only when it actually sleeps on the interrupt.
+// glibc exposes neither struct sched_attr nor a sched_setattr() wrapper.


https://lists.gnu.org/archive/html/info-gnu/2025-01/msg00014.html ?

MortenBroerup · 2026-06-11T08:30:25Z

Disclaimer: I haven't looked at the patch in detail, so my feedback is high-level only.

Have you measured the RX interrupt wakeup latency? In other words: Are user-space interrupts fast enough for general use, or only for low-traffic hours?
AFAIK, sleep(), usleep() and nanosleep() all end up in the nanosleep() syscall.
With the default timerslack of 50 µs, nanosleep() wakeup latency exceeds 50 µs.
With timerslack 1 - prctl(PR_SET_TIMERSLACK, 1, 0, 0, 0) - I have seen nanosleep() wakeup latency around 2.5 µs on hardware, and 15-20 µs on a VMware virtual appliance.

Another detail:
We considered using RX interrupts in the SmartShare too, and I like the concept.
You should consider that other events may want to trigger graph wakeup too...
E.g. a host-originating packet (e.g. a CLI output via SSH).
Or a timer wheel or other polling-based dataplane timer triggering some high-frequency event, e.g. for calling rte_sched_port_dequeue().

Add an opt-in --napi mode where an idle worker arms the interrupts on its rx queues and blocks on them through the generic rte_eth_dev_rx_intr_* / rte_epoll_wait API instead of busy-polling. A packet wakes the worker, which disarms and resumes polling: the usual poll/interrupt hybrid, with the interrupt acting only as a doorbell since frames are still pulled by the graph walk.

A worker blocks only after staying idle for NAPI_EMPTY_WINDOWS housekeeping windows with all of its queues empty, so a single busy queue keeps it polling. --napi implies poll-mode and replaces the micro-sleep ramp with the interrupt block. As that block can last up to a second it is measured explicitly and the timestamp advanced past it, keeping the sleep in total_cycles but out of busy_cycles.

napi_wait() tracks the queues it actually armed and disarms them through a single exit path, so a queue without interrupt support does not leave its predecessors armed, and only marks a queue epoll-registered once. A PMD without rx queue interrupt support keeps polling.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
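
The "track what was armed, disarm through one exit path" idea from this commit message can be sketched with fake queues so it runs without DPDK. `struct fake_rxq`, `fake_intr_enable()`, `fake_intr_disable()`, and `wait_sketch()` are hypothetical stand-ins; the real napi_wait() uses rte_eth_dev_rx_intr_enable() and rte_eth_dev_rx_intr_disable() and does more between arming and disarming.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Fake RX queue standing in for a (port, queue) pair; not a DPDK type. */
struct fake_rxq {
	bool supports_intr;
	bool armed;
};

/* Stand-ins for the DPDK enable/disable calls. */
static int fake_intr_enable(struct fake_rxq *q)
{
	if (!q->supports_intr)
		return -ENOTSUP;
	q->armed = true;
	return 0;
}

static void fake_intr_disable(struct fake_rxq *q)
{
	q->armed = false;
}

/* Arm every queue, "block", then disarm exactly the queues that were armed.
 * Returns true if the worker could block, false if it must keep polling. */
static bool wait_sketch(struct fake_rxq *qs, size_t n)
{
	size_t n_armed = 0;
	bool blocked = false;

	for (size_t i = 0; i < n; i++) {
		if (fake_intr_enable(&qs[i]) < 0)
			goto out; /* this queue cannot interrupt: keep polling */
		n_armed++;
	}
	blocked = true; /* the real code would recheck the queues and block here */
out:
	/* Single exit path: queues [0, n_armed) are the only ones we armed. */
	for (size_t i = 0; i < n_armed; i++)
		fake_intr_disable(&qs[i]);
	return blocked;
}
```

Since arming stops at the first failure, the armed queues always form a prefix of the array, which is why a single counter is enough to undo the work.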

In --napi mode an idle worker blocks on the rxq interrupt, so the schedutil governor sees a low utilization and downclocks the core even when it later runs at line rate. Pin the worker's uclamp_min to the max capacity through sched_setattr(): the governor keeps the core at full speed while the worker is runnable and lets it drop only when it actually sleeps on the interrupt. glibc exposes neither struct sched_attr nor a sched_setattr() wrapper, so both are declared locally.

The syscall fails on kernels without uclamp support or without the privilege to set it, which would otherwise warn on every worker. Report the expected EOPNOTSUPP/ENOSYS/EPERM/EINVAL cases at NOTICE and keep WARNING for anything unexpected.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/infra/control/worker.c`:
- Around line 50-54: The issue is that attr (used with pthread_attr_destroy) may
be uninitialized if eventfd() fails and the function jumps to end; to fix,
ensure pthread_attr_t attr is initialized before any early goto that can skip
pthread_attr_init or rearrange the control flow so pthread_attr_destroy is only
called when pthread_attr_init succeeded: either move the
pthread_attr_init(&attr) before the eventfd() call (so attr is always
initialized) or add a boolean/flag (e.g., attr_initialized) set after
pthread_attr_init and check it before calling pthread_attr_destroy(&attr);
update references in this function where worker->wakeup_fd, eventfd,
pthread_attr_init, pthread_attr_destroy, and attr are used.

📥 Commits

Reviewing files that changed from the base of the PR and between 5d8ba94 and a5d9ec9.

📒 Files selected for processing (7)

main/config.h
main/main.c
meson.build
modules/infra/control/port.c
modules/infra/control/worker.c
modules/infra/control/worker.h
modules/infra/datapath/main_loop.c

✅ Files skipped from review due to trivial changes (1)

main/config.h

🚧 Files skipped from review as they are similar to previous changes (3)

main/main.c
modules/infra/control/port.c
modules/infra/datapath/main_loop.c

coderabbitai · 2026-06-11T15:58:44Z

+	worker->wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
+	if (worker->wakeup_fd < 0) {
+		ret = errno;
+		goto end;
+	}


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Undefined behavior: pthread_attr_destroy called on uninitialized attr when eventfd() fails.

If eventfd() fails, the goto end at line 53 jumps past pthread_attr_init() at line 58. Then pthread_attr_destroy(&attr) at line 77 is called on an uninitialized stack variable.

Proposed fix: initialize attr before eventfd or restructure gotos

+	pthread_attr_init(&attr);
+
 	worker->wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
 	if (worker->wakeup_fd < 0) {
 		ret = errno;
 		goto end;
 	}
 
 	CPU_ZERO(&cpuset);
 	CPU_SET(cpu_id, &cpuset);
-	pthread_attr_init(&attr);
 	if (!!(ret = pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset)))

maxime-leroy · 2026-06-11T15:59:36Z

Thanks @MortenBroerup, both points are spot on. Let me split them.

Wakeup latency / timerslack

The timerslack concern doesn't apply to the napi path: an idle worker doesn't
sleep on a timer, it blocks on rte_epoll_wait() on the rx queue interrupt
eventfd (VFIO/UIO), so the wakeup is interrupt-driven and timerslack never
enters the picture. timerslack only widens the timeout part of a sleep/poll,
not the delivery of a real fd event. (For completeness, the existing
sleep-based poll mode already lowers timerslack to 1us via
PR_SET_TIMERSLACK at worker start, so it isn't stuck at the 50us default
either.)

So the dominant latency term isn't the scheduler, it's the PMD's interrupt
coalescing. The IRQ fires on the first of "N frames queued" or "holdoff timer
expires". On DPAA2 the default is threshold = 7 frames, holdoff = 100us. Under
any real load the frame threshold trips first and the holdoff never matters;
the holdoff only adds latency in the trickle case (1-2 frames then silence),
i.e. exactly the near-idle regime this mode targets. Worst-case single-packet
wakeup at trickle is therefore bounded by the holdoff (~100us on DPAA2,
tunable via DPAA2_PORTAL_INTR_TIMEOUT).
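
The "first of N frames or holdoff" rule can be made concrete with a toy model. It assumes evenly spaced frames and a holdoff timer that starts when the first frame is queued; both are simplifications for illustration, not a description of the DPAA2 hardware, and `first_frame_irq_delay_us()` is a hypothetical helper.

```c
#include <stdint.h>

/* Toy model: frames arrive every `gap_us`, the first one at t = 0. The IRQ
 * fires when `threshold` frames are queued or `holdoff_us` after the first
 * frame, whichever comes first. Requires n_frames >= 1 and threshold >= 1.
 * Returns the delay between the first frame and the IRQ, in microseconds. */
static uint32_t first_frame_irq_delay_us(uint32_t n_frames, uint32_t gap_us,
					 uint32_t threshold, uint32_t holdoff_us)
{
	if (n_frames >= threshold) {
		/* Time at which the frame that trips the threshold arrives. */
		uint32_t t_threshold = (threshold - 1) * gap_us;
		if (t_threshold < holdoff_us)
			return t_threshold;
	}
	/* Too few frames, or they arrive too slowly: the holdoff decides. */
	return holdoff_us;
}
```

With the defaults quoted above (threshold 7, holdoff 100 us), a lone frame waits the full 100 us in this model, while a burst of frames 1 us apart raises the interrupt after 6 us.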

To answer the underlying question directly: this is opt-in (--napi) and aimed
at cutting power/clocks during quiet periods, not at minimum-latency max-pps
forwarding. It's "general use during low/moderate load", and you'd leave it off
on a latency-critical fast path. I haven't done a rigorous latency sweep across
PMDs yet; if that would help I can gather numbers.

Other wakeup sources

Agreed this is the important design point. Two cases:

Control-plane events already work: the worker's own wakeup eventfd is in the
same per-thread epoll set, so a reconfig or shutdown breaks the block
immediately instead of waiting for a packet. That's the generic kick path, so
anything host-originated that needs to poke a sleeping worker can reuse it.

Datapath timers: today grout's datapath is purely packet-driven, there's no
timer wheel, no rte_sched, nothing time-driven that a sleeping worker would
starve, so blocking until the next packet is safe as-is. The day a periodic
fast-path event shows up (QoS dequeue / shaping via rte_sched_port_dequeue(),
TX pacing, datapath aging), an indefinite block would indeed starve it, and the
fix is the standard one: add a timerfd to the same epoll set (or cap the
epoll_wait timeout to the next deadline) so the worker wakes on a packet or a
deadline. Same shape as the wakeup eventfd that's already wired in, so the
extension point is there when it's needed.

coderabbitai Bot reviewed Jun 9, 2026

Comment thread modules/infra/datapath/main_loop.c Outdated

vjardin suggested changes Jun 9, 2026

maxime-leroy added 2 commits June 11, 2026 17:15

maxime-leroy force-pushed the napi branch from 5d8ba94 to a5d9ec9 Compare June 11, 2026 15:54

coderabbitai Bot reviewed Jun 11, 2026

Napi #629: maxime-leroy wants to merge 2 commits into DPDK:main from maxime-leroy:napi
