Napi#629

Open
maxime-leroy wants to merge 2 commits into
DPDK:main from
maxime-leroy:napi
Conversation

@maxime-leroy maxime-leroy commented Jun 9, 2026

Collaborator

Opt-in --napi mode: an idle worker stops busy-polling and blocks on its rx queue interrupts via rte_eth_dev_rx_intr_* / rte_epoll_wait, resuming polling when a packet wakes it. A second commit pins the worker's uclamp_min to max so schedutil keeps the core at full clock while runnable, dropping only when it actually sleeps.

NAPI Idle Mode with RX-Queue Interrupt Blocking

Introduces an opt-in --napi mode that replaces busy-polling with a poll/interrupt hybrid for idle workers. Enabling NAPI implies poll-mode and replaces the adaptive usleep() ramp with RX-queue interrupt blocking.

Idle Detection and Blocking Strategy

The main loop tracks consecutive "empty" housekeeping windows (intervals with no packets dequeued). After NAPI_EMPTY_WINDOWS (2) consecutive idle windows, napi_wait() is invoked, which:

  • Arms RX-queue interrupts via rte_eth_dev_rx_intr_enable() on all RX queues
  • Registers the worker's wakeup eventfd with the per-thread epoll instance (once per thread)
  • Registers armed queues with epoll via rte_eth_dev_rx_intr_ctl_q() if not already present (PMDs without RX-interrupt support continue polling)
  • Performs a recheck by polling queue counts to catch packets arriving between the idle decision and interrupt arming
  • Temporarily goes offline from RCU QSBR via rte_rcu_qsbr_thread_offline()
  • Calls rte_epoll_wait() up to NAPI_SETTLE_TRIES (3) times with a NAPI_SETTLE_MS (100 ms) timeout, then blocks indefinitely with a timeout of -1
  • Re-enters RCU QSBR after waking
  • Drains any pending wakeup eventfd writes (used for reconfig/shutdown kicks)
  • Disarms only the queues it actually armed

Blocked duration is attributed to sleep_cycles (not busy_cycles) and the n_sleeps counter.
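
The idle-detection and blocking sequence above can be sketched with plain Linux epoll, leaving DPDK out so that the example compiles on its own. This is an illustration under stated assumptions, not the code from the PR: `idle_tracker`, `idle_window_done()` and `settle_then_block()` are hypothetical names, the constants only mirror the values quoted above, and treating each settle timeout as a simple retry of the wait is an assumed reading of the settle phase.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/epoll.h>

/* Values quoted in the description above; the names are illustrative. */
#define EMPTY_WINDOWS 2
#define SETTLE_TRIES 3
#define SETTLE_MS 100

/* Hypothetical per-worker state: consecutive windows without packets. */
struct idle_tracker {
	unsigned empty_windows;
};

/* Call once per housekeeping window. Returns true when the worker has been
 * idle for EMPTY_WINDOWS consecutive windows and may block. */
static bool idle_window_done(struct idle_tracker *t, unsigned pkts_in_window) {
	if (pkts_in_window > 0) {
		t->empty_windows = 0; /* any traffic restarts the count */
		return false;
	}
	if (++t->empty_windows < EMPTY_WINDOWS)
		return false;
	t->empty_windows = 0;
	return true;
}

/* Call after the wakeup sources are armed. queued_after_arm is the result of
 * the recheck: packets that slipped in before arming mean we must not block.
 * final_timeout_ms is -1 to block indefinitely, as described above.
 * Returns the number of ready events (0 if the recheck found work or the
 * final wait timed out), or -errno on failure. */
static int settle_then_block(int epfd, unsigned queued_after_arm, int final_timeout_ms) {
	struct epoll_event ev;
	int n;

	if (queued_after_arm > 0)
		return 0;

	/* Short bounded waits first (assumed meaning of the settle phase). */
	for (int i = 0; i < SETTLE_TRIES; i++) {
		n = epoll_wait(epfd, &ev, 1, SETTLE_MS);
		if (n > 0)
			return n;
		if (n < 0 && errno != EINTR)
			return -errno;
	}

	/* Nothing arrived during the settle phase: do the long wait. */
	do {
		n = epoll_wait(epfd, &ev, 1, final_timeout_ms);
	} while (n < 0 && errno == EINTR);

	return n < 0 ? -errno : n;
}
```

A caller would feed `idle_window_done()` the packet count of each housekeeping window and, once it returns true, arm its wakeup sources, recheck the queues, and pass the recheck result to `settle_then_block()`. The RCU offline/online steps and the disarm path from the list above are deliberately left out of this sketch.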

CPU Clock Management

When NAPI is enabled, worker_perf_floor() calls sched_setattr() with SCHED_FLAG_UTIL_CLAMP_MIN and a minimum clamp of SCHED_CAPACITY_SCALE (1024), the maximum, so that the CPU stays at full clock while the worker is runnable and RX-interrupt wakeups remain responsive. The governor lowers the frequency only when the worker actually sleeps. Kernels without uclamp support and insufficient privileges are handled gracefully: EOPNOTSUPP, ENOSYS, EPERM, and EINVAL are logged at NOTICE level; any other error is logged as WARNING.
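
A minimal sketch of such a uclamp floor, using the raw syscall so that it does not depend on a libc wrapper. The struct name `uclamp_sched_attr`, the `FLAG_*` macros and both function names are hypothetical; the field layout and flag values follow the kernel UAPI header `linux/sched.h`. Whether the PR also passes the keep-policy/keep-params flags is an assumption made here so that only the clamp changes.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the kernel's struct sched_attr (56-byte layout including
 * the uclamp fields). Named differently on purpose so it cannot clash with
 * a libc that already defines struct sched_attr. */
struct uclamp_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Values of SCHED_FLAG_KEEP_POLICY, SCHED_FLAG_KEEP_PARAMS and
 * SCHED_FLAG_UTIL_CLAMP_MIN from linux/sched.h. */
#define FLAG_KEEP_POLICY 0x08
#define FLAG_KEEP_PARAMS 0x10
#define FLAG_UTIL_CLAMP_MIN 0x20
#define CAPACITY_SCALE 1024 /* SCHED_CAPACITY_SCALE */

/* Raise the calling thread's uclamp_min to full capacity.
 * Returns 0 on success or the errno value on failure. */
static int perf_floor_set(void) {
	struct uclamp_sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	/* Assumption: keep policy and priority untouched, change only the clamp. */
	attr.sched_flags = FLAG_KEEP_POLICY | FLAG_KEEP_PARAMS | FLAG_UTIL_CLAMP_MIN;
	attr.sched_util_min = CAPACITY_SCALE;

	/* pid 0 targets the calling thread. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0) < 0)
		return errno;
	return 0;
}

/* Errors the description treats as expected (NOTICE rather than WARNING). */
static bool perf_floor_errno_expected(int err) {
	return err == EOPNOTSUPP || err == ENOSYS || err == EPERM || err == EINVAL;
}
```

A worker would call `perf_floor_set()` once at startup and pick the log level from `perf_floor_errno_expected()` when the result is non-zero.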

Worker Wakeup Mechanism

Added eventfd-based signaling for cross-worker interrupts (e.g., reconfig/shutdown kicks): worker_wakeup() writes a uint64_t value to the worker's wakeup_fd under the existing wakeup mutex, allowing workers blocked in rte_epoll_wait() to be awakened. Write errors are logged unless EAGAIN (treated as an already-pending, undrained kick).
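
The kick/drain pattern described here can be illustrated with a plain eventfd and epoll set. This is only a sketch: `wakeup_fd_create()`, `wakeup_kick()` and `wakeup_drain()` are hypothetical names, the wakeup mutex mentioned above is omitted, and plain `epoll_wait()` stands in for `rte_epoll_wait()`.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create a non-blocking wakeup eventfd and add it to the epoll set.
 * Returns the eventfd, or -errno on failure. */
static int wakeup_fd_create(int epfd) {
	struct epoll_event ev = {.events = EPOLLIN};
	int fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
	int err;

	if (fd < 0)
		return -errno;
	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}

/* Kick a (possibly blocked) worker. EAGAIN means the counter is already
 * saturated with undrained kicks, so a wakeup is pending anyway. */
static int wakeup_kick(int fd) {
	uint64_t one = 1;

	if (write(fd, &one, sizeof(one)) < 0 && errno != EAGAIN)
		return -errno;
	return 0;
}

/* Drain pending kicks after waking. A single read resets the counter and
 * returns how many kicks were coalesced; 0 means nothing was pending. */
static uint64_t wakeup_drain(int fd) {
	uint64_t count = 0;

	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

Because the eventfd is level-triggered in the epoll set, a kick that lands before the worker blocks is not lost: the next wait returns immediately, and the drain clears it.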

Configuration Changes

  • New --napi command-line flag enables the mode and forces poll_mode
  • New boolean field napi added to struct gr_config
  • Port configuration conditionally enables RX-queue interrupts when gr_config.napi is set
  • Build-time feature detection for sched_setattr support via HAVE_SCHED_SETATTR flag

coderabbitai Bot commented Jun 9, 2026

📝 Walkthrough

This PR introduces NAPI (New API) support, an adaptive interrupt-driven receive mode that complements the existing polling approach. A new -n/--napi command-line flag enables the feature and automatically activates poll mode. The datapath worker loop gains interrupt-aware idle logic: after consecutive idle windows, it arms RX queue interrupts, blocks on per-thread epoll waiting for events while taking QSBR threads offline, then disables interrupts and returns. The port configuration conditionally enables RX queue interrupts, and a CPU utilization floor is optionally set at worker startup.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/infra/datapath/main_loop.c`:
- Around line 224-227: The code unconditionally calls vec_add(*registered, *qm)
after attempting to register the queue with
rte_eth_dev_rx_intr_ctl_q(qm->port_id, qm->queue_id, RTE_EPOLL_PER_THREAD,
RTE_INTR_EVENT_ADD, NULL); so a failing registration still marks the queue as
registered and later gets skipped in napi_wait(). Change this to capture the
return value of rte_eth_dev_rx_intr_ctl_q, check for success (e.g., ret == 0),
only call vec_add(*registered, *qm) on success, and handle/log the failure path
(using process logging or similar) and do not mark the queue as registered when
the call fails; reference symbols: rte_eth_dev_rx_intr_ctl_q, vec_add,
registered, qm, napi_wait.
ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between e850e7a and 5d8ba94.

📒 Files selected for processing (4)
  • main/config.h
  • main/main.c
  • modules/infra/control/port.c
  • modules/infra/datapath/main_loop.c

Comment thread modules/infra/datapath/main_loop.c Outdated

@vjardin vjardin left a comment

Contributor

sched_setattr() shall be used from glibc.

Comment thread modules/infra/datapath/main_loop.c Outdated
// utilization and downclocks the core even at line rate. Pin uclamp_min to the
// max capacity: the governor runs the core at full speed while the worker is
// runnable and lets it drop only when it actually sleeps on the interrupt.
// glibc exposes neither struct sched_attr nor a sched_setattr() wrapper.

@MortenBroerup

Contributor

Disclaimer: I haven't looked at the patch in detail, so my feedback is high-level only.

Have you measured the RX interrupt wakeup latency? In other words: Are user-space interrupts fast enough for general use, or only for low-traffic hours?
AFAIK, sleep(), usleep() and nanosleep() all end up in the nanosleep() syscall.
With the default timerslack of 50 µs, nanosleep() wakeup latency exceeds 50 µs.
With timerslack 1 - prctl(PR_SET_TIMERSLACK, 1, 0, 0, 0) - I have seen nanosleep() wakeup latency around 2.5 µs on hardware, and 15-20 µs on a VMware virtual appliance.
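
For reference, a small sketch of the timer slack adjustment mentioned here. The helper names are hypothetical; the `prctl()` options are the standard Linux ones, the value is expressed in nanoseconds, and it applies to the calling thread only.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Set the calling thread's timer slack in nanoseconds. A value of 0 resets
 * it to the thread's default. Returns 0 or -errno. */
static int timer_slack_set(unsigned long ns) {
	if (prctl(PR_SET_TIMERSLACK, ns, 0, 0, 0) < 0)
		return -errno;
	return 0;
}

/* Read back the calling thread's current timer slack in nanoseconds,
 * or -errno on failure. */
static long timer_slack_get(void) {
	int ns = prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);

	return ns < 0 ? -errno : ns;
}
```

Calling `timer_slack_set(1)` at thread start corresponds to the `prctl(PR_SET_TIMERSLACK, 1, 0, 0, 0)` call quoted above.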

Another detail:
We considered using RX interrupts in the SmartShare too, and I like the concept.
You should consider that other events may want to trigger graph wakeup too...
E.g. a host-originating packet (e.g. a CLI output via SSH).
Or a timer wheel or other polling-based dataplane timer triggering some high-frequency event, e.g. for calling rte_sched_port_dequeue().

Add an opt-in --napi mode where an idle worker arms the interrupts on
its rx queues and blocks on them through the generic
rte_eth_dev_rx_intr_* / rte_epoll_wait API instead of busy-polling. A
packet wakes the worker, which disarms and resumes polling: the usual
poll/interrupt hybrid, with the interrupt acting only as a doorbell
since frames are still pulled by the graph walk.

A worker blocks only after staying idle for NAPI_EMPTY_WINDOWS
housekeeping windows with all of its queues empty, so a single busy
queue keeps it polling. --napi implies poll-mode and replaces the
micro-sleep ramp with the interrupt block. As that block can last up
to a second it is measured explicitly and the timestamp advanced past
it, keeping the sleep in total_cycles but out of busy_cycles.

napi_wait() tracks the queues it actually armed and disarms them
through a single exit path, so a queue without interrupt support does
not leave its predecessors armed, and only marks a queue
epoll-registered once. A PMD without rx queue interrupt support keeps
polling.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>

In --napi mode an idle worker blocks on the rxq interrupt, so the
schedutil governor sees a low utilization and downclocks the core even
when it later runs at line rate. Pin the worker's uclamp_min to the max
capacity through sched_setattr(): the governor keeps the core at full
speed while the worker is runnable and lets it drop only when it
actually sleeps on the interrupt. glibc exposes neither struct
sched_attr nor a sched_setattr() wrapper, so both are declared locally.

The syscall fails on kernels without uclamp support or without the
privilege to set it, which would otherwise warn on every worker. Report
the expected EOPNOTSUPP/ENOSYS/EPERM/EINVAL cases at NOTICE and keep
WARNING for anything unexpected.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/infra/control/worker.c`:
- Around line 50-54: The issue is that attr (used with pthread_attr_destroy) may
be uninitialized if eventfd() fails and the function jumps to end; to fix,
ensure pthread_attr_t attr is initialized before any early goto that can skip
pthread_attr_init or rearrange the control flow so pthread_attr_destroy is only
called when pthread_attr_init succeeded: either move the
pthread_attr_init(&attr) before the eventfd() call (so attr is always
initialized) or add a boolean/flag (e.g., attr_initialized) set after
pthread_attr_init and check it before calling pthread_attr_destroy(&attr);
update references in this function where worker->wakeup_fd, eventfd,
pthread_attr_init, pthread_attr_destroy, and attr are used.
ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 5d8ba94 and a5d9ec9.

📒 Files selected for processing (7)
  • main/config.h
  • main/main.c
  • meson.build
  • modules/infra/control/port.c
  • modules/infra/control/worker.c
  • modules/infra/control/worker.h
  • modules/infra/datapath/main_loop.c
✅ Files skipped from review due to trivial changes (1)
  • main/config.h
🚧 Files skipped from review as they are similar to previous changes (3)
  • main/main.c
  • modules/infra/control/port.c
  • modules/infra/datapath/main_loop.c

Comment on lines +50 to +54
worker->wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
if (worker->wakeup_fd < 0) {
ret = errno;
goto end;
}

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Undefined behavior: pthread_attr_destroy called on uninitialized attr when eventfd() fails.

If eventfd() fails, the goto end at line 53 jumps past pthread_attr_init() at line 58. Then pthread_attr_destroy(&attr) at line 77 is called on an uninitialized stack variable.

Proposed fix: initialize attr before eventfd or restructure gotos
+	pthread_attr_init(&attr);
+
 	worker->wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
 	if (worker->wakeup_fd < 0) {
 		ret = errno;
 		goto end;
 	}

 	CPU_ZERO(&cpuset);
 	CPU_SET(cpu_id, &cpuset);
-	pthread_attr_init(&attr);
 	if (!!(ret = pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset)))
@maxime-leroy

Collaborator Author

Thanks @MortenBroerup, both points are spot on. Let me split them.

Wakeup latency / timerslack

The timerslack concern doesn't apply to the napi path: an idle worker doesn't
sleep on a timer, it blocks on rte_epoll_wait() on the rx queue interrupt
eventfd (VFIO/UIO), so the wakeup is interrupt-driven and timerslack never
enters the picture. timerslack only widens the timeout part of a sleep/poll,
not the delivery of a real fd event. (For completeness, the existing
sleep-based poll mode already lowers timerslack to 1us via
PR_SET_TIMERSLACK at worker start, so it isn't stuck at the 50us default
either.)

So the dominant latency term isn't the scheduler, it's the PMD's interrupt
coalescing. The IRQ fires on the first of "N frames queued" or "holdoff timer
expires". On DPAA2 the default is threshold = 7 frames, holdoff = 100us. Under
any real load the frame threshold trips first and the holdoff never matters;
the holdoff only adds latency in the trickle case (1-2 frames then silence),
i.e. exactly the near-idle regime this mode targets. Worst-case single-packet
wakeup at trickle is therefore bounded by the holdoff (~100us on DPAA2,
tunable via DPAA2_PORTAL_INTR_TIMEOUT).
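
A back-of-the-envelope model of that "first of threshold or holdoff" rule, as a sketch only. `irq_delay_ns()` is a hypothetical helper, it assumes frames arrive at a fixed interval, and it assumes the holdoff timer starts when the first frame is queued, which is not confirmed for any particular PMD.

```c
#include <stdint.h>

/* Delay from the first queued frame to the interrupt, in nanoseconds.
 * The interrupt fires at whichever comes first: the threshold-th frame
 * arriving, or the holdoff timer expiring. */
static uint64_t irq_delay_ns(uint32_t threshold_frames, uint64_t holdoff_ns, uint64_t interval_ns) {
	uint64_t fill_ns;

	if (threshold_frames <= 1)
		return 0; /* the first frame already meets the threshold */

	/* Time for the remaining (threshold - 1) frames to arrive. */
	fill_ns = (uint64_t)(threshold_frames - 1) * interval_ns;
	return fill_ns < holdoff_ns ? fill_ns : holdoff_ns;
}
```

With the DPAA2 defaults quoted above (7 frames, 100 us), a steady 1 Mpps stream (1 us between frames) gives `irq_delay_ns(7, 100000, 1000)`, i.e. 6 us, while one frame per millisecond gives `irq_delay_ns(7, 100000, 1000000)`, i.e. the full 100 us holdoff.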

To answer the underlying question directly: this is opt-in (--napi) and aimed
at cutting power/clocks during quiet periods, not at minimum-latency max-pps
forwarding. It's "general use during low/moderate load", and you'd leave it off
on a latency-critical fast path. I haven't done a rigorous latency sweep across
PMDs yet; if that would help I can gather numbers.

Other wakeup sources

Agreed this is the important design point. Two cases:

Control-plane events already work: the worker's own wakeup eventfd is in the
same per-thread epoll set, so a reconfig or shutdown breaks the block
immediately instead of waiting for a packet. That's the generic kick path, so
anything host-originated that needs to poke a sleeping worker can reuse it.

Datapath timers: today grout's datapath is purely packet-driven, there's no
timer wheel, no rte_sched, nothing time-driven that a sleeping worker would
starve, so blocking until the next packet is safe as-is. The day a periodic
fast-path event shows up (QoS dequeue / shaping via rte_sched_port_dequeue(),
TX pacing, datapath aging), an indefinite block would indeed starve it, and the
fix is the standard one: add a timerfd to the same epoll set (or cap the
epoll_wait timeout to the next deadline) so the worker wakes on a packet or a
deadline. Same shape as the wakeup eventfd that's already wired in, so the
extension point is there when it's needed.

3 participants