viona: support (enough) `VIRTIO_NET_F_CTRL_RX` to leave all-multicast mode by FelixMcFelix · Pull Request #1125 · oxidecomputer/propolis

FelixMcFelix · 2026-04-24T10:02:14Z

This PR adds support for the VIRTIO_NET_F_CTRL_RX capability, which allows guests to request specific MAC address filters in addition to their primary MAC, and to explicitly inform the host whether the device should be shifted into promiscuous mode.

The hope is that this allows us to take a guest out of all-multicast promiscuous mode in viona itself, which generates extra per-packet work (checking "is this packet multicast?") and contention in the Tx and Rx paths as both reach for and refcount the same list of promisc callbacks on the client.

Testing in falcon makes it look as though Debian will just immediately insert multicast entries, although I don't know if that's purely an artefact of my environment. Using one of the helios-dev images appears to correctly move out of promiscuous mode and keep the filter tables empty.

Closes #1122. Closes #1127.

Closes #1122.

Well, that was an adventure. Closes 1127.

FelixMcFelix · 2026-04-29T09:51:14Z

I've tested this so far on Alpine Linux 3.23.3, debian 13.2, and the helios-2.8 image we use for a4x2 testing. The first two are multiqueue-capable, whereas illumos is not -- control requests appear to be correctly served on queue 22 for the former and queue 2 for the latter.

The typical initialisation sequence looks to be that guests will send explicit promisc off signals for alongside every MAC filter update. snoop -rd vioif0 will also have illumos explicitly move us in and out of promiscuous mode.

Running on top of OPTE and/or in omicron standalone doesn't really stop Linux from asking for multicast MACs (i.e., they're not a result of anything coming in link-local). So for most guests we will need to get explicit MAC filter support into viona proper, or they'll just put themselves back into all-multicast mode. illumos, at least, keeps the tables empty. On Debian, I'm seeing:

Got MAC list [
    MacAddr([33, 33, 0, 0, 0, 1]), MacAddr([1, 0, 5e, 0, 0, 1]), MacAddr([1, 80, c2, 0, 0, 0]),
    MacAddr([1, 80, c2, 0, 0, 3]), MacAddr([1, 80, c2, 0, 0, e]), MacAddr([33, 33, ff, e7, 9c, e4]),
    MacAddr([33, 33, 0, 1, 0, 3]), MacAddr([1, 0, 5e, 0, 0, fc]), MacAddr([33, 33, ff, 0, 6, da])
]

With visible group subscriptions on:

IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      224.0.0.1
enp0s6          1      224.0.0.252
enp0s6          1      224.0.0.1
lo              1      ff02::1
lo              1      ff01::1
enp0s6          1      ff02::1:ff00:6da
enp0s6          1      ff02::1:3
enp0s6          2      ff02::1:ffe7:9ce4
enp0s6          2      ff02::1
enp0s6          1      ff01::1

So the MAC table can be decoded as:

IPv6(ff01::1 [All Nodes, Node Local], ff02::1 [All Nodes, Link Local]),
IPv4(224.0.0.1 [All Systems on Subnet]),
STP/LLDP(additional),
LLDP(additional),
LLDP(primary)/PTP,
IPv6(ff02::1:ffe7:9ce4 [solicited node addr]),
IPv6(ff02::1:3 [LLMNR]),
IPv4(224.0.0.252 [LLMNR]),
IPv6(ff02::1:ff00:6da [solicited node addr])

ip link set multicast off dev enp0s6 still leaves a bunch of active MACs including registered multicast addresses for all-/solicited-node, but those are sort of intrinsic to the stack. I think I'm happy at least that we're at parity with how cbhyve uses this machinery in pci_viona_eval_promisc.

FelixMcFelix · 2026-04-29T11:40:39Z

-        if let Some(ctlq) = queues.get(2) {
-            ctlq.set_control();
-        }


I don't know what effect putting receiveq2 in this state would have, but we never unset this flag when the control queue index updated. So in theory this would have affected ring_init/reset/kick/... for this queue?

yeah... among other things we'd entirely fumble the queue state across migration. pretty bad, thanks for noticing this.

FelixMcFelix · 2026-04-29T11:45:26Z

+        let input: migrate::VionaStateV1 = offer.take()?;
+
+        let mut state = self.inner.lock().unwrap();
+        state.filter = FilterState::from_bits_retain(input.filter);
+        state.unicast_mac_filters = input.unicast_mac_filters.into();
+        state.unicast_mac_filters = input.multicast_mac_filters.into();
+        self.set_promisc(input.promisc, &mut state).map_err(|_| {
+            MigrateStateError::ImportFailed(format!(
+                "Could not move device promisc level to {:?}.",
+                input.promisc
+            ))
+        })


I'm pretty unfamiliar with the propolis migration machinery, hopefully this does the job?

iximeow

sorry for the slow turnaround on this review! a couple of actual issues, a couple of me-trying-to-update-my-mental-model :)

iximeow · 2026-05-04T19:41:59Z

 }

+bitflags! {
+    /// Packet receive filters requested by a virtio NIC driver.


would it be right to add here something like: "filter configuration is only a binary state, and the device is a composition of those filter configurations. Note that VirtIO RX commands only operate on individual filters using consecutive integers and correspond to bit numbers here only for simplicity"?

I've expanded this in 33aa6b2, let me know if that suffices.

iximeow · 2026-05-04T20:38:05Z

+            .ok()
+            .and_then(|i| self.virtio_state.queues.get(i))
+            .map(|v| v.set_control())
+            .ok_or_else(|| ())?;


on one hand, this is all well within its rights to be expectful: the queue index should be in-range from above so try_into and get should both always succeed.

on the other.. queues.get() has a special-case for max_capacity() - 1 in which case it returns get_control() getting us a "control" queue which we have not actually made the control queue yet? kinda odd!

leaning further into the XXX: in VirtQueues::get maybe we should have set_len() update some queues-internal state that says who is currently the control queue (and unsets is_control if there was a previous one). then we avoid this whole "VirtQueues and virtio-nic have their own ideas of control queues" problem.

comparing with other VirtIO devices it seems like in the longer-run the question of "is this a control queue" needs to be answered wholly by the device impl, and that it's not really coherent as a generic property (some devices have 0 queues, some have two, most of the time they're the first/second except here, ... etc). so keeping this in VirtQueues is an OK step until we have other control-queue-having VirtIO devices?

I've tried to come up with something that works here. VirtQueues will always need additional information about what indices the control queues are going to be, otherwise we're baking in a heavily NIC-specific assumption across all devices. And it's a really weird one too, being that it needs to swap between len() + 1 and max_capacity() + 1. I've set up an additional method for a device to tell VirtQueue what it considers as control devices -- let me know if what I have now works?

iximeow · 2026-05-04T20:46:27Z

+        let need_mcast = state.filter.contains(FilterState::ALL_MULTICAST)
+            || !state.multicast_mac_filters.is_empty();


please bear with my rusty networking brain: if the filter has ALL_MULTICAST | NO_MULTICAST then do we still need_mcast? I'm not sure how broadcast fits in, is really the question. it seems more obvious to me that we shouldn't need_mcast if ALL_MULTICAST | NO_MULTICAST | NO_BROADCAST. in any case, having viona more promiscuous than required is not a correctness issue iiuc, so we don't need to worry about these cases?

(if that's right, imo worth a note on fn needed_promisc and I'm not worried about not filtering as much as silly combinations of filters could permit)

Hmm, part of this is that a bunch of the features in VIRTIO_NET_F_CTRL_RX_EXTRA can't be expressed in the viona ioctl interface at all, so I didn't want them to complicate the logic there when we can only support the subset you've pointed out. I'll add a comment explaining why I'm not accounting for that block of features yet.

iximeow · 2026-05-04T20:58:29Z

+        match self.set_promisc(self.needed_promisc(&state), &mut state) {
+            Ok(()) => Ok(()),
+            Err(()) => {
+                state.unicast_mac_filters = unicast;
+                state.multicast_mac_filters = multicast;
+                Err(())
+            }


I feel like I'm missing.. two things. if set_promisc errors, do we in general know anything about the state of viona? I think we would want that to result in the device having the status DEVICE_NEEDS_RESET, but right now this only results in an Ack::Err. I don't know if this is consistent with what cbhyve or other virtio-nic devices do.

the other thing being: shouldn't I expect to see somewhere we communicate these filters to viona? or no, and this is all fine because there's no guarantee these are the only packets a guest would see, just the hope that the NIC/VNIC have prevented a guest from having to see them?

I think I agree with you here, I'll change this over to mark the device as needing reset iff. we cannot rollback. cbyhve doesn't actually store the filters, and from what I can see it simply logs a warning then returns VIRTIO_NET_CQ_ERR (Ack::Err).

On the second point: yeah, we do want to do that, but viona can't handle that quite yet. I need to actually scope out what the right ioctl shape would be and have viona able to plumb it down on its MAC client. That will be a necessary part of actually making the promisc handler go away on the majority of guests. The spec does allow us to do this (virtio v1.4, §5.1.9.5.2), and it's the strategy used by cbhyve:

The device can filter incoming packets by any number of destination MAC addresses [3]

The device SHOULD drop incoming packets which have a destination MAC which matches neither the mac
(or that set with VIRTIO_NET_CTRL_MAC_ADDR_SET) nor the MAC filtering table.

[3] Since there are no guarantees, it can use a hash filter or silently switch to allmulti or promiscuous mode if it is given too many addresses.

Unfortunately 'too many addresses' is... one. But we're at least ensuring that they see that traffic! Nevermind that OPTE is unerringly strict about the MACs it will send and receive.

The first part should be covered in 33aa6b2. I'll open a stlouis ticket on the latter.

Opened https://github.com/oxidecomputer/stlouis/issues/986.

iximeow · 2026-05-04T21:14:33Z

+/// The index of the control queue when multiqueue ([`VIRTIO_NET_F_MQ`]) has
+/// not been negotiated.
+///
+/// In this case, the driver will behave as though we have allocated only one
+/// Rx/Tx queue pair, followed by the control queue.
+pub const VIRTIO_NO_MQ_CTRL_Q_INDEX: usize = 2;


fwiw part of the problem with picking 2 here on our own is that VirtQueues::iter() will still use get_control() to tack on a control queue at the end, so reset_queues_locked() will reset queues 0, 1, and propolis-VirtQueue-22 and I think leave queue 2 as it was before the reset. that's what has me thinking about keeping the notion of control-ness totally internal to VirtQueues for now, rather than splitting it between there and virtio/viona.rs.

I've tried to go for a mechanism where we provide a list of control queue indices separately from the main number of queues, and adapt iter/iter_all to account for the cases where our control queues might come after a gap. My thinking here is that the other device families look sorta... arbitrary in where they put their control queue(s), but viona in particular may not have a control queue, and it may be directly after the in-use queues or at the very end of the maximum. I think it's right that PciVirtioViona is the one to make that call, given it depends wholly on the negotiated features.

FelixMcFelix · 2026-05-27T17:29:34Z

-            .queues
-            .set_len(npairs * 2 + 1)
-            .expect("num queue pairs");
+        self.virtio_state.queues.set_len(npairs * 2).expect("num queue pairs");


I'm a little torn on something here and in the concept of len as used between VirtQueues and PciVirtioViona, and would really like some more input. Conceptually it is right that we have 2 * n_pairs + 1 active queue pairs if we've negotiated a control queue. But VirtQueues will treat the first len entries as being valid -- if we get a set_use_pairs(2) against a maximum max_pairs = 11 we'll mistakenly treat queue 5 as enabled, and iter/iter_all will return queues [0, 5] ∪ { 22 }. So I've picked n_pairs * 2 to at least return the right queues in the iterators.

But then both before and after this change, queues.len() and queues.iter().len() are different quantities. Is that something we care about? Do we need to document this difference, or store two counts in the VQ list?

viona: support (enough) VIRTIO_NET_F_CTRL_RX to remove promisc

0357b54

Closes #1122.

FelixMcFelix changed the title ~~viona: support (enough) VIRTIO_NET_F_CTRL_RX to remove promisc~~ viona: support (enough) VIRTIO_NET_F_CTRL_RX to leave all-multicast mode Apr 24, 2026

FelixMcFelix added 2 commits April 28, 2026 11:47

Actually get it all working.

a2e3b11

Well, that was an adventure. Closes 1127.

Actually move filter tables and associated state during migration.

0d57644

FelixMcFelix marked this pull request as ready for review April 29, 2026 11:28

FelixMcFelix requested review from dancrossnyc and iximeow April 29, 2026 11:28

FelixMcFelix commented Apr 29, 2026

View reviewed changes

Merge branch 'master' into felixmcfelix/viona-guest-promisc

3e58d8a

iximeow reviewed May 4, 2026

View reviewed changes

Comment thread lib/propolis/src/hw/virtio/viona.rs

FelixMcFelix added 6 commits May 7, 2026 17:38

Merge branch 'master' into felixmcfelix/viona-guest-promisc

6d38cc6

Merge branch 'master' into felixmcfelix/viona-guest-promisc

82b3147

Review feedback.

33aa6b2

Clippy..

fc86ff3

Debugging...

5cc7af5

Got it.

ed23f4b

FelixMcFelix commented May 27, 2026

View reviewed changes

Ensure we unset the control queue if we re-init without CTRL_VQ

36cb757

		let need_mcast = state.filter.contains(FilterState::ALL_MULTICAST)
		\|\| !state.multicast_mac_filters.is_empty();

Conversation

FelixMcFelix commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FelixMcFelix commented Apr 29, 2026

Uh oh!

FelixMcFelix Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iximeow left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

FelixMcFelix May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

FelixMcFelix May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FelixMcFelix commented Apr 24, 2026 •

edited

Loading

FelixMcFelix Apr 29, 2026 •

edited

Loading

FelixMcFelix May 27, 2026 •

edited

Loading

FelixMcFelix May 27, 2026 •

edited

Loading