
Priority updates #920

Draft
dvush wants to merge 13 commits into develop from priority_updates

Conversation

Contributor

dvush commented Apr 27, 2026

📝 Summary

WIP

💡 Motivation and Context


✅ I have completed the following steps:

  • Run make lint
  • Run make test
  • Added tests (if applicable)

Contributor

claude Bot commented Apr 27, 2026

Claude finished @dvush's task in 7m 39s


Claude PR Review

  • Read priority_update files
  • Review PU integration in builders
  • Review live_builder PU classifier wiring
  • Post inline comments

Summary

The new priority-update pipeline is plumbed through ingress → classifier → PUR sim → per-worker overlays → builder commits in a way that mostly hangs together, but a few real correctness issues showed up. Inline comments are posted on the specific lines.

Bugs

  1. Conflict resolver drops PU profit. parallel_builder/conflict_resolvers.rs only adds commit_result.order's coinbase profit into total_profit; the commit_result.priority_updates profits are ignored. This biases group selection in the assembler and makes simulation_cache::CachedSimulationState::total_profit inconsistent with the cached BundleState.

  2. Non-deterministic PU commit order. PendingStateDb::into_used_pu_slots returns HashMap::into_values(), so the slot order — and therefore the order in which PUs are committed in front of a regular order — varies between runs/threads. Two builders/resolvers can produce different state and total_profit for the same group of orders. Sort by OrderId (or use a BTreeMap).

  3. RoutingSink::remove_order always returns None for PU pool removals, even when something was evicted. Either return the evicted PU or document the trait change.

  4. PUR Insert send error swallowed. PURCommandClassifier returns true and removes the order from the main pipeline regardless of whether the PUR worker is alive. If run_pur_sim_worker returns early (e.g. parent state lookup fails), every PU-classified order is silently dropped.

Other issues / smells

  1. PU back-pressure can stall the PUR thread. apply_update does sub.send(cmd).await over a bounded 10k channel for every subscriber. One slow sim worker (holding pool_arc.lock() during simulate_order) can stall PU classification and cancellation for everyone. Consider try_send + drop counter.

  2. PUs don't compose during simulation. simulate_priority_update runs against parent state with no PU overlay, so the pu_data.changeset stored in PendingUpdates reflects on-chain values, not values another already-pending PU would have produced. At commit time the actual sequence may diverge from this simulated changeset, which is also what's used for conflict eviction. Document or fix.

  3. Per-worker Arc<Mutex<PriorityUpdatePool>> is uncontended by construction in sim_worker.rs, but the lock is held across the whole simulate_order call. Document the invariant or drop the Mutex.

  4. Memory: PUData::changeset is held twice — once in PendingUpdates::orders and once in SimulatedOrder.pu_data. Wrap in Arc or move out.

  5. apply_update's error! on pu_data == None is unreachable by every current caller (they pre-check). Make it a debug_assert! or invariant comment.

Pre-existing

  • block_building_result_assembler.rs still has the panic!("TODO: next_journal_sequence_number not set ...") and a hardcoded 0 in BlockBuildingHelperFromProvider::new. This predates the PR but build_new_block cannot succeed at runtime — worth fixing alongside the new force-TOB plumbing it now contains.

The PR description says WIP and the PR checklist (make lint, make test, tests added) is unchecked, so some of these may be known. Flagging them here so they aren't lost.

Comment on the commit_order call site in parallel_builder/conflict_resolvers.rs

Bug: PU profit is not accumulated into total_profit.

partial_block.commit_order may execute one or more priority-update orders in front of sim_order, and each successful PU contributes coinbase profit (see commit_order_inner in building/mod.rs, where self.coinbase_profit += priority_update.coinbase_profit). Below, only res.coinbase_profit (the main order) feeds handle_successful_commit's total_profit. The PU profits in commit_result.priority_updates are dropped on the floor.

Consequences:

  • ResolutionResult::total_profit underestimates each group's profit, biasing the assembler's "sort groups by total_profit" against groups whose orders trigger PUs.
  • simulation_cache::CachedSimulationState::total_profit becomes inconsistent with the cached BundleState, so resumed runs will produce different totals depending on whether they hit the cache.

Sum the Ok entries in commit_result.priority_updates into total_profit (and probably into per_order_profits as well, with the PU's OrderId).
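
A minimal sketch of that accumulation; OrderId, PuResult, CommitResult and the u128 profit type are placeholders standing in for the project's actual types, only the loop over the Ok entries is the point:

```rust
// Placeholder types standing in for the project's OrderId / commit result types.
type OrderId = u64;

struct PuResult {
    order_id: OrderId,
    coinbase_profit: u128,
}

struct CommitResult {
    // Profit of the main order; this is what the current code already sums.
    order_profit: u128,
    // Results of the priority updates executed in front of the order.
    priority_updates: Vec<Result<PuResult, String>>,
}

fn accumulate_profit(
    commit_result: &CommitResult,
    total_profit: &mut u128,
    per_order_profits: &mut Vec<(OrderId, u128)>,
) {
    *total_profit += commit_result.order_profit;
    // Suggested fix: also count every successfully committed PU,
    // attributed to the PU's own OrderId.
    for pu in commit_result.priority_updates.iter().flatten() {
        *total_profit += pu.coinbase_profit;
        per_order_profits.push((pu.order_id, pu.coinbase_profit));
    }
}
```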

Comment on lines +99 to +101
pub fn into_used_pu_slots(self) -> Vec<SlotKey> {
self.used_pu_slots.into_values().collect()
}

Non-determinism in priority-update commit order.

used_pu_slots is a HashMap<OrderId, SlotKey>, and into_values() returns its values in unspecified order. The result is fed via simulated_order.used_priority_updates into PriorityUpdatePool::get_updates, which preserves the input slot order when building its returned Vec<&Order>. So the order in which priority updates are committed in front of a regular order is non-deterministic between runs / between threads.

Two builders or two parallel resolvers can produce different state for the same group of orders depending on hash randomization, which feeds straight into total_profit and the cached BundleState. Sort the result (e.g. by OrderId) before returning, or store in a BTreeMap, to make commit order deterministic.
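
A minimal sketch of the sort-before-return option, with placeholder OrderId/SlotKey types standing in for the real ones:

```rust
use std::collections::HashMap;

// Placeholders for the project's OrderId / SlotKey.
type OrderId = u64;
type SlotKey = (u64, u64);

fn into_used_pu_slots(used_pu_slots: HashMap<OrderId, SlotKey>) -> Vec<SlotKey> {
    let mut entries: Vec<(OrderId, SlotKey)> = used_pu_slots.into_iter().collect();
    // Deterministic commit order: sort by the OrderId that produced the slot.
    entries.sort_by_key(|(id, _)| *id);
    entries.into_iter().map(|(_, slot)| slot).collect()
}
```

Switching used_pu_slots to a BTreeMap<OrderId, SlotKey> gives the same ordering without the explicit sort.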

Comment on lines +79 to +82
fn remove_order(&mut self, id: OrderId) -> Option<Arc<SimulatedOrder>> {
self.pool.apply_remove(&id);
self.store.remove_order(id)
}

PriorityUpdatePool::apply_remove returns nothing, and store.remove_order(id) returns None for any id that lives only in the PU pool. So RoutingSink::remove_order always returns None for PU-pool removals, even when the order was actually evicted. Callers using the return value (e.g. OrderIntakeConsumer::remove_orders chains in ordering_builder.rs) cannot detect that a PU was removed.

If the trait contract is "return Some(order) iff something was removed", track and return the evicted PU; otherwise document that PU removals are silent.
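
A hedged sketch of the first option, assuming apply_remove is changed to return the evicted PU (today it returns nothing); the other names mirror the snippet above:

```rust
fn remove_order(&mut self, id: OrderId) -> Option<Arc<SimulatedOrder>> {
    // Hypothetical: apply_remove now reports what it evicted from the PU pool.
    let evicted_pu = self.pool.apply_remove(&id);
    // Return whichever side actually removed something, so callers can tell
    // that a PU-pool order was evicted.
    self.store.remove_order(id).or(evicted_pu)
}
```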

Comment on lines +60 to +85
impl PURCommandClassifier {
pub fn try_consuming_new_order_command(&self, cmd: &OrderPoolCommand) -> bool {
match cmd {
OrderPoolCommand::Insert(order) => {
if order.metadata().priority_update_data.is_none() {
return false;
}
self.inner.tracked_orders.lock().insert(order.id());
let _ = self
.inner
.cmd_sender
.send(OrderPoolCommand::Insert(Arc::clone(order)));
true
}
OrderPoolCommand::Remove(id) => {
let known = self.inner.tracked_orders.lock().remove(id);
if known {
let _ = self.inner.cmd_sender.send(OrderPoolCommand::Remove(*id));
true
} else {
false
}
}
}
}
}

A few classification edge cases to consider:

  1. try_consuming_new_order_command ignores send errors with let _ = ...send(...). For an unbounded_channel the only failure is "receiver dropped" (i.e. the PUR worker has exited). The classifier still returns true, swallowing the order from the main pipeline — so when the PUR thread dies (e.g. parent state lookup failed, see run_pur_sim_worker early-return), every PU-classified order is silently dropped instead of falling back to the regular pipeline. Worth at least a metric / warn (see the sketch after this list).

  2. Remove only forwards if the id was previously seen by Insert (tracked_orders membership). If a Remove arrives for a PU-classified id from a source that classifies differently than this side did (e.g. classifier was added/removed, or the classification rules changed), the cancellation gets routed to the regular pipeline, where it will be a no-op — and the PUR pool still holds the order.

  3. Holding a parking_lot::Mutex is overkill if try_consuming_new_order_command is only ever called from SimulationJob::process_new_commands (a single async task). Consider documenting the threading expectation, or using a RefCell if not, or removing the lock entirely.
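
On the first point, a small fallback sketch; forward_insert and the warning text are hypothetical names, but the send error on a tokio unbounded channel really does mean the receiver was dropped:

```rust
use tokio::sync::mpsc::UnboundedSender;
use tracing::warn;

// Returning false tells the classifier's caller to keep the order in the
// regular pipeline instead of silently dropping it.
fn forward_insert<T>(cmd_sender: &UnboundedSender<T>, cmd: T) -> bool {
    match cmd_sender.send(cmd) {
        Ok(()) => true,
        Err(_) => {
            // Only failure mode: the PUR worker's receiver was dropped,
            // i.e. run_pur_sim_worker has exited.
            warn!("PUR worker channel closed; not consuming PU order");
            false
        }
    }
}
```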

Comment on lines +63 to +75
let mut changeset = bundle_state.to_plain_state(OriginalValuesKnown::Yes);

if !changeset.contracts.is_empty() {
debug!(
reason = "changeset contains contracts",
"priority update discarded"
);
return Ok(None);
}

changeset
.storage
.retain(|s| s.address != coinbase && !senders.contains(&s.address));

Two correctness questions worth a comment in the code:

  1. The PU is simulated against the parent state with no PU overlay. So PUs are not composed — if PUA writes slot X and PUB reads slot X to decide what to write, PUB's simulation here uses the on-chain value, not PUA's overlay. At commit time (partial_block.commit_order), they may execute against a different sequence of slot values, potentially diverging from the simulated changeset that's stored in pu_data.changeset and used for conflict eviction in PendingUpdates::add_new_simulated_update. Worth being explicit about this assumption.

  2. changeset.storage.retain(|s| s.address != coinbase && !senders.contains(&s.address)) filters by sender/coinbase address, but a PU bundle may include several txs whose internal calls touch storage on contracts that happen to live at a sender's address (rare but possible). More importantly, the inverse: writes the PU performs to contract storage are kept, which is what we want — so this is fine. Just make sure the filter is intentional re: senders specifically.

Comment on lines +47 to +68
pub fn apply_update(&mut self, sim_order: Arc<SimulatedOrder>) -> Vec<OrderId> {
let Some(pu_data) = sim_order.pu_data.clone() else {
error!(order_id = ?sim_order.id(), "apply_update called with non-PU simulated order");
return Vec::new();
};
let order_id = sim_order.id();
let evicted = self
.pending
.add_new_simulated_update(order_id, pu_data.changeset);
for id in &evicted {
self.orders.remove(id);
self.force_top_of_block.remove(id);
}
if matches!(
sim_order.order.metadata().priority_update_data,
Some(PriorityUpdateClass::ForceTopOfBlock)
) {
self.force_top_of_block
.insert(order_id, Arc::clone(&sim_order));
}
self.orders.insert(order_id, sim_order);
evicted

Minor:

  • sim_order.pu_data.clone() clones the entire Vec<PlainStorageChangeset>, then it's moved into pending in add_new_simulated_update. Meanwhile the same changeset is also kept inside sim_order (which is then Arc::cloned into self.orders). So every PU stores its changeset twice. With many PUs this is non-trivial memory; consider wrapping PUData::changeset in Arc<...> or moving it out of SimulatedOrder entirely once it's been "consumed" into PendingUpdates (see the Arc sketch after this list).

  • error! on pu_data being None is reasonable, but the only callers (PUSimWorkerOrderpool::consume_updates and apply_update from the PUR worker) already check sim_order.pu_data.is_some(). So this branch should be unreachable — make it a debug_assert! or document the invariant.
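
On the first bullet, a minimal sketch of the Arc-sharing shape; the Changeset alias is a placeholder for the real PlainStorageChangeset list:

```rust
use std::sync::Arc;

// Placeholder for the real changeset type held in PUData.
type Changeset = Vec<u8>;

struct PUData {
    // One allocation, shared by PendingUpdates and SimulatedOrder, instead of
    // each side holding a full copy of the changeset.
    changeset: Arc<Changeset>,
}

fn main() {
    let data = PUData { changeset: Arc::new(vec![1, 2, 3]) };
    // Both consumers clone the Arc handle, not the underlying Vec.
    let for_pending_updates = Arc::clone(&data.changeset);
    let for_sim_order = Arc::clone(&data.changeset);
    assert_eq!(for_pending_updates.len(), for_sim_order.len());
}
```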

Comment on lines +152 to +166
impl PUSimulationWorkerState {
async fn apply_update(&self, sim_order: Arc<SimulatedOrder>) -> Vec<OrderId> {
// Sync critical section: mutate pool, prune closed subs, snapshot subs.
let (evicted, subs) = {
let mut g = self.inner.lock();
let evicted = g.pool.apply_update(Arc::clone(&sim_order));
g.subscribers.retain(|s| !s.is_closed());
(evicted, g.subscribers.clone())
};
let cmd = SimulatedOrderCommand::Simulation(sim_order);
for sub in subs {
let _ = sub.send(cmd.clone()).await;
}
evicted
}

apply_update clones g.subscribers while holding the lock, then awaits sub.send(cmd.clone()).await for each subscriber outside the lock. The subscriber channel is bounded (PU_SUBSCRIBER_CHANNEL_CAPACITY = 10_000), and send().await blocks the PUR worker's task if any subscriber is full.

If one sim worker stalls (e.g. an expensive simulate_order call holding pool_arc.lock() so consume_updates() can't drain), the PUR thread back-pressures all PU classification, including the Cancellation path. That stops new PU updates from reaching even healthy sim workers and the assembler/builder.

Consider try_send with a counter on drops, or break out the slow-subscriber detection so a single stuck worker doesn't take down the whole PU pipeline. At minimum, log when a send actually has to wait.
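
A hedged sketch of the try_send variant; fan_out and dropped_updates are hypothetical names, while the tokio try_send error variants are real:

```rust
use tokio::sync::mpsc::{error::TrySendError, Sender};

fn fan_out<T: Clone>(subscribers: &[Sender<T>], cmd: &T, dropped_updates: &mut u64) {
    for sub in subscribers {
        match sub.try_send(cmd.clone()) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => {
                // Subscriber's bounded channel is full: count the drop and move
                // on instead of awaiting and stalling the PUR worker.
                *dropped_updates += 1;
            }
            Err(TrySendError::Closed(_)) => {
                // Closed subscribers get pruned on the next apply_update call.
            }
        }
    }
}
```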

Comment on lines +73 to +92
pu_orderpool.consume_updates();

let order_id = task.order.id();
let start_time = Instant::now();
let mut block_state = BlockState::new_arc(state_provider.clone());
let sim_result = simulate_order(
task.parents.clone(),
task.order,
&current_sim_context.block_ctx,
&mut local_ctx,
&mut block_state,
);
let sim_result = {
let cached = CachedDB::new(
parent_state.clone(),
current_sim_context.block_ctx.shared_cached_reads.clone(),
);
let pool_arc = pu_orderpool.pool();
let pool_guard = pool_arc.lock();
simulate_order(
task.parents.clone(),
task.order,
&current_sim_context.block_ctx,
&mut local_ctx,
&pool_guard,
cached,
)
};

pool_arc.lock() is held for the entire duration of simulate_order (which can run a full bundle EVM). consume_updates() also locks the same mutex. Today this is fine because each worker has its own per-worker PUSimWorkerOrderpool and is single-threaded — the lock is uncontended.

But if a future refactor shares the worker pool across tasks, or if consume_updates ever runs concurrently (e.g. driven from a fan-out task), this serializes everything on the slowest sim. Worth either documenting "this lock is uncontended by construction" or removing the Mutex indirection entirely (the worker owns the pool).

dvush force-pushed the priority_updates branch from 13f45ec to 19c887b on April 28, 2026 16:30
Contributor

claude Bot commented Apr 28, 2026

Claude finished @dvush's task in 5m 53s


Claude PR Review (update)

  • Check what changed since last review
  • Review priority_update modules
  • Check if previous review's bugs are still present
  • Review new force-TOB code
  • Post inline comments

Summary

This update reviews commits since the last review (d81389d..19c887b), which add force-TOB classification (PriorityUpdateClass::ForceTopOfBlock), a top-of-block commit pass, and a "skip PU if used slot is overwritten in bundle" guard. Most of the previous review still applies — flagging which findings persist and what's new from the recent commits.

New issues introduced by the latest commits

  1. build_new_block panics regardless of new code. block_building_result_assembler.rs:298 still has panic!("TODO: next_journal_sequence_number not set ...") before the Ok(...) return, so the new force-TOB plumbing added at lines 200–218 is dead at runtime in the live path. Inline comment posted. The same code in build_backtest_block is exercisable, the live path isn't. (Pre-existing TODO, but the PR is now adding correctness-relevant logic into a function that cannot complete.)

  2. Force-TOB orders may be double-executed via PU overlay. commit_force_top_of_block_orders in block_building_helper.rs:56 iterates force-TOB orders and calls commit_order on each. commit_order_inner then pulls in PUs from priority_update_pool.get_updates, which can include other force-TOB orders (they live in the same pool). If A's used_priority_updates references B's slots, A's commit pre-executes B; the loop then attempts B at top level, which fails (nonce reused) and traces success=false. Inline comment posted.

  3. Resolver/live state divergence around force-TOB. Conflict resolution computes total_profit and a cached BundleState against parent state with no force-TOB pre-committed; build_new_block commits force-TOB first and then applies the resolved groups. The cached state may not match live state at the boundary, biasing the assembler's profit-sort and risking divergence in SharedSimulationCache. Inline comment posted.

  4. slot_overwritten_in_bundle (priority_update/mod.rs:120) edge cases. Relies on StorageSlot::is_changed() which is present != original; a write that restores the slot to its original value will report unchanged, so the PU is re-applied on top. Selfdestruct/wipe is also not specially handled. Inline comment posted.

Previous review's findings — still present (not flagged inline this run; see prior review for context)

  • PU profit dropped from total_profit in parallel_builder/conflict_resolvers.rs:226 (only res.coinbase_profit is summed; commit_result.priority_updates profits are ignored).
  • Non-deterministic PU commit order via PendingStateDb::into_used_pu_slots returning HashMap::into_values().
  • RoutingSink::remove_order returns None for PU-pool removals even when the PU was evicted.
  • PURCommandClassifier::try_consuming_new_order_command swallows Insert send errors with let _ = ...send(...). If run_pur_sim_worker early-returns (e.g. parent state lookup fails), every PU-classified order is silently dropped.
  • PUSimulationWorkerState::apply_update does sub.send(cmd).await on a bounded 10k channel for each subscriber — one stalled sim worker back-pressures the entire PUR pipeline (including cancellations).
  • simulate_priority_update runs PUs against parent state with no PU overlay, so PUs don't compose; the stored pu_data.changeset may not match what the same PU produces at commit time.
  • Per-worker Arc<Mutex<PriorityUpdatePool>> lock is held across simulate_order (full EVM run) — uncontended today, but a footgun for any future fan-out.
  • PUData::changeset is held twice (PendingUpdates::orders + SimulatedOrder.pu_data).
  • apply_update's error! on pu_data == None is unreachable from current callers.

The PR is still marked WIP and the author's checklist (lint/test/added tests) is unchecked, so a few of these may already be on the to-do list.

Comment on lines +200 to +218
for (sim_order, commit_result, order_commit_time) in block_building_helper
.commit_force_top_of_block_orders(
&mut self.local_ctx,
&self.priority_update_pool.read(),
)?
{
let (success, gas_used, execution_error) = match commit_result.order {
Ok(res) => (true, res.space_used.gas, None),
Err(err) => (false, 0, Some(err)),
};
trace!(
order_id = ?sim_order.id(),
success,
order_commit_time_mus = order_commit_time.as_micros(),
gas_used,
?execution_error,
"Executed force-TOB order"
);
}

build_new_block still panics at line 298 (panic!("TODO: next_journal_sequence_number not set ...")), so this new force-TOB block is dead code at runtime. The function is called from the live build path and aborts the builder thread before the Ok(...) return. The same block added in build_backtest_block is exercisable, but the live path isn't. Either fix the TODO in this PR, or hold this plumbing until the underlying BlockBuildingHelperFromProvider::new issue is resolved — right now the new logic added here cannot be exercised live and isn't covered by the test suite either.

Comment on lines +56 to +71
fn commit_force_top_of_block_orders(
&mut self,
local_ctx: &mut ThreadBlockBuildingContext,
priority_update_pool: &PriorityUpdatePool,
) -> Result<Vec<(Arc<SimulatedOrder>, OrderCommitResult, Duration)>, CriticalCommitOrderError>
{
let force_orders = priority_update_pool.force_top_of_block_orders();
let mut results = Vec::with_capacity(force_orders.len());
for sim_order in force_orders {
let start = Instant::now();
let commit_result =
self.commit_order(local_ctx, &sim_order, priority_update_pool, &|_| Ok(()))?;
results.push((sim_order, commit_result, start.elapsed()));
}
Ok(results)
}

Force-TOB orders may be double-executed (and counted-then-failed) when they read each other's slots.

This loop iterates force-TOB orders sorted by OrderId and commits each via commit_order, which goes through commit_order_inner and calls priority_update_pool.get_updates(state, &order.used_priority_updates). used_priority_updates records every PU slot the order's simulation read — which includes slots written by other force-TOB PUs (they live in the same pool).

So if force-TOB order A is committed first and its used_priority_updates references a slot owned by force-TOB order B:

  1. commit_order for A will pre-execute B as a PU sub-step (consuming B's nonces, contributing B's profit to coinbase_profit).
  2. The loop then reaches B and calls commit_order(B, ...). B's top-level execution fails (nonce too low / already-included), and the trace logs success = false for B.

Net effect: B's profit is captured (good) but the visible status is misleading, and any downstream check on the per-order OrderCommitResult will treat B as failed.

Also: the regular-PU side of the same call may pull in non-force-TOB PUs, which were not intended to run at TOB. If that's the design, a comment makes it explicit; otherwise consider a dedicated commit_order_no_pu_overlay to prevent the cascade.

Comment on lines +77 to +85
/// Orders that must be committed at the top of every built block, sorted
/// by [`OrderId`] for deterministic inclusion order across builders. The
/// builder iterates this list once at the start of `build_block` and
/// commits each before the regular order loop runs.
pub fn force_top_of_block_orders(&self) -> Vec<Arc<SimulatedOrder>> {
let mut orders: Vec<_> = self.force_top_of_block.values().cloned().collect();
orders.sort_by_key(|sim| sim.id());
orders
}

Two follow-ups on the new force-TOB bucket:

  1. The conflict resolver (conflict_resolvers.rs::run_conflict_task) builds each group's total_profit against parent state with no force-TOB orders pre-committed. Then build_new_block / build_backtest_block commit force-TOB at the very top before appending the resolved groups. If a group order's used_priority_updates references a slot owned by a force-TOB order, the resolver still pre-executes that PU via commit_order_inner, so the value the order observes matches between resolver and live build — but the actual gas / refund metering during the resolver's PU sub-step could differ from the live slot_overwritten_in_bundle path (which now skips that PU). The cached total_profit and BundleState in simulation_cache::CachedSimulationState will then be inconsistent with the post-force-TOB live state. Worth either pre-committing force-TOB inside the resolver too, or invalidating the simulation cache when the force-TOB set changes.

  2. Sorting force-TOB by OrderId is deterministic, but OrderId order is essentially arbitrary from the user's perspective. If two orders both want the genuine top of block, a user can't predict who wins. If a tiebreaker exists in product semantics (e.g. who paid more, who arrived first), that should drive ordering rather than the hash.

Comment on lines +120 to +130
fn slot_overwritten_in_bundle<DB>(state: &BlockState<DB>, slot: &SlotKey) -> bool {
let Some(account) = state.bundle_state().state.get(&slot.address) else {
return false;
};
let key = U256::from_be_bytes(slot.key.0);
account
.storage
.get(&key)
.map(|s| s.is_changed())
.unwrap_or(false)
}

slot_overwritten_in_bundle will return false for slots an earlier order deleted/zeroed in this build.

StorageSlot::is_changed() returns present_value != original_value. If a previously-committed order writes a slot from 42 back to 42 (or the underlying revm bundle representation reports it as unchanged), this returns false — and the PU is then re-applied on top, even though the bundle state has effectively touched that slot already.

More importantly, this only handles the storage dimension. The PU's read may be of an account that another order already destroyed (selfdestruct → wipe_storage on the bundle). In that case account.storage doesn't contain the key; the function returns false and the PU is re-attempted.

These are edge cases, but the PR description ("skip PU if used slot is overwritten") implies safety here — worth a unit test covering: (a) revert-to-original write, (b) selfdestruct of the contract, (c) account never-touched-but-PU-pool-has-overlay (which works correctly, returns false, PU runs).
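
For the selfdestruct case, a hedged sketch of one possible tightening, mirroring the snippet above; it assumes revm's BundleAccount exposes was_destroyed(), and the revert-to-original case would still need a separate decision:

```rust
fn slot_overwritten_in_bundle<DB>(state: &BlockState<DB>, slot: &SlotKey) -> bool {
    let Some(account) = state.bundle_state().state.get(&slot.address) else {
        return false;
    };
    // Assumption: an account wiped by selfdestruct counts as "overwritten" so
    // the PU is not re-applied on top of storage that no longer exists.
    if account.was_destroyed() {
        return true;
    }
    let key = U256::from_be_bytes(slot.key.0);
    account
        .storage
        .get(&key)
        .map(|s| s.is_changed())
        .unwrap_or(false)
}
```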
