Enable prefetching of SW kernel instructions after the first SW task by Kepontry · Pull Request #199 · openvinotoolkit/npu_compiler

Kepontry · 2025-11-07T08:18:23Z

Summary

This PR enhances the AddSwKernelInstructionPrefetchPass to enable prefetching of SHAVE kernel instructions after the first SHAVE task, if the initial slack is insufficient.

Currently, instruction prefetching is skipped if the slack before the first SHAVE task is insufficient. This limitation is suboptimal when initial insertion slots (tiles) are limited or L2 cache capacity is constrained.

Based on the observation that SHAVE utilization is often low, we propose this change to prefetch opportunistically later in the schedule. This approach has demonstrated a ~3% performance gain on models such as Qwen2-1.5b and Qwen3-0.6b.

Target Platform For Release Notes

NPU37XX
NPU40XX
NONE (Not included in release notes)

Classification of this Pull Request

Maintenance
BUG
Feature

Implementation Details

The new logic searches for insertion gaps that begin at a "non-saturated" point (where num_shave_tasks < available_shave_count).
The gap ends at either a "saturated" point or the kernel designated for prefetching.
The prefetch operation is inserted at the tile3 task of the identified insertion point.
The minimal insertion slack required is set to 50K cycles.
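
For readers following along, the gap search described above can be sketched roughly as follows. This is a simplified illustration and not the pass's actual code: `TaskSlot`, `Gap`, and `findBestGap` are hypothetical names, and the schedule is assumed to be a flat list of time slots sorted by start cycle. Only the non-saturated/saturated rule and the 50K-cycle minimum come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified schedule: one entry per point in time at which
// the number of running SHAVE tasks changes, sorted by startCycle.
struct TaskSlot {
    std::uint64_t startCycle;   // cycle at which this slot begins
    std::size_t numShaveTasks;  // SHAVE tasks running during the slot
    bool isTargetKernel;        // true if the kernel to be prefetched starts here
};

struct Gap {
    bool found;                 // false if no gap meets the minimum slack
    std::size_t insertionSlot;  // index of the non-saturated slot where the gap begins
    std::uint64_t cycles;       // gap length in cycles
};

// Minimal insertion slack stated in the PR description.
const std::uint64_t kMinInsertionSlack = 50000;

// Returns the largest gap that begins at a non-saturated slot and ends at the
// next saturated slot or at the target kernel, whichever comes first.
Gap findBestGap(const std::vector<TaskSlot>& slots, std::size_t availableShaveCount) {
    assert(availableShaveCount > 0);
    Gap best = {false, 0, 0};
    for (std::size_t i = 0; i < slots.size(); ++i) {
        if (slots[i].isTargetKernel) {
            break;  // no point in prefetching once the target kernel has started
        }
        if (slots[i].numShaveTasks >= availableShaveCount) {
            continue;  // saturated: a gap cannot begin here
        }
        for (std::size_t j = i + 1; j < slots.size(); ++j) {
            const bool saturated = slots[j].numShaveTasks >= availableShaveCount;
            if (!saturated && !slots[j].isTargetKernel) {
                continue;  // gap is still open
            }
            const std::uint64_t cycles = slots[j].startCycle - slots[i].startCycle;
            if (cycles >= kMinInsertionSlack && cycles > best.cycles) {
                best = Gap{true, i, cycles};
            }
            break;  // the gap that started at slot i is closed
        }
    }
    return best;
}
```

With two available SHAVEs and slots starting at cycles 0 (saturated), 10000, 100000 (saturated), 120000, and a target kernel at 300000, this sketch picks the slot at 120000, since its 180000-cycle gap is the largest one above the minimum.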

Additional Fixes & Enhancements

Corrected an issue in insertDummyKernelOpBeforeFirstKernelTask where the clusterIdx was not being used during tile assignment.
Expanded Prefetching: Enriched the "kernel kind" logic to allow more types of kernels to be prefetched.

We also noted that the previous 250K-cycle threshold is overly conservative for our platform (Ultra 258V). Our analysis shows that prefetching provides benefits even with a slack as low as 170K cycles.

DariaMityagina · 2025-11-07T08:49:00Z

@Kepontry hello! Thanks for your PR!

Could you please ensure your changes include test coverage by adding tests to https://github.com/openvinotoolkit/npu_compiler/blob/develop/tests/lit/NPU/dialect/VPUIP/passes/add_sw_kernel_instruction_prefetch_40XX.mlir and maybe some functional tests?

DariaMityagina · 2025-11-10T13:24:18Z

@Kepontry could you please look into this comment? Thank you!

Kepontry · 2025-11-19T03:19:52Z

Apologies for the delay; I missed the email notification for this thread. I am currently working on adding the test. Could you provide some guidance or documentation on how to use the lit test framework within the NPU compiler?

Kepontry · 2025-11-19T16:11:35Z

Functional test added.

Maxim-Doronin · 2025-11-24T14:10:17Z

Hi @Kepontry! Please adhere to the clang-format guidelines. You will find the automatically fixed code style in the job logs: https://github.com/openvinotoolkit/npu_compiler/actions/runs/19637113134/job/56230572067?pr=199

I also noticed that some LIT tests failed. Could you please verify if these failures are due to your changes?
https://github.com/openvinotoolkit/npu_compiler/actions/runs/19632108154/job/56223833881

cc @DariaMityagina

Kepontry · 2025-11-24T17:29:11Z

Hi @Maxim-Doronin , the failure of LIT test is caused by the DummySWKernelsForInstructionPrefetchReservedMemory not being found. I can reproduce this error by setting the minimum-shave-start-time-for-prefetch threshold to 5 in the default_hw_mode_40XX test. So the problem exists before this PR. I suspect that the createSWKernelInstructionPrefetchReserveMemForDummyKernelsPass function in the VPU pipeline is not called, but I am not entirely sure. I would appreciate your help verifying this.

vpux-opt --split-input-file --init-compiler="vpu-arch=NPU40XX compilation-mode=DefaultHW allow-custom-values=true" --mlir-elide-elementsattrs-if-larger 8 --default-hw-mode-vpuip="function-outlining='naive'" --add-sw-kernel-instruction-prefetch="minimum-shave-start-time-for-prefetch=5" default_hw_mode_40XX.mlir

DariaMityagina · 2025-11-25T13:44:39Z

Hello @Kepontry!

Thanks for adding the tests and sharing your findings regarding pre-commit failures!
I'll check them locally and get back to you.

DariaMityagina · 2025-11-27T14:03:06Z

@Kepontry I managed to reproduce the issue:

vpux-opt --split-input-file --init-compiler="vpu-arch=NPU40XX compilation-mode=DefaultHW allow-custom-values=true" --mlir-elide-elementsattrs-if-larger 8 --default-hw-mode-vpuip="function-outlining='naive'" --add-sw-kernel-instruction-prefetch="minimum-shave-start-time-for-prefetch=5" default_hw_mode_40XX.mlir

->

Cannot find DummySWKernelsForInstructionPrefetchReservedMemory!

Will research it a bit and get back!

In the meantime, could you please share with us why you set this particular value?
--add-sw-kernel-instruction-prefetch="minimum-shave-start-time-for-prefetch=5"

Kepontry · 2025-11-28T02:51:31Z

Hi, @DariaMityagina , thanks for your assistance regarding this issue. Since this PR enables prefetching regardless of the first SHAVE task's start time threshold, it exposed some existing bugs in certain test cases. These bugs were previously hidden because there wasn't enough time slack to trigger the prefetch logic. I adjusted the threshold to simulate a scenario that forces prefetch insertion, confirming that these test cases fail without this PR.

Kepontry · 2025-12-09T07:20:46Z

Hi @DariaMityagina ,

I have fixed the failed tests. The root cause was that setDummySwKernelsForInstructionPrefetchReservedMemory is normally invoked during the VPU pipeline. Since these tests target the VPUIP pipeline in isolation, the required memory attribute was missing from the input module.

I fixed this by manually adding the DummySWKernelsForInstructionPrefetchReservedMemory resource to the MLIR files, following the pattern in tests in the tests/lit/NPU/dialect/VPUIP/passes directory. I have also resolved the clang-format issues.

DariaMityagina · 2025-12-15T06:08:50Z

Thanks a lot for the updates!
Let's wait for the precommit results. In the meantime, we'll do another round of reviews.

Kepontry · 2025-12-15T13:52:57Z

The failed log indicates an Assertion addr + size <= _totalSize failed. However, in my local environment, I ran the following command and the test runs successfully.

vpux-opt --split-input-file --init-compiler="vpu-arch=NPU37XX compilation-mode=DefaultHW allow-custom-values=true" --mlir-elide-elementsattrs-if-larger 8 --default-hw-mode-vpuip="function-outlining='naive'" default_hw_mode_repeating_blocks.mlir | FileCheck default_hw_mode_repeating_blocks.mlir

I suspect this issue is related to the recently inserted MLIR code.

    config.Resources {activity_factor = 0.078934384661980161 : f64} 2 of @NCE at 1.700000e+03 MHz {
        builtin.module @ReservedMemory {
            module @DummySWKernelsForInstructionPrefetchReservedMemory {
                config.MemoryResource 8 bytes of @CMX_NN offset 1474552
            }
        }
        config.MemoryResource 1326182 bytes of @CMX_NN_FragmentationAware
        config.MemoryResource 1473536 bytes of @CMX_NN {config.bandwidth = 64 : i64, config.derateFactor = 1.000000e+00 : f64}
        config.ExecutorResource 2 of @SHAVE_ACT
        config.ExecutorResource 1 of @DPU
    }

I am currently uncertain whether the root cause involves the specific values (1326182 or 1473536) or the reserved memory allocation. Since the code runs on both NPU37XX and NPU40XX, I made modifications according to the implementation found in feasible_allocation.mlir.

npu_compiler/tests/lit/NPU/dialect/VPUIP/passes/feasible_allocation.mlir

Lines 123 to 129 in 9f8730d

    
    config.Resources 1 of @NCE at 1.300000e+03 MHz {
        builtin.module @ReservedMemory {
            module @DmaProfilingReservedMemory {
                config.MemoryResource 512 bytes of @CMX_NN offset 0
            }
        }
    }

I hope this resolves the issue. Alternatively, could you provide the scripts necessary to reproduce this experiment in the CI environment?

Kepontry · 2025-12-15T15:18:38Z

Fixed the failing tests on NPU37XX by adjusting the offset of the reserved memory.

Similar changes were applied to the NPU40XX tests.
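
As an illustration of why the offset may matter, the condition named in the failed assertion can be checked by hand against the values from the MLIR snippet in the preceding comment. This sketch assumes that `_totalSize` corresponds to the 1473536-byte @CMX_NN resource, which the thread does not confirm; `fitsInMemory` is a hypothetical helper and not a compiler function.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the condition in the failed assertion:
// a reserved region [addr, addr + size) must end within the memory.
constexpr bool fitsInMemory(std::uint64_t addr, std::uint64_t size, std::uint64_t totalSize) {
    return addr + size <= totalSize;
}

// Values taken from the config.Resources snippet in the preceding comment.
constexpr std::uint64_t kCmxSize = 1473536;         // bytes of @CMX_NN (assumed to be _totalSize)
constexpr std::uint64_t kReservedOffset = 1474552;  // offset of the dummy-kernel reserved region
constexpr std::uint64_t kReservedSize = 8;          // size of that region in bytes

// 1474552 + 8 = 1474560 > 1473536: the region would end past the declared size.
static_assert(!fitsInMemory(kReservedOffset, kReservedSize, kCmxSize),
              "original offset ends past the declared CMX size");

// Any offset up to kCmxSize - kReservedSize (1473528) satisfies the condition.
static_assert(fitsInMemory(kCmxSize - kReservedSize, kReservedSize, kCmxSize),
              "a region placed at the very end of CMX fits");
```

Under that assumption the original offset violates the bound by 1024 bytes. This does not explain why add_sw_kernel_instruction_prefetch_40XX.mlir passes with the same offset.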

Note: I'm currently uncertain why the add_sw_kernel_instruction_prefetch_40XX.mlir test is passing, as it uses the same offset as the failed ones.

Kepontry · 2025-12-17T09:07:19Z

Hi @DariaMityagina , the prefetching for the TopK kernel was still problematic due to its complexity, so I decided to remove it for now. Local tests are passing. I also fixed a segfault in the logging logic.

Kepontry · 2025-12-17T16:06:20Z

Hi @DariaMityagina , since TopK is not supported for prefetching now, I replaced it with a Convert kernel in the failing test (add_sw_kernel_instruction_prefetch_mid_execution_40XX.mlir). All tests should pass now. Thanks for your patience.

DariaMityagina · 2025-12-18T12:21:06Z

Hello @Kepontry! Great! Thank you!

Kepontry · 2025-12-18T16:24:07Z

Hi, @DariaMityagina , I refactored the code as requested. I moved the variables into the class and added the comments. Thanks for pointing these out.

DariaMityagina · 2025-12-22T09:08:28Z

Hello @Kepontry! Thank you!
We'll perform additional tests to verify the changes and get back to you shortly.

DariaMityagina · 2025-12-25T08:44:34Z

Hello @Kepontry! Apologies for the delay. Due to the holiday season, the next review will be postponed a little. In the meantime, our validation process has identified some issues with the changes in this PR. I'll analyze these issues and share my findings here.

Kepontry · 2026-01-06T11:30:00Z

Hi @DariaMityagina , Happy New Year!

I wanted to follow up on this PR. You mentioned there were some validation issues identified earlier—could you please share the details/logs when you get a chance? I’d like to address those fixes so we can proceed with the review.

DariaMityagina · 2026-01-08T06:51:39Z

Hello @Kepontry!

Merry Christmas and Happy New Year 🎄

Here is the problem found in the logs:

loc(fused<{name = "main", type = "Func"}>["main"]): error: AddSwKernelInstructionPrefetch Pass failed : 
Task queue map not initialized for executor of task <task_id>
L0 pfnCreate2 result: ZE_RESULT_ERROR_INVALID_ARGUMENT, code 0x78000004 - generic error code for invalid arguments . [NPU_VCL] Compiler returned msg:

Compilation failed

Let me find a model you can use for debugging.

DariaMityagina · 2026-01-09T10:51:57Z

@Kepontry hi! You can try to reproduce the issue using this model:
https://huggingface.co/Intel/whisper-small-openvino/blob/main/whisper_small/whisper_small_decoder_static_kvcache_224_lm_QKs.bin
https://huggingface.co/Intel/whisper-small-openvino/raw/main/whisper_small/whisper_small_decoder_static_kvcache_224_lm_QKs.xml

./compile_tool -m whisper_small_decoder_static_kvcache_224_lm_QKs.xml -d NPU.4000

->

[ERROR] 10:51:04.527 [vpux-compiler] Got Diagnostic at loc(fused<{name = "main", type = "Func"}>["main"]) : AddSwKernelInstructionPrefetch Pass failed
src/vpux_compiler/src/core/barrier_info.cpp:1528 Task queue map not initialized for executor of task 2
loc(fused<{name = "main", type = "Func"}>["main"]): error: AddSwKernelInstructionPrefetch Pass failed
src/vpux_compiler/src/core/barrier_info.cpp:1528 Task queue map not initialized for executor of task 2
[ERROR] 10:51:04.528 [vpux-compiler] Failed Pass AddSwKernelInstructionPrefetch on Operation loc(fused<{name = "main", type = "Func"}>["main"])

DariaMityagina · 2026-01-13T07:01:25Z

@Kepontry hello!

When I tested your branch directly, the issue didn't occur. Let me dig deeper to figure out what caused it to show up in our validation environment.

liyihao-1ntel · 2026-01-13T08:46:07Z

+        size_t dynamicExecTile = _dynamicPrefetchTileCounter % numClusters;
+        _dynamicPrefetchTileCounter++;
+
+        auto newPrefetchKernel = insertDummyKernelOpBeforeFirstKernelTask(insertBeforeOp, mlir::ValueRange(),


newPrefetchKernel is created with empty update barriers in this case, and insertDummyKernelOpBeforeFirstKernelTask leaves its wait barriers empty too. 🤔
I am not sure these tasks will be scheduled in the slots we expect. I suppose tasks without wait barriers are executed at the very beginning? @DariaMityagina, can someone confirm this?
If so, the insert function will need a proper wait barrier (and maybe an update barrier) instead of empty ones for this use case.

Reply from Kepontry: You are correct that the update and wait barriers are empty. However, according to my observations from hardware SHAVE profiling, the dummy instruction prefetch op is executed at the inserted non-saturated position. If the dummy op were inserted at a saturated position, the original SHAVE task would be postponed or even scheduled to another SHAVE unit. But if you confirm that any scheduling behavior is not as expected, adding the barriers is also fine with me.

liyihao-1ntel · 2026-01-13T08:59:31Z

Hi @Kepontry ! Thank you for your great contribution! I left some comments. Please take a look when you have time 😄

Kepontry · 2026-01-13T16:40:18Z

Hi @liyihao-1ntel ! Thank you for the valuable feedback. I have addressed your comments in the latest update. Please let me know if you have any further questions.

liyihao-1ntel

Discussion about gap finding:

In this PR, the gap size is computed as:
gapSize = (targetTaskStart or saturationTaskGroupStart) - insertionTaskStart
With this definition the insertion task is still executing during part of the gap, which might leave limited cache space for the dummy kernel.
Would gapSize = (targetTaskStart or saturationTaskGroupStart) - insertionTaskEnd make more sense?
Question: If multiple target SHAVE tasks start at the same time, how will the insertion function behave in theory? Will multiple dummy kernel tasks be inserted?
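
To make the two gap definitions discussed here concrete, a minimal sketch follows. `TaskTiming`, `gapFromStart`, and `gapFromEnd` are hypothetical names introduced only for this comparison, and the cycle values in the note below are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical timing record for the task at the insertion point.
struct TaskTiming {
    std::uint64_t startCycle;
    std::uint64_t endCycle;
};

// Definition used in the PR: the gap is measured from the START of the
// insertion task to the cycle at which the gap closes.
std::uint64_t gapFromStart(const TaskTiming& insertionTask, std::uint64_t gapEndCycle) {
    assert(insertionTask.startCycle <= insertionTask.endCycle);
    return gapEndCycle > insertionTask.startCycle ? gapEndCycle - insertionTask.startCycle : 0;
}

// Alternative raised in review: the gap is measured from the END of the
// insertion task, so the time the task itself is running is excluded.
std::uint64_t gapFromEnd(const TaskTiming& insertionTask, std::uint64_t gapEndCycle) {
    assert(insertionTask.startCycle <= insertionTask.endCycle);
    return gapEndCycle > insertionTask.endCycle ? gapEndCycle - insertionTask.endCycle : 0;
}
```

For an insertion task running from cycle 100000 to 160000 and a gap that closes at cycle 200000, the start-based definition yields 100000 cycles while the end-based one yields 40000. With a 50K-cycle minimum, only the first would qualify for insertion, so the end-based definition is the more conservative of the two.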

liyihao-1ntel · 2026-01-14T03:25:53Z

+    size_t _dynamicPrefetchTileCounter = 0;
+    // Using Tile 1 as the target for insertion to enable prefetching only when the available tile count is larger
+    // than 1.
+    int64_t _targetInsertTileDuringExec = 1;


Could you elaborate in comments why we pick a specific tile _targetInsertTileDuringExec here?

Reply from Kepontry: This variable is used solely as a reference for gap finding, not for the actual insertion (which, as mentioned, follows a round-robin strategy). I agree the name is slightly misleading, so I plan to rename it.

We chose a specific tile (Tile 1) for two reasons:

When multiple kernels with the same operator execute concurrently (e.g., across Tiles 0-3), the schedule is symmetric. We don't need to calculate the gap for every tile; checking one representative tile is sufficient.

We selected Index 1 (instead of 0) to ensure instruction prefetching is enabled only when the kernel spans at least two tiles, which provides more insertion slots and yields better performance gains.

Kepontry · 2026-01-14T08:05:59Z

While the dummy kernel prefetches instructions to the L1, the primary performance gains actually come from L2 hits by other Shave units. Based on my observations, the 256KB L2 cache is sufficient to hold instructions for multiple kernels (often more than 10), so contention is rarely an issue.
The current gap calculation (targetTaskStart - insertionTaskStart) is intended to reflect the maximum available execution window. While changing this to insertionTaskEnd is acceptable, I am concerned it might be too conservative and cause us to miss valid insertion opportunities.
If multiple Shave tasks execute the same operator, only a single dummy kernel is inserted. I have not yet observed a scenario where tasks with different operators execute concurrently. However, regarding the insertion logic itself: it follows a round-robin manner. The _dynamicPrefetchTileCounter increments after each insertion to determine the specific tile selection.
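
The round-robin tile selection mentioned above can be sketched in isolation. The class and method names are hypothetical; only the counter-modulo-cluster-count step mirrors the `_dynamicPrefetchTileCounter % numClusters` line from the diff.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-alone version of the round-robin tile choice.
class PrefetchTileSelector {
public:
    explicit PrefetchTileSelector(std::size_t numClusters): _numClusters(numClusters) {
        assert(numClusters > 0 && "need at least one cluster (tile)");
    }

    // Returns the tile for the next dummy prefetch kernel and advances the
    // counter, so consecutive insertions cycle through tiles 0..numClusters-1.
    std::size_t nextTile() {
        const std::size_t tile = _dynamicPrefetchTileCounter % _numClusters;
        ++_dynamicPrefetchTileCounter;
        return tile;
    }

private:
    std::size_t _numClusters;
    std::size_t _dynamicPrefetchTileCounter = 0;
};
```

With four clusters, successive calls to `nextTile()` return 0, 1, 2, 3 and then wrap around to 0.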

liyihao-1ntel · 2026-01-15T06:03:22Z

Hi @malbecki! This PR is introducing in-the-middle-prefetch-tasks-insertion. Would you help take a look at it when you have time? Would love to know your opinion on L2$ utilization on general cases and on this new approach. 😄 Thanks a lot!

Kepontry · 2026-01-21T17:16:54Z

Hi @liyihao-1ntel , I suspect the GitHub notification might have slipped through for @malbecki . Would you mind pinging him via internal channels to check if he has time to take a look? Thanks!

liyihao-1ntel · 2026-01-22T05:38:34Z

Hi @Kepontry ! I have reached out to malbecki.

Meanwhile, I would like to share some updates IMHO:
cc @DariaMityagina @ksenia-shkileva @Maxim-Doronin

Pro 1: This in-the-middle-prefetch-tasks-insertion does provide more chances to improve shave perf in theory. I think CI results can help us know better about its scale and impact.

Concern 1: As far as I know, support for profiling dummy kernels on LNL is still incomplete, so the changes in this PR can place invisible dummy SHAVE tasks in the gaps between SHAVE tasks. Too many invisible tasks might compromise profiling effectiveness and may confuse developers who are unfamiliar with this feature.

Concern 2: Current scheduling is based on a simulator and a cost model that will be updated over time (please correct me if any part of this statement is wrong). That is to say, constant thresholds need to be introduced very cautiously. Insertion in the middle of the schedule might be impactful even when we set proper barriers for these new tasks. Checking current CI performance is the first step.

malbecki · 2026-02-02T13:44:38Z

Hello,

Really sorry for the delay in responding. I agree that prefetching in the middle of the schedule would be a great feature. However, there is a reason this pass was constrained to work only at the beginning of the schedule, and only with intervals that have 250k free cycles on SHAVE: our cost model is not very good at estimating certain workloads. I did have a version with prefetch in the middle of the schedule, and it introduced both regressions and improvements on the models we test in our internal CI, with models slowing down a bit on average. The version with prefetch at the beginning with 250k free cycles eliminated some of the improvements but also eliminated all of the regressions, which is why we went with it. I will try to get data on how this PR affects the models we test internally, and I think we can decide whether to merge it based on that. Of course, if there is a good improvement on a specific model like Qwen2 (I didn't see such an improvement originally), then even if we see an overall regression we can talk about merging this code under a compiler option.

Regarding L2$ utilization:
I also observed that L2$ is effectively large enough to hold all of the kernel code, at least for the models we test internally (in other words, I wasn't able to find a model with L2$ eviction). That said, this pass needs to take that possibility into account in case future generations change that.

Regarding profiling:
I think we can ignore this problem for a discussion of the PR since it is not making the situation worse.

malbecki · 2026-02-02T14:22:12Z

+        _log.trace("insertPoint: {0}, bestReleaseCycle: {1}", *firstShaveTaskInIR, bestReleaseCycle);
+        newPrefetchKernels = insertPrefetchTasks(funcOp, kernelsToPrefetch, firstShaveTaskInIR, bestUpdateBarrier);
+    } else if (_useDummyKernelForInstructionPrefetch) {
+        newPrefetchKernels = insertPrefetchTasksDuringExec(funcOp, kernelsToPrefetch, allTasks);


It would be great if we could make the logic common between prefetching at the start and prefetching in the middle of the schedule since the former is just a special case of latter(at least conceptually). Though I understand that this is probably done to limit the impact on other platforms. If results are good we can follow up on this internally.

malbecki · 2026-02-02T15:25:59Z

+    uint64_t prevTargetTileTaskStartTime = 0;
+
+    // find the largest gap between a non-saturated SW task and a saturated SW task / the kernel to be prefetched
+    for (size_t i = 0; i < allTasks.size(); ++i) {


I think this code assumes there is no eviction event between the start of schedule and kernel start time. This might be true for existing platforms but might not be true for future. If we get good results we can accept it as is but for now let's at least put a comment here that we make such an assumption.

malbecki · 2026-02-02T15:35:23Z

+        _log.trace("Kernel '{0}': Found best gap of {1} cycles. Inserting relative to task {2}.", kernelName,
+                   bestGap.lookaheadGap, bestGap.insertionPointTaskIndex);
+
+        if (bestGap.insertionPointTaskIndex < 0 ||


Can this ever happen?

malbecki · 2026-02-02T16:07:39Z

+        size_t dynamicExecTile = _dynamicPrefetchTileCounter % numClusters;
+        _dynamicPrefetchTileCounter++;
+
+        auto newPrefetchKernel = insertDummyKernelOpBeforeFirstKernelTask(insertBeforeOp, mlir::ValueRange(),


We should add wait barriers. Not sure why shave task executes at the position you want it to execute but later passes would be free to reorder that prefetch to the beginning of the schedule

malbecki · 2026-02-02T16:23:51Z

+        }
+
+        if (prevTargetTileTaskIndex != -1) {
+            size_t simultaneousSwKernels = getSwKernelCountAtTime(prevTargetTileTaskStartTime, allTasks);


I don't think allTasks accounts for prefetch kernels that were inserted in the schedule so I think it is possible that this algorithm inserts more prefetch kernels into a slot theoretically exceeding the number of shaves on NPU. This might cause multiple prefetch kernels on a single SHAVE in a single slot which in turn means that the threshold of 50k might be exceeded.
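
One possible way to address this concern, sketched under assumptions: keep a per-slot count of the prefetch kernels already inserted and add it to the number of scheduled SW kernels before deciding whether a slot can take another one. `SlotOccupancy` and `tryReserve` are hypothetical and not part of the pass; the slot is identified here by the index of the insertion task.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical bookkeeping for prefetch kernels inserted into each slot.
class SlotOccupancy {
public:
    explicit SlotOccupancy(std::size_t numShaves): _numShaves(numShaves) {
        assert(numShaves > 0);
    }

    // Tries to reserve one SHAVE in the slot identified by insertionTaskIndex.
    // scheduledSwKernels is the number of SW kernels the original schedule
    // already runs in that slot. Returns false if the slot is full once the
    // prefetch kernels inserted earlier are counted as well.
    bool tryReserve(std::size_t insertionTaskIndex, std::size_t scheduledSwKernels) {
        std::size_t& inserted = _insertedPrefetches[insertionTaskIndex];
        if (scheduledSwKernels + inserted >= _numShaves) {
            return false;
        }
        ++inserted;
        return true;
    }

private:
    std::size_t _numShaves;
    std::unordered_map<std::size_t, std::size_t> _insertedPrefetches;
};
```

With four SHAVEs and two SW kernels already scheduled in a slot, the first two reservations for that slot succeed and the third is rejected, so no SHAVE would receive more than one prefetch kernel in that slot.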

malbecki · 2026-02-10T10:05:58Z

Update on results:

I ported this change to the internal develop branch and got the results. The good news is that I see an improvement (~3%) on a few models and I don't see any major regressions.

The bad news is that several models failed to compile, possibly due to changes in scheduler that were made on internal branch. This will require some fixing which I have scheduled internally but I can't provide an ETA for right now since we still didn't decide on the priority. Not sure if open source review can go ahead and diverge with internal branch.

malbecki · 2026-06-10T12:05:46Z

Sorry for the very long wait with no update. Let's start with the good news. I was able to fix all of the issues with failing models that appeared after this PR. It was a mix of existing issues that were exposed because this change introduces more prefetching, and new issues (basically a clash with scheduling code that has assumptions about the barrier configuration at this point). That was some 2 months ago, and since then I have been trying to eliminate performance regressions that appeared on both NPU4000 and NPU5010+ platforms. Sadly, during that effort we discovered that this pass has a major problem on a conceptual level: it doesn't take the WLM page split into account, which makes the pass produce schedules with very unstable performance. This meant that we had to drop prefetching in the middle of the schedule for both NPU5010+ and NPU4000 (NPU4000 had better results, but too many models got performance regressions for it to be accepted).

There are good reasons why this pass was designed the way it was. Right now, to actually get prefetch working beyond the start of the inference, we believe a WLM-aware prefetching pass is needed, which would require a completely different approach that we haven't yet fully designed. We will continue this effort, but probably not at high priority, and I can't provide any timeline right now.

Additionally, I was able to salvage some of the improvements this change offered related to the types of prefetched kernels. This gave a good speedup in many models without introducing any significant regressions. Sadly, I didn't see any improvement on Qwen2 after just this change.

Kepontry · 2026-06-11T10:45:11Z

Thank you, @malbecki , for spending so much time testing and iterating on this patch. I really appreciate all the effort here.

I would like to better understand the WLM mechanism and the principles behind potential WLM-aware prefetching. Is the recovered performance improvement built on top of the WLM page split? Also, does it introduce or rely on barriers to address the scheduling/order issue?

Could you also share more examples of models or workloads where this patch works well? My current intuition is that models with relatively small subrequest granularity may benefit from it. I am especially interested in exploring possibilities beyond Qwen models, so any additional model coverage or observations would be very helpful.

Enable prefetching of SW kernel instructions after the first SW task

2ecf4c2

Kepontry requested a review from a team as a code owner November 7, 2025 08:18

DariaMityagina added the READY_FOR_REVIEW label Nov 7, 2025

ksenia-shkileva reviewed Nov 7, 2025

style: Code cleanup and formatting

681035a

func: Add test case, change t3 to t1

8f39a27

Maxim-Doronin added 2 commits November 24, 2025 11:05

Merge branch 'develop' into upstream

65e4a87

Merge branch 'develop' into upstream

75421b6

Kepontry added 2 commits December 9, 2025 15:03

Add instpf memory to the config of the 4 failed tests

bf5cab7

Fix clang format check

ff57812

Fix memory allocation assertion

38a8e24

Fix memory allocation assertion in NPU40XX tests

48fdf84

Kepontry and others added 3 commits December 16, 2025 00:33

Merge branch 'develop' into upstream

0ccf9bf

Fix CLIP tests in CI

8110a44

Fix clang format check

51c2ac3

Fix mid execution mlir test by replacing topk with convert

e96f0f8

Merge branch 'develop' into upstream

1b364a5

DariaMityagina reviewed Dec 18, 2025

Refactor code, rename variables and add comments

a7cc88d

Kepontry force-pushed the upstream branch from c8f614f to a7cc88d Compare December 18, 2025 16:14

Merge branch 'develop' into upstream

bacb483

liyihao-1ntel reviewed Jan 13, 2026

Address code review comments

0e46bcd

liyihao-1ntel reviewed Jan 14, 2026

Address code review comments

7680564

malbecki reviewed Feb 2, 2026
