feat: DLSS Frame Generation with Multi-Frame Generation and Dynamic mode#7

Open
Gabrieli2806 wants to merge 8 commits into
Minecraft-Radiance:mainfrom
Gabrieli2806:feat/dlss-frame-gen
@Gabrieli2806

Summary

Adds DLSS Frame Generation (DLSS-G) support with Multi-Frame Generation (MFG) and Dynamic mode.

Note: This PR builds on top of #6 (DLSS Ultra Performance mode). The diff will shrink to just the FG commits once #6 is merged.

Features

  • DLSS Frame Generation via NGX SDK (`dlssg_wrapper`, `NGX_VK_CREATE_DLSSG` / `NGX_VK_EVALUATE_DLSSG`)
  • Multi-Frame Generation multipliers: x2, x3, x4, x5, x6 (1 to 5 interpolated frames per real frame)
  • Dynamic mode: per-frame `multiFrameCount = ceil(monitorHz / baseFps) - 1`, which automatically scales output toward the monitor refresh rate
  • Same-frame presentation: interpolated frames are presented immediately after the real frame in the same `present()` call (fixes DLSS-G indicator flickering)
  • GLFW bindings for monitor refresh rate detection (`glfwGetPrimaryMonitor` / `glfwGetVideoMode`)
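
The Dynamic mode formula can be sketched as a small pure function. This is a minimal illustration, not the PR's code: the function name, the `hardwareMax` parameter, and the handling of invalid inputs are assumptions. Only the formula and the clamp to the hardware maximum come from the description and commit notes.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical helper (name and signature are illustrative only).
// Returns the number of interpolated frames to request per real frame so that
// real + interpolated output approaches the monitor refresh rate.
std::uint32_t dynamicMultiFrameCount(double monitorHz, double baseFps,
                                     std::uint32_t hardwareMax) {
    // Assumption: treat non-positive inputs as "no frame generation".
    if (monitorHz <= 0.0 || baseFps <= 0.0) {
        return 0;
    }
    // ceil(monitorHz / baseFps) is the output multiplier (x2, x3, ...).
    const double multiplier = std::ceil(monitorHz / baseFps);
    if (multiplier <= 1.0) {
        return 0;  // base frame rate already meets the refresh rate
    }
    // Subtract the real frame, then clamp to what the hardware supports.
    const double interpolated = multiplier - 1.0;
    return static_cast<std::uint32_t>(
        std::min(interpolated, static_cast<double>(hardwareMax)));
}
```

For example, with a 144 Hz monitor and a 60 fps base rate the multiplier is `ceil(2.4) = 3`, so two interpolated frames are requested per real frame.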

Java counterpart

  • Radiance mod changes: Gabrieli2806/Radiance feat/dlss-frame-gen

Gabrieli2806 added 7 commits April 11, 2026 01:36
- Add dlssg_wrapper.hpp/cpp with DlssFG class wrapping NGX Frame Gen
- Extend NgxContext with queryFrameGenAvailable() and initFrameGen()
- Add FG attribute handling and initialization in DLSSModule
- Integrate FG evaluation in render framework (double-present for interp frames)
- Create interpolated frame images and blit pipeline
- dlssg_wrapper: evaluate() now accepts multiFrameCount and multiFrameIndex params
- dlss_wrapper: added queryMaxMultiFrameCount() capability query
- dlss_module: changed from bool to uint32_t frameGenMultiFrameCount_, 2D interp images
- dlss_module: parse off/x2/x3/x4 enum values, clamp to hardware max
- render_framework: multi-frame evaluate loop and multi-present in present()
…nous approach

- Remove interpPresentThreadFunc() async thread and all threading infrastructure
  (mutexes, condition variables, atomics, dedicated command pool)
- Implement pipelined synchronous approach: store interp frames from frame N,
  present them at the START of frame N+1's present() when GPU fence is already signaled
- Add PendingInterpPresent struct and presentPendingInterpolatedFrames() method
- Fix crash on world entry with Frame Generation enabled (ACCESS_VIOLATION)
- Fix watchdog timeout from render thread stuck in present()
- Near-zero wait for interp frame fences since a full render frame elapses
- Parse x5 (multiFrameCount=4) and x6 (multiFrameCount=5) attribute values
- Auto mode uses UINT32_MAX sentinel, resolved to hardware max in build()
- Logs auto mode selection with resolved count
- Add glfwGetPrimaryMonitor/glfwGetVideoMode GLFW bindings
- Query monitor refresh rate at init for dynamic FG target
- Dynamic mode: per-frame multiFrameCount = ceil(targetHz / baseFps) - 1
- Allocate interp images to hardware max, use only what's needed
- Reset frameGenDynamic_ when switching to non-dynamic modes
- Replace pipelined presentation (deferred to next frame) with same-frame
- Present interpolated frames immediately after real frame in present()
- Remove PendingInterpPresent struct and related state
- Rename presentPendingInterpolatedFrames -> presentInterpolatedFrames
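
The mode parsing described in the commits above (xN maps to N-1 interpolated frames, auto uses a `UINT32_MAX` sentinel, everything is clamped to the hardware maximum) might look roughly like the sketch below. The function names and the literal attribute strings `"auto"` and `"off"` are assumptions for illustration; Dynamic mode is left out.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <string_view>

// Sentinel meaning "use whatever the hardware allows" (UINT32_MAX per the commit notes).
constexpr std::uint32_t kAutoSentinel = std::numeric_limits<std::uint32_t>::max();

// Hypothetical parser: "x2".."x6" -> 1..5 interpolated frames, "auto" -> sentinel,
// "off" or anything unrecognized -> 0 (frame generation disabled).
std::uint32_t parseFrameGenMode(std::string_view value) {
    if (value == "auto") {
        return kAutoSentinel;
    }
    if (value.size() == 2 && value[0] == 'x' && value[1] >= '2' && value[1] <= '6') {
        return static_cast<std::uint32_t>(value[1] - '0') - 1u;
    }
    return 0;
}

// Resolves the requested count against the hardware maximum. Because the
// sentinel is the largest uint32_t, the same clamp also resolves auto mode.
std::uint32_t resolveFrameGenCount(std::uint32_t requested, std::uint32_t hardwareMax) {
    return std::min(requested, hardwareMax);
}
```

Using the maximum value as the sentinel means auto mode needs no separate branch at resolve time: clamping it yields the hardware maximum directly.
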
PEQHUB referenced this pull request in PEQHUB/MCVR Jun 9, 2026
Turns the V2 render graph from 'tracing into empty TLAS' into 'tracing
real chunk geometry' by plumbing Minecraft's chunk mesh output through
the V2 scene services.  Also fixes risks #1 (use-after-free on chunk
re-upload), #4 (chunk origin TODO), and #7 (garbage energy LUT) from
the original 4-pass analysis.

The big shift: chunks.cpp now tees the mesh worker output straight into
CmdChunkSubmit after BlockMesher::mesh() returns.  This path catches
~100% of in-game chunk uploads (the rebuildSingle fallback tee is still
fixed but only fires on the slow path).

ChunkId now 3D (x, sectionY, z):
  * scene_types.hpp: added sectionY field and updated std::hash
  * Was a latent design bug - 24 column sections collided on a single
    {x, z} key, so 23/24 of every column's data was silently dropped
  * replay_recorder.{hpp,cpp}: bumped binary version 1 -> 2 and added
    sectionY to on-disk record; older replays invalid (intentional)
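
A 3D key of this kind can be sketched as follows. The field names follow the description above, but the hash-combine formula, the `storeColumn` helper, and the section range -4..19 are assumptions for illustration, not the contents of scene_types.hpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Illustrative 3D chunk-section key.
struct ChunkId {
    std::int32_t x;
    std::int32_t sectionY;
    std::int32_t z;
    bool operator==(const ChunkId& o) const {
        return x == o.x && sectionY == o.sectionY && z == o.z;
    }
};

namespace std {
template <>
struct hash<ChunkId> {
    size_t operator()(const ChunkId& id) const noexcept {
        // Boost-style hash combine; any reasonable mixing function would do.
        size_t h = std::hash<int32_t>{}(id.x);
        h ^= std::hash<int32_t>{}(id.sectionY) + 0x9e3779b9u + (h << 6) + (h >> 2);
        h ^= std::hash<int32_t>{}(id.z) + 0x9e3779b9u + (h << 6) + (h >> 2);
        return h;
    }
};
}  // namespace std

// Stores every section of one column and returns how many distinct entries
// survive. Assumption: 24 sections indexed -4..19.
std::size_t storeColumn(std::int32_t cx, std::int32_t cz) {
    std::unordered_map<ChunkId, int> sections;
    for (std::int32_t sy = -4; sy < 20; ++sy) {
        sections[ChunkId{cx, sy, cz}] = sy;
    }
    return sections.size();
}
```

With a 2D {x, z} key the same loop would overwrite one entry 24 times; with the 3D key `storeColumn` keeps all 24 sections.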

Deferred-delete via ResourceGC (risk #1 fix):
  * blas_service.{hpp,cpp}: init() now takes ResourceGC*; retireBlas()
    wraps VkAccelerationStructureKHR + vk2::Buffer in shared_ptr and
    defers destruction via gc->defer(lambda).  Chunks re-meshed rapidly
    in the same pose will no longer free in-flight BLAS data.
  * tlas_service.{hpp,cpp}: same pattern for old TLAS buffers on rebuild,
    plus for the scratch buffer if it's ever replaced mid-flight.
  * gpu_upload_service.{hpp,cpp}: same pattern for vertex/index buffers
    on chunk re-upload and removeChunk.
  * ResourceGC has a 32-frame ring, framesInFlight=2, so deferred
    destructors stay alive for at least 31 frames - plenty of slack.
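
The general idea of ring-based deferred deletion can be sketched as below. `DeferredDeleter` and `framesUntilRelease` are hypothetical names and this is not the actual ResourceGC: in this sketch a callback runs after exactly RingSize frame advances, and the PR's exact frame accounting may differ.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical deferred-deletion ring. Callbacks queued during a frame run
// when the ring wraps back around to the slot they were queued in.
template <std::size_t RingSize>
class DeferredDeleter {
public:
    void defer(std::function<void()> fn) {
        slots_[current_].push_back(std::move(fn));
    }

    // Call once per frame: advance to the next slot and flush it.
    void advanceFrame() {
        current_ = (current_ + 1) % RingSize;
        // Move the list out first so a callback may safely call defer().
        auto pending = std::move(slots_[current_]);
        slots_[current_].clear();
        for (auto& fn : pending) {
            fn();
        }
    }

private:
    std::array<std::vector<std::function<void()>>, RingSize> slots_{};
    std::size_t current_ = 0;
};

// Counts how many frame advances pass before a shared_ptr-held resource,
// captured by a deferred lambda, is actually released.
template <std::size_t RingSize>
std::size_t framesUntilRelease() {
    DeferredDeleter<RingSize> gc;
    auto resource = std::make_shared<int>(42);
    std::weak_ptr<int> watch = resource;
    gc.defer([keep = std::move(resource)]() mutable { keep.reset(); });
    std::size_t frames = 0;
    while (!watch.expired()) {
        gc.advanceFrame();
        ++frames;
    }
    return frames;
}
```

Capturing the resource in a `shared_ptr` inside the lambda mirrors the pattern described above: the object stays alive for as long as the queued callback does, regardless of what the original owner does in the meantime.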

Chunk origin baked into TLAS instance transform (risk #4 fix):
  * blas_service.hpp: BlasData now caches originX/Y/Z as float (copied
    from GpuChunkData on build).
  * tlas_service.cpp: inst.transform.matrix now writes translation
    (originX, originY, originZ) in the column-3 slot instead of an
    identity matrix.  The closest-hit shader gl_ObjectToWorldEXT
    automatically reflects this - no rchit changes needed.
  * Dropped chunkOriginBuffer_ and chunkOriginArraySize_ - origin is
    in the transform, no parallel SSBO needed.  Resolves the
    tlas_service.cpp:73 TODO from the original session summary.
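
To illustrate where the translation lands: Vulkan's `VkTransformMatrixKHR` is a row-major 3x4 float matrix, so the chunk origin occupies the fourth column of each row. The sketch below uses a stand-in struct with the same layout so that it compiles without the Vulkan headers; the helper names are hypothetical.

```cpp
#include <array>

// Stand-in with the same layout as VkTransformMatrixKHR (float matrix[3][4]).
struct TransformMatrix3x4 {
    float matrix[3][4];
};

// Identity rotation/scale with the chunk origin as the translation column.
TransformMatrix3x4 makeChunkTransform(float originX, float originY, float originZ) {
    TransformMatrix3x4 t = {{
        {1.0f, 0.0f, 0.0f, originX},
        {0.0f, 1.0f, 0.0f, originY},
        {0.0f, 0.0f, 1.0f, originZ},
    }};
    return t;
}

// Applies the 3x4 transform to an object-space point (implicit w = 1), which
// is conceptually what the object-to-world matrix does for a hit position.
std::array<float, 3> objectToWorld(const TransformMatrix3x4& t,
                                   const std::array<float, 3>& p) {
    std::array<float, 3> out{};
    for (int r = 0; r < 3; ++r) {
        out[r] = t.matrix[r][0] * p[0] + t.matrix[r][1] * p[1] +
                 t.matrix[r][2] * p[2] + t.matrix[r][3];
    }
    return out;
}
```

A chunk-local vertex at (1, 2, 3) in a chunk with origin (16, -64, 32) maps to world position (17, -62, 35), with no per-chunk origin buffer lookup.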

Energy LUT initialized to 1.0 (risk #7 fix):
  * scene_resource_service.{hpp,cpp}: added pendingEnergyLutStaging_
    member.  init() creates a host-visible staging buffer filled with
    half-float 1.0 (0x3C00) covering all 64x64x4 channels.
  * New runDeferredInit(cmd) records vkCmdCopyBufferToImage on first
    call, with proper layout transitions UNDEFINED -> TRANSFER_DST ->
    SHADER_READ_ONLY.  Staging buffer freed after copy.
  * Called from engine_app.cpp processScene() first frame only.
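
The CPU side of the staging fill can be sketched as follows, assuming "64x64x4 channels" means a 64x64 texture with four half-float channels per texel. The function names are hypothetical, and the Vulkan buffer creation and copy are omitted.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint16_t kHalfOne = 0x3C00;  // IEEE 754 binary16 encoding of 1.0
constexpr std::size_t kLutWidth = 64;
constexpr std::size_t kLutHeight = 64;
constexpr std::size_t kLutChannels = 4;

// Builds the contents that would be written into the host-visible staging buffer.
std::vector<std::uint16_t> makeEnergyLutStaging() {
    return std::vector<std::uint16_t>(kLutWidth * kLutHeight * kLutChannels, kHalfOne);
}

// Decodes a finite binary16 value, used here only to sanity-check the constant.
// Infinity and NaN (exponent field 31) are not handled in this sketch.
float halfToFloat(std::uint16_t h) {
    const int sign = (h >> 15) & 0x1;
    const int exponent = (h >> 10) & 0x1F;
    const int mantissa = h & 0x3FF;
    float magnitude;
    if (exponent == 0) {
        magnitude = std::ldexp(static_cast<float>(mantissa), -24);  // zero/subnormal
    } else {
        magnitude = std::ldexp(1.0f + mantissa / 1024.0f, exponent - 15);
    }
    return sign ? -magnitude : magnitude;
}
```

Under that layout assumption the staging data is 16384 half-floats, or 32768 bytes.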

Production chunk tee (chunks.cpp, +88 lines):
  * After BlockMesher::mesh() returns the meshOutput, copy solid +
    cutout + translucent PBR triangles into a merged vertex array
    with index re-offsetting for V2.
  * Post as CmdChunkSubmit via EngineServices::bridge().
  * Only fires when useV2Engine is true and the bridge has been
    initialized (EngineApp::instance() != nullptr check).
  * This is the hot path - the rebuildSingle fallback path tee in
    ChunkProxy.cpp is kept working (with sectionY) for completeness.
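
Merging several indexed layers into one vertex array requires shifting each layer's indices by the number of vertices already emitted. A minimal sketch of that re-offsetting is shown below; the `Vertex` and `Mesh` types and the `mergeMeshes` name are hypothetical and do not reflect the actual meshOutput layout.

```cpp
#include <cstdint>
#include <vector>

struct Vertex {
    float x, y, z;
};

struct Mesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;
};

// Concatenates layers (for example solid, cutout, translucent) into one mesh.
// Each layer's indices are local to that layer, so they are rebased by the
// vertex count accumulated before the layer was appended.
Mesh mergeMeshes(const std::vector<Mesh>& layers) {
    Mesh merged;
    for (const Mesh& layer : layers) {
        const auto base = static_cast<std::uint32_t>(merged.vertices.size());
        merged.vertices.insert(merged.vertices.end(),
                               layer.vertices.begin(), layer.vertices.end());
        for (std::uint32_t index : layer.indices) {
            merged.indices.push_back(base + index);
        }
    }
    return merged;
}
```

Without the `base` offset, every layer after the first would index into the first layer's vertices and render the wrong geometry.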

ChunkProxy.cpp fix:
  * The existing rebuildSingle tee was using {cx, cz} as the key with
    no Y component - all 24 column sections silently collided.  Now
    derives sectionY from cmd.originY >> 4 and uses a proper 3D key.
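
The shift-based derivation can be illustrated in isolation. This sketch assumes `originY` is a signed 32-bit block coordinate and that sections are 16 blocks tall; the function name is hypothetical. Right-shifting a negative signed integer is guaranteed to be arithmetic only since C++20, although mainstream compilers behave that way in earlier standards too.

```cpp
#include <cstdint>

// Floor-divides a block-space Y coordinate by the 16-block section height.
// Unlike originY / 16, the shift rounds toward negative infinity, so blocks
// just below Y = 0 land in section -1 rather than section 0.
constexpr std::int32_t sectionYFromOriginY(std::int32_t originY) {
    return originY >> 4;
}
```

For instance, an origin at Y = -64 gives section -4 and Y = 319 gives section 19, whereas integer division would map Y = -1 to section 0 and merge it with the section above.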

CmdChunkSubmit/CmdChunkRemove schema:
  * bridge_service.hpp: added int32 sectionY field to both command
    structs, placed between chunkZ and existing origin fields for
    locality.
  * engine_app.cpp: handleCommand uses the 3D key throughout.

engine_app.cpp:
  * Scene service init() calls now pass &services_->frame().gc() so
    the deferred-delete wiring is functional.
  * processScene() first-frame path calls sceneRes().runDeferredInit(cmd)
    after scene UBO update but before RT adapter reads.
  * Chunk command handlers use ChunkId{x, sectionY, z} throughout.

Smoke test results (in-game, 90 seconds of V2 gameplay):
  * Pre-fix baseline (empty TLAS):  1.65 ms avg, 651 fps, sky-only
  * After-fix with real chunks:     4.17 ms avg, 240 fps (vsync capped)
  * Chunks flowed: 30323 -> 31007 (684 new chunk meshes through tee)
  * Engine log: 99832 bytes, 0 errors, 0 warnings, 0 validation issues
  * Deferred-delete: confirmed by service init log
    'deferred-delete: on' on all three scene services.
  * Process exit: clean in 1s, no deadlock

The +2.5 ms per frame is real BLAS build + TLAS rebuild + closest-hit
shader work replacing the previous sky-gradient miss-only path.  The
V2 graph has real headroom - it's hitting 240 Hz vsync, not the GPU
ceiling.

Closes risks 1, 4, 7 from the original 4-pass analysis.  Unblocks
PR37 (denoiser), PR39 (real materials), PR40 (DLSS-RR) which all need
real RT output to validate against.