Summary
Device fx.printf (lowered to gpu.printf) output is delayed in Jupyter cells and in any piped/redirected stdout, even after torch.cuda.synchronize(). The output only appears at process teardown or when something explicitly flushes host stdout.
The root cause appears to be host-side libc block-buffering of the ROCm runtime's stdout, not a device-side flush problem: the bytes are delivered on synchronize, but they remain in the C stdio buffer because stdout is fully buffered under Jupyter and under pipes.
This makes interactive debugging awkward because notebooks currently need a file-descriptor capture or flush workaround around launches just to show GPU prints inline.
Repro (MI350X / gfx950, ROCm 7.2)
import torch, flydsl.compiler as flyc, flydsl.expr as fx

@flyc.kernel
def hello_kernel():
    tid = fx.thread_idx.x
    fx.printf("hello from thread {}", tid)

@flyc.jit
def hello(stream: fx.Stream = fx.Stream(None)):
    hello_kernel().launch(grid=(1, 1, 1), block=(4, 1, 1), stream=stream)

hello(); torch.cuda.synchronize()  # notebook / piped stdout: prints nothing immediately
Evidence
Same program with stdout piped, matching the notebook behavior:
- hello(); torch.cuda.synchronize() produces no immediate output.
- Immediately calling ctypes.CDLL("libc.so.6").fflush(None) makes the 4 lines appear at once.
- Running the unchanged program under stdbuf -oL makes the lines appear right after synchronize(), in order.
So synchronize() appears to make the bytes available; they are stuck in the C stdio buffer until a host-side flush.
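The flush workaround from the evidence can be wrapped in a small helper for notebook use. This is a minimal sketch, not part of FlyDSL: flush_host_stdio is a hypothetical name, and it assumes a Linux host where ctypes.CDLL(None) resolves fflush from the C library already loaded in the process.

```python
import ctypes
import sys


def flush_host_stdio() -> int:
    """Flush Python's stdout and every open C stdio output stream.

    Returns the result of fflush(NULL): 0 on success, EOF (-1) on error.
    """
    # Flush Python's own buffer first so the ordering of Python prints
    # relative to C-level output is as sensible as it can be.
    sys.stdout.flush()

    # Assumption: CDLL(None) gives access to the process-wide libc symbols.
    libc = ctypes.CDLL(None)
    libc.fflush.argtypes = [ctypes.c_void_p]
    libc.fflush.restype = ctypes.c_int

    # Passing NULL asks libc to flush all open output streams,
    # including the C-level stdout that the device printf bytes sit in.
    return libc.fflush(None)


# Intended use with the repro above:
#   hello(); torch.cuda.synchronize(); flush_host_stdio()
```

Calling flush_host_stdio() right after the synchronize should behave like the manual fflush(None) call in the evidence; it does not change the buffering mode, so it has to be repeated after every launch whose output should be visible.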
Suggested fix
Have the FlyDSL runtime make device printf output reach host-visible stdout promptly in notebooks and piped logs, for example:
- configure runtime stdout as line-buffered during runtime initialization, or
- flush runtime stdout after kernel launch / synchronization points that expose gpu.printf output.
As a stopgap, launching the kernel process under stdbuf -oL reproduces the desired behavior without code changes.
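The line-buffering option can be approximated from Python with ctypes, which is roughly what stdbuf -oL arranges from outside the process. This is a sketch under stated assumptions, not the runtime's implementation: make_c_stdout_line_buffered is a hypothetical name, the libc is assumed to export the stdout symbol and to define _IOLBF as 1 (as glibc does), and the C standard only guarantees setvbuf when it is called before other operations on the stream, so the call belongs as early as possible in initialization. Whether this alone is sufficient for the ROCm runtime's output path has not been verified here.

```python
import ctypes

# Assumption: value of _IOLBF from <stdio.h> on glibc; platform-specific.
_IOLBF = 1


def make_c_stdout_line_buffered() -> int:
    """Switch the C-level stdout stream to line buffering.

    Returns the result of setvbuf: 0 on success, nonzero on failure.
    """
    # Assumption: CDLL(None) exposes the process-wide libc symbols (Linux).
    libc = ctypes.CDLL(None)

    # `stdout` is a global `FILE *` exported by libc; read the pointer value.
    c_stdout = ctypes.c_void_p.in_dll(libc, "stdout")

    libc.setvbuf.argtypes = [
        ctypes.c_void_p,   # FILE *stream
        ctypes.c_char_p,   # char *buf (NULL lets libc allocate the buffer)
        ctypes.c_int,      # buffering mode
        ctypes.c_size_t,   # buffer size (ignored when buf is NULL on glibc)
    ]
    libc.setvbuf.restype = ctypes.c_int

    return libc.setvbuf(c_stdout, None, _IOLBF, 0)


# Intended use: call once, before the first kernel launch that prints.
#   make_c_stdout_line_buffered()
#   hello(); torch.cuda.synchronize()
```

Compared with flushing after each launch, this is a one-time setup, but it changes buffering for every writer to the C stdout in the process, which may matter for throughput-heavy logging.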
cc @sjfeng1999 — this is the printf-in-notebook buffering issue we discussed.