Device printf invisible in Jupyter / piped stdout (libc block-buffering of runtime stdout) #653

## Summary

Output from device `fx.printf` (lowered to `gpu.printf`) does not appear promptly in Jupyter cells or whenever stdout is piped or redirected, even after `torch.cuda.synchronize()`. It shows up only at process teardown or when something explicitly flushes host stdout.

The root cause appears to be **host-side libc block-buffering** of the ROCm runtime's stdout, not a device-side flush problem. The bytes reach the host on synchronize but then sit in the C stdio buffer, because libc fully buffers stdout whenever it is not a terminal, which is the case under Jupyter and under pipes.

This makes interactive debugging awkward because notebooks currently need a file-descriptor capture or flush workaround around launches just to show GPU prints inline.

## Repro (MI350X / gfx950, ROCm 7.2)

```python
import torch, flydsl.compiler as flyc, flydsl.expr as fx

@flyc.kernel
def hello_kernel():
    tid = fx.thread_idx.x
    fx.printf("hello from thread {}", tid)

@flyc.jit
def hello(stream: fx.Stream = fx.Stream(None)):
    hello_kernel().launch(grid=(1, 1, 1), block=(4, 1, 1), stream=stream)

hello(); torch.cuda.synchronize()   # notebook / piped stdout: prints nothing immediately
```

## Evidence

Same program with stdout piped, matching the notebook behavior:

- `hello(); torch.cuda.synchronize()` produces no immediate output.
- Immediately calling `ctypes.CDLL("libc.so.6").fflush(None)` makes the 4 lines appear at once.
- Running the unchanged program under `stdbuf -oL` makes the lines appear right after `synchronize()`, in order.

So `synchronize()` appears to make the bytes available; they are stuck in the C stdio buffer until a host-side flush.
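
The same buffering behavior can be reproduced without a GPU. The sketch below is not part of FlyDSL; `run_piped_demo` and the child script are illustrative names. It runs a child Python process with stdout piped, writes one line through C stdio (`puts`) and one line directly to fd 1 (`os.write`), then flushes libc. On a libc that fully buffers non-terminal stdout, the `os.write` line is expected to arrive first even though it was issued second, because the `puts` line waits in the stdio buffer until the flush.

```python
import subprocess
import sys

# Child script: mixes a C stdio write with a raw fd write on a piped stdout.
CHILD = r'''
import ctypes, os
libc = ctypes.CDLL(None)          # symbols of the C library already loaded in this process
libc.puts(b"via C stdio")         # goes into libc's stdout buffer (pipe => fully buffered)
os.write(1, b"via os.write\n")    # bypasses stdio and writes straight to fd 1
libc.fflush(None)                 # flush every open C stdio output stream
'''

def run_piped_demo():
    """Run the child with stdout piped and return the lines in arrival order."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode().splitlines()

if __name__ == "__main__":
    for line in run_piped_demo():
        print(line)
```

Device printf output delivered through the runtime's C stdio stream would be in the same position as the `puts` line here: present on the host, but not visible until a host-side flush.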

## Suggested fix

Have the FlyDSL runtime make device printf output reach host-visible stdout promptly in notebooks and piped logs, for example:

- configure runtime stdout as line-buffered during runtime initialization, or
- flush runtime stdout after kernel launch / synchronization points that expose `gpu.printf` output.
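
Until the runtime does this itself, the second option can be approximated from user code. The following is a minimal sketch under stated assumptions: `flush_host_stdio` and `flush_after` are hypothetical helper names, not FlyDSL or PyTorch APIs, and the sketch assumes a POSIX system where `ctypes.CDLL(None)` exposes the process's C library. It wraps any callable so that both Python's `sys.stdout` and all C stdio streams are flushed after the call returns.

```python
import ctypes
import functools
import sys

# Assumption: POSIX; CDLL(None) resolves symbols from the already-loaded libc.
_libc = ctypes.CDLL(None)

def flush_host_stdio():
    """Flush Python's stdout, then every C stdio output stream.

    Returns the result of fflush(NULL), which is 0 on success.
    """
    sys.stdout.flush()
    return _libc.fflush(None)

def flush_after(fn):
    """Wrap fn so host stdio is flushed after every call, even on error."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        finally:
            flush_host_stdio()
    return wrapper

# Hypothetical usage in a notebook, assuming torch is imported:
#   synchronize = flush_after(torch.cuda.synchronize)
#   hello(); synchronize()
```

Flushing at synchronization points keeps the cost off the launch path, at the price of output still being delayed for launches that are never synchronized. Line-buffering at runtime initialization avoids that gap but changes buffering for everything that writes to C stdout in the process.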

As a stopgap, running the host program that launches the kernel under `stdbuf -oL` produces the desired behavior without code changes.

---

cc @sjfeng1999 — this is the `printf`-in-notebook buffering issue we discussed.
