Optimize rmsnorm/layernorm to get better performance than aiter/triton by cschenjunlin · Pull Request #610 · ROCm/FlyDSL

cschenjunlin · 2026-06-02T08:13:40Z

Motivation

Optimize rmsnorm/layernorm to get better performance than aiter/triton

Technical Details

Test Plan

Test Result

Submission Checklist

coderfeli · 2026-06-16T13:21:02Z

CI failed @cschenjunlin

coderfeli · 2026-06-17T15:01:56Z

@cschenjunlin Test coverage: both _get_*_configs() helpers now default to a single case, (32768, 8192, "bf16"); the f32/f16, small-N, and unaligned-tail cases are commented out. As a result, CI no longer exercises f32/f16 or the generic/scalar path, which this PR also modifies (e.g. the xscale dtype). Suggest keeping at least one unaligned-N case and one f32 case active.
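A minimal sketch of what such a default config list could look like, assuming the `(M, N, dtype)` tuple layout seen in `(32768, 8192, "bf16")`. The helper name `get_norm_configs`, the small shapes, and the alignment width `ALIGN` are hypothetical and are not taken from the PR:

```python
# Hypothetical sketch: these names are illustrative, not FlyDSL's test helpers.
ALIGN = 8  # assumed vector width in elements; the real fast-path alignment may differ


def get_norm_configs():
    """Return (M, N, dtype) tuples to run by default."""
    return [
        (64, 8192, "bf16"),  # aligned N, bf16
        (64, 8191, "bf16"),  # unaligned N, meant to leave a tail for the non-vector path
        (64, 8192, "f32"),   # keeps one f32 case active
    ]


def is_aligned(n, align=ALIGN):
    """True when N is a whole multiple of the assumed vector width."""
    return n % align == 0


configs = get_norm_configs()
has_unaligned = any(not is_aligned(n) for _, n, _ in configs)
has_f32 = any(dtype == "f32" for _, _, dtype in configs)
print(has_unaligned, has_f32)
```

A check like `has_unaligned` and `has_f32` could also guard against these cases being commented out again, if the project wants that kind of self-test.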

Large default shape: 32768x8192 bf16 is about 512MB per tensor, and the fused-add-quant variants allocate several. Because these tests are marked only l2_device/rocm_lower and not large_shape, this shape runs in the default (non-large) flow; the MI runs took 45-59 min. Consider a small shape on the fast path for correctness, plus a big shape marked large_shape for perf.
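The ~512MB figure follows directly from the shape and the 2-byte bf16 element size. A short sketch of the arithmetic, where the helper `tensor_mib` and the small comparison shape are hypothetical:

```python
# Bytes per element for the dtypes named in the review comment.
DTYPE_BYTES = {"f32": 4, "f16": 2, "bf16": 2}


def tensor_mib(m, n, dtype):
    """Size of a dense M x N tensor in MiB."""
    return m * n * DTYPE_BYTES[dtype] / (1024 ** 2)


large = tensor_mib(32768, 8192, "bf16")  # the current default shape
small = tensor_mib(64, 8192, "bf16")     # a hypothetical small correctness shape
print(large, small)  # 512.0 1.0
```

Each input, output, or residual buffer of the default shape costs this much, so a variant that holds several such tensors reaches multiple GiB per test case.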

cschenjunlin added 2 commits May 28, 2026 22:39

rmsnorm tests: reorganize variants, use helpers to reduce duplication, simplify numeric checks

d313f43

add vector fast paths for layernorm variants

2b28750

cschenjunlin force-pushed the cjl/norm_optimization branch from 8d34dfc to 2b28750 Compare June 4, 2026 09:27

cschenjunlin added 5 commits June 4, 2026 17:32

refactor: extract scalar/vector load-store helpers to reduce duplicated codes

d670743

fix layernorm shared storage name error, use helpers to simplify tests

f6c474c

fix norm xscale dtype issue and extract test helpers

d48c011

fix python code style issue

cb7d5ed

Merge branch 'main' into cjl/norm_optimization

a3800e1

coderfeli and others added 3 commits June 16, 2026 21:21

Merge branch 'main' into cjl/norm_optimization

7dd9982

Merge branch 'main' into cjl/norm_optimization

cfa9acd

replace rsqrt API with fmath.rsqrt

947d00b

Merge branch 'main' into cjl/norm_optimization

328e78a

cschenjunlin wants to merge 11 commits into main from cjl/norm_optimization
