[BUG] Memory leaks and crashes on AMD MI300A APU #145

@Tissot11

I could successfully run a reduced version of the magnetised shock problem on CUDA; according to Slurm it takes about 4 GB of RAM on 2 nodes (8 GPUs) and lasts about 7 minutes. However, running the same problem on MI300A (2 nodes, 8 GPUs) produces severe memory leaks (> 500 GB) on both single and multiple nodes, which crash the run. I attach the err and out files together with shock.txt. I used

cray-hdf5-parallel/1.14.3.1 rocm-6.2.2 modules

with

export HSA_OVERRIDE_GFX_VERSION=9.4.2; export MPICH_GPU_SUPPORT_ENABLED=1

cmake -B build -D pgen=shock -D mpi=ON -D CMAKE_CXX_COMPILER=hipcc -D CMAKE_C_COMPILER=hipcc -D Kokkos_ENABLE_HIP=ON -D Kokkos_ARCH_AMD_GFX942_APU=ON

errEntity.txt
outEntity.txt

shock.txt
