2D cumsum throwing GPU Kernel Exception #742

Open
rkierulf opened this issue Mar 21, 2025 · 6 comments
Comments

@rkierulf

Several recent builds for KomaMRI.jl have begun failing with AMDGPU on Julia 1.10. Examples:

https://buildkite.com/julialang/komamri-dot-jl/builds/1418#0195b5f6-5b8f-446e-9800-f59c29ffe098
https://buildkite.com/julialang/komamri-dot-jl/builds/1420#0195ba9c-a682-4918-8cba-97d030849721
https://buildkite.com/julialang/komamri-dot-jl/builds/1417#0195b5d5-72cc-4352-b06e-489fd9865dbf

The line where it fails is here: https://github.com/JuliaHealth/KomaMRI.jl/blob/master/KomaMRICore/src/simulation/SimMethods/BlochDict/BlochDict.jl#L53

This line is just calling cumsum on a 1D ROCArray of Float32 values, and the array is also a view within a larger array. Without having access to an AMD GPU, I can't investigate much further. I wonder if this would be enough to reproduce the issue:

using AMDGPU

A = ROCArray(rand(Float32, 1000))
B = view(A, 500:600)
C = cumsum(B)

@pxl-th
Member

pxl-th commented Mar 22, 2025

Hm... Interesting. I cannot reproduce the error with the MWE you provided (my machine also runs as part of the CI).

Can you maybe output the array before cumsum to get the exact values?
Maybe right before this line: https://github.com/JuliaHealth/KomaMRI.jl/blob/dc943ed3c657d9d37fbe62d706536dc4c3ea18ec/KomaMRICore/src/simulation/SimMethods/BlochDict/BlochDict.jl#L54

@rkierulf
Author

It looks like the array is just Float32[1.0f-14, 1.0f-14]: https://buildkite.com/julialang/komamri-dot-jl/builds/1421#0195c05a-b862-499f-beda-971831f858a9. There is also a warning just before it about global hostcalls.

@pxl-th
Member

pxl-th commented Mar 23, 2025

Ah... My bad, the error is not related to cumsum. Kernel exceptions are asynchronous and are checked before every kernel launch, so the fact that it errors at the cumsum call means the exception was raised by an earlier kernel.
Can you add the environment variable HIP_LAUNCH_BLOCKING=1 here so that every kernel launch synchronizes immediately? Then we'll see what's causing it.

Additionally, the fact that malloc hostcalls are launched means that some kernels emit exception-related code that captures the original value (e.g. during rounding or conversion).
In this case it's better to use GPU-friendly functions, which avoid emitting such code.
For example:

gpu_floor(T, x) = unsafe_trunc(T, floor(x))
gpu_ceil(T, x) = unsafe_trunc(T, ceil(x))
gpu_cld(x, y::T) where T = (x + y - one(T)) ÷ y

You can also compare the @code_llvm output to see how much less work these do than the original floor, ceil, cld.
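
A minimal CPU-only sketch of that comparison, reusing the three helper definitions above; it needs no GPU. It checks that the helpers agree with Base for in-range, positive inputs, shows that Base's floor has a throwing path, and prints the LLVM IR of both floor variants. Loading InteractiveUtils is only needed outside the REPL, and the specific test values are arbitrary choices for illustration.

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

# Helper definitions copied from the comment above.
gpu_floor(T, x) = unsafe_trunc(T, floor(x))
gpu_ceil(T, x) = unsafe_trunc(T, ceil(x))
gpu_cld(x, y::T) where T = (x + y - one(T)) ÷ y

# For values that fit in the target type, the helpers match Base.
@assert gpu_floor(Int32, 2.7f0) == floor(Int32, 2.7f0)
@assert gpu_ceil(Int32, 2.1f0) == ceil(Int32, 2.1f0)

# gpu_cld matches cld for positive integer operands.
@assert gpu_cld(7, 2) == cld(7, 2)
@assert gpu_cld(8, 2) == cld(8, 2)

# Base's floor(T, x) throws when x is not representable in T; this
# error path is the kind of exception code a kernel would have to carry.
threw = try
    floor(Int32, NaN32)
    false
catch err
    err isa InexactError
end
@assert threw

# Print the IR of both variants for a side-by-side comparison.
@code_llvm debuginfo=:none floor(Int32, 2.7f0)
@code_llvm debuginfo=:none gpu_floor(Int32, 2.7f0)
```

The helpers trade safety for simpler generated code: unsafe_trunc returns an arbitrary value when the input does not fit in T, and gpu_cld differs from cld for negative operands (for example, gpu_cld(-8, 2) gives -3 while cld(-8, 2) gives -4), so they only suit inputs known to be in range.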

@rkierulf
Author

rkierulf commented Mar 23, 2025

Ok, it appears the issue is with a different cumsum here: https://github.com/JuliaHealth/KomaMRI.jl/blob/master/KomaMRIBase/src/timing/TrapezoidalIntegration.jl#L49. This is a 2D cumsum of a matrix across the second dimension. Let me know if you are unable to reproduce on your machine and I can try printing the matrix values beforehand.

@pxl-th
Member

pxl-th commented Mar 24, 2025

Still cannot reproduce... If you can print out the values, maybe that will help.

@rkierulf
Author

rkierulf commented Mar 26, 2025

This build has the values printed: https://buildkite.com/julialang/komamri-dot-jl/builds/1428#0195d002-435b-400a-9d1b-1df5624de035. The matrix passed to the cumsum call that crashes has shape 1 x 548 and consists entirely of zero Float32 values. I also noticed the result is assigned back to the same variable the cumsum is computed on: y = cumsum(y, dims=2); not sure if that affects anything. This also happens inside tests, so it could be affected by --check-bounds, which I think is always set to yes in package test environments.

If you still can't reproduce, I don't think this is a major issue for KomaMRI.jl since it doesn't affect the default Bloch simulation method (@cncastillo feel free to weigh in), so it should be ok to treat this as lower-priority.
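
For anyone with an AMD GPU who wants to try this, here is a rough reproduction sketch built only from the details above: a 1 x 548 all-zero Float32 matrix, cumsum along dims=2, and the result rebound to the same variable. The name cumsum_repro and its to_device argument are hypothetical, not part of KomaMRI.jl or AMDGPU.jl, and there is no guarantee this triggers the exception outside the CI environment. The CPU call only confirms the expected result; the GPU call is left commented out because it needs AMDGPU.jl and the hardware.

```julia
# Hypothetical helper: applies the same pattern as the failing call to an
# array that `to_device` places on the target device.
function cumsum_repro(to_device)
    y = to_device(zeros(Float32, 1, 548))  # 1 x 548, all zeros, as in the CI log
    y = cumsum(y, dims=2)                  # result rebound to the same name
    return Array(y)                        # copy back to the host for inspection
end

# CPU reference run; `identity` keeps the data in a plain Array.
result = cumsum_repro(identity)
@assert size(result) == (1, 548)
@assert all(iszero, result)

# On a machine with an AMD GPU (assumption: AMDGPU.jl is installed), start
# Julia with --check-bounds=yes and HIP_LAUNCH_BLOCKING=1 set in the
# environment, then run:
#     using AMDGPU
#     cumsum_repro(ROCArray)
```

Running with --check-bounds=yes mirrors the package test environment mentioned above, and HIP_LAUNCH_BLOCKING=1 should make any kernel exception surface at the launch that caused it.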

@rkierulf rkierulf changed the title 1D cumsum throwing GPU Kernel Exception 2D cumsum throwing GPU Kernel Exception Mar 26, 2025