Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hipGetLastError calls to clear existing errors #686

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

umfranzw
Copy link
Collaborator

The behaviour of hipGetLastError will be changing in an upcoming HIP release. Currently, the error that's reported is cleared on each HIP API call. This means that hipGetLastError reports any error that occurred during the last HIP API call (in the current host thread).

Moving forward, the error that's reported will only be cleared on each call to hipGetLastError. This means that hipGetLastError will report any error that has occurred since the last call to hipGetError (in the current host thread).

Some of our tests rely on observing a return value of hipErrorOutOfMemory from hipMalloc when an allocation is too large for a given GPU architecture's memory system. While this will still work with the future behaviour, it will cause subsequent calls to hipGetError to also report this error.

This change fixes these tests by calling hipGetLastError before sections of code we want to detect errors in, so that any previously recorded error is cleared. This ensures that when we call hipGetLastError again after the code sections of interest complete, it only reports errors from within the sections of interest.

This change also adds error-clearing hipGetLastError calls to other locations (besides the tests mentioned above) where hipGetLastError is called. This is mainly to guard against user code that makes HIP API calls which set an error between rocPRIM calls.

More specifically, this change addes code to clear errors before:

  • existing calls to hipGetLastError
  • calls to the ROCPRIM_DETAIL_HIP_SYNC_AND_RETURN_ON_ERROR macro, which calls hipGetLastError

@Snektron
Copy link
Contributor

Snektron commented Feb 12, 2025

How does this change in HIP interact with the regular error code that is returned from other functions? Say in this scenario:

hipMalloc(very_large_number); // returns hipErrorOutOfMemory
// hipGetLastError(); // would return hipErrorOutOfMemory
kernel<<<..., 9999>>>(...);
hipGetLastError(); // Does this return hipErrorOutOfMemory or hipErrorTooManyThreads?

Copy link
Member

@Naraenda Naraenda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the benchmarks and tests: wouldn't it be better to modify the HIP_CHECK(...)-macro to invoke hipGetLastError()?

Regarding the changes in the implementation. It would make more sense to clear the error in body of the public API, so from the code perspective: whenever a public API is called, the last error gets cleared. This is different than clearing before kernel call, but should also be (IMHO) more consistent behaviour: always reset the last error, instead of only clearing whenever rocPRIM invokes a kernel call. I think this would also be a bit more maintainable.

@umfranzw
Copy link
Collaborator Author

For the benchmarks and tests: wouldn't it be better to modify the HIP_CHECK(...)-macro to invoke hipGetLastError()?

Regarding the changes in the implementation. It would make more sense to clear the error in body of the public API, so from the code perspective: whenever a public API is called, the last error gets cleared. This is different than clearing before kernel call, but should also be (IMHO) more consistent behaviour: always reset the last error, instead of only clearing whenever rocPRIM invokes a kernel call. I think this would also be a bit more maintainable.

Adding a call to clear the error in HIP_CHECK for benchmarks and tests sounds like a good idea - I'll make the changes.
For the changes in the implementation, I agree that it makes sense to clear the error at the beginning of every public API function. I can make that change as well.

@umfranzw
Copy link
Collaborator Author

How does this change in HIP interact with the regular error code that is returned from other functions? Say in this scenario:

hipMalloc(very_large_number); // returns hipErrorOutOfMemory
// hipGetLastError(); // would return hipErrorOutOfMemory
kernel<<<..., 9999>>>(...);
hipGetLastError(); // Does this return hipErrorOutOfMemory or hipErrorTooManyThreads?

With the new behaviour, hipGetLastError always reports the most recent error that occurred in a HIP API call. So in your example, it would report hipErrorTooManyThreads. To the best of my understanding, it always reports the same error code that's returned from a HIP API call.

umfranzw added 3 commits March 6, 2025 16:49
Future HIP versions will change the behaviour of hipGetLastError slightly.
Currently, the function returns any error that occured in the last HIP
API call in the current host thread. In other words, the error it reports
is cleared with each HIP API call.

In the future, the function will return any error that occurred in any HIP
API call in the current host thread, since the last time that hipGetLastError
was called. In other words, the error it reports will be cleared only on
a call to hipGetLastError.

A number of rocPRIM tests and benchmarks currently rely on the old behaviour of
hipGetLastError. In order to make sure that they continue to work with the
future changes, we need to call hipGetLastError before the test/benchmark
code is run, so that any previous errors that may have occurred (eg. a call
to hipMalloc that failed due to insufficient memory - which happens on some
architectures for large test input sizes) get cleared before the test/benchmark
calls hipGetLastError.

This change:
- modifies the HIP_CHECK macro so that it clears hipGetLastError
before and after the HIP API call it wraps. It now checks for two
types of errors: error returned from the wrapped function call, and
errors reported by hipGetLastError after the wrapped call completes.

- adds a HIP_CHECK_LAUNCH macro that can be used to wrap kernel calls.
It clears any internally recorded HIP error before and after the kernel
is invoked. Tests will fail if the hipGetLastError call invoked after
the kernel returns an error code.

- modifies the HIP_CHECK_MEMORY macro to clear an existing error before
the memory allocation call it wraps. If the memory allocation call
returns hipErrorOutOfMemory, then hipGetLastError is called again
(afterwards) to clear the error.

- modifies a few test files so that they use hipLaunchKernelGGL instead
of the triple chevron syntax for launching kernels. The triple chevron
syntax cannot be wrapped in a call to the HIP_LAUNCH_KERNEL marco.
The behaviour of hipGetLastError will change in the future.
With the changes, the error it records will only be cleared
on each call to hipGetLastError.

Call hipGetLastError at the beginning of public device API
functions, since they may call hipGetLastError internally,
and we don't want that call to report an error that happened
before the function was invoked.
Modify the HIP_CHECK and HIP_CHECK_LAUNCH macros so they
more clearly capture returned, pre-launch and post-launch
errors.

These changes are based on the HIP documentation and example at
https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_runtime_api/error_handling.html.

For HIP_CHECK, clear any pre-existing error, then capture any
error returned by the expression being checked, and any HIP error
returned by hipGetLastError.

For HIP_CHECK_LAUNCH, clear any pre-existing error, then launch
the kernel. Then capture any error returned by hipGetLastError
(this will capture pre-launch issues like kernel argument problems),
and capture any error returned by hipStreamSynchronize (this will
capture in-kernel errors).
@umfranzw umfranzw marked this pull request as draft March 7, 2025 13:47
Some tests cannot synchronize immediately after a kernel call/
This change splits HIP_CHECK_LAUNCH into two macros:
- HIP_CHECK_LAUNCH - does not call hipDeviceSynchronize (leaving detection
of in-kernel errors up to the caller)
- HIP_CHECK_LAUNCH_SYNC - does call hipDeviceSynchronize (catches
in-kernel errors)

It also adds a few hipGetLastError calls to clear the internally tracked
HIP error for new algorithms that have been added.
@umfranzw umfranzw force-pushed the get_last_error_update branch from 2075591 to d887b69 Compare March 7, 2025 20:02
@umfranzw umfranzw marked this pull request as ready for review March 7, 2025 21:37
@umfranzw
Copy link
Collaborator Author

umfranzw commented Mar 7, 2025

Hi all, I've done my best to implement the changes that we discussed above - please feel free to let me know what you think.
The main modifications I've made are:

  • I've added some new macros to help check for errors on kernel launch in the tests and benchmarks
  • I've added calls to hipGetLastError at the beginning of all device level public API functions, so that internally tracked HIP error is cleared with each call.

On the second point, I am planning to also add error-clearing calls to block-level and warp-level functions (so the internally tracked HIP error is cleared on all public API calls), but I wanted to open this up for review in advance, since I'll be on leave for a few weeks.

Copy link
Member

@Naraenda Naraenda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a copy paste error, but other than that looks good. Thanks!

Comment on lines +358 to +360
// Clear any existing error
(void) hipGetLastError();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate line

Suggested change
// Clear any existing error
(void) hipGetLastError();

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants