Add hipGetLastError calls to clear existing errors #686

umfranzw · 2025-02-11T21:56:17Z

The behaviour of hipGetLastError will be changing in an upcoming HIP release. Currently, the error that's reported is cleared on each HIP API call. This means that hipGetLastError reports any error that occurred during the last HIP API call (in the current host thread).

Moving forward, the error that's reported will only be cleared on each call to hipGetLastError. This means that hipGetLastError will report any error that has occurred since the last call to hipGetError (in the current host thread).

Some of our tests rely on observing a return value of hipErrorOutOfMemory from hipMalloc when an allocation is too large for a given GPU architecture's memory system. While this will still work with the future behaviour, it will cause subsequent calls to hipGetError to also report this error.

This change fixes these tests by calling hipGetLastError before sections of code we want to detect errors in, so that any previously recorded error is cleared. This ensures that when we call hipGetLastError again after the code sections of interest complete, it only reports errors from within the sections of interest.

This change also adds error-clearing hipGetLastError calls to other locations (besides the tests mentioned above) where hipGetLastError is called. This is mainly to guard against user code that makes HIP API calls which set an error between rocPRIM calls.

More specifically, this change addes code to clear errors before:

existing calls to hipGetLastError
calls to the ROCPRIM_DETAIL_HIP_SYNC_AND_RETURN_ON_ERROR macro, which calls hipGetLastError

Snektron · 2025-02-12T08:09:52Z

How does this change in HIP interact with the regular error code that is returned from other functions? Say in this scenario:

hipMalloc(very_large_number); // returns hipErrorOutOfMemory
// hipGetLastError(); // would return hipErrorOutOfMemory
kernel<<<..., 9999>>>(...);
hipGetLastError(); // Does this return hipErrorOutOfMemory or hipErrorTooManyThreads?

Naraenda

For the benchmarks and tests: wouldn't it be better to modify the HIP_CHECK(...)-macro to invoke hipGetLastError()?

Regarding the changes in the implementation. It would make more sense to clear the error in body of the public API, so from the code perspective: whenever a public API is called, the last error gets cleared. This is different than clearing before kernel call, but should also be (IMHO) more consistent behaviour: always reset the last error, instead of only clearing whenever rocPRIM invokes a kernel call. I think this would also be a bit more maintainable.

umfranzw · 2025-02-13T16:59:16Z

For the benchmarks and tests: wouldn't it be better to modify the HIP_CHECK(...)-macro to invoke hipGetLastError()?

Regarding the changes in the implementation. It would make more sense to clear the error in body of the public API, so from the code perspective: whenever a public API is called, the last error gets cleared. This is different than clearing before kernel call, but should also be (IMHO) more consistent behaviour: always reset the last error, instead of only clearing whenever rocPRIM invokes a kernel call. I think this would also be a bit more maintainable.

Adding a call to clear the error in HIP_CHECK for benchmarks and tests sounds like a good idea - I'll make the changes.
For the changes in the implementation, I agree that it makes sense to clear the error at the beginning of every public API function. I can make that change as well.

umfranzw · 2025-02-13T17:07:40Z

How does this change in HIP interact with the regular error code that is returned from other functions? Say in this scenario:
hipMalloc(very_large_number); // returns hipErrorOutOfMemory
// hipGetLastError(); // would return hipErrorOutOfMemory
kernel<<<..., 9999>>>(...);
hipGetLastError(); // Does this return hipErrorOutOfMemory or hipErrorTooManyThreads?

With the new behaviour, hipGetLastError always reports the most recent error that occurred in a HIP API call. So in your example, it would report hipErrorTooManyThreads. To the best of my understanding, it always reports the same error code that's returned from a HIP API call.

Future HIP versions will change the behaviour of hipGetLastError slightly. Currently, the function returns any error that occured in the last HIP API call in the current host thread. In other words, the error it reports is cleared with each HIP API call. In the future, the function will return any error that occurred in any HIP API call in the current host thread, since the last time that hipGetLastError was called. In other words, the error it reports will be cleared only on a call to hipGetLastError. A number of rocPRIM tests and benchmarks currently rely on the old behaviour of hipGetLastError. In order to make sure that they continue to work with the future changes, we need to call hipGetLastError before the test/benchmark code is run, so that any previous errors that may have occurred (eg. a call to hipMalloc that failed due to insufficient memory - which happens on some architectures for large test input sizes) get cleared before the test/benchmark calls hipGetLastError. This change: - modifies the HIP_CHECK macro so that it clears hipGetLastError before and after the HIP API call it wraps. It now checks for two types of errors: error returned from the wrapped function call, and errors reported by hipGetLastError after the wrapped call completes. - adds a HIP_CHECK_LAUNCH macro that can be used to wrap kernel calls. It clears any internally recorded HIP error before and after the kernel is invoked. Tests will fail if the hipGetLastError call invoked after the kernel returns an error code. - modifies the HIP_CHECK_MEMORY macro to clear an existing error before the memory allocation call it wraps. If the memory allocation call returns hipErrorOutOfMemory, then hipGetLastError is called again (afterwards) to clear the error. - modifies a few test files so that they use hipLaunchKernelGGL instead of the triple chevron syntax for launching kernels. The triple chevron syntax cannot be wrapped in a call to the HIP_LAUNCH_KERNEL marco.

The behaviour of hipGetLastError will change in the future. With the changes, the error it records will only be cleared on each call to hipGetLastError. Call hipGetLastError at the beginning of public device API functions, since they may call hipGetLastError internally, and we don't want that call to report an error that happened before the function was invoked.

Modify the HIP_CHECK and HIP_CHECK_LAUNCH macros so they more clearly capture returned, pre-launch and post-launch errors. These changes are based on the HIP documentation and example at https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_runtime_api/error_handling.html. For HIP_CHECK, clear any pre-existing error, then capture any error returned by the expression being checked, and any HIP error returned by hipGetLastError. For HIP_CHECK_LAUNCH, clear any pre-existing error, then launch the kernel. Then capture any error returned by hipGetLastError (this will capture pre-launch issues like kernel argument problems), and capture any error returned by hipStreamSynchronize (this will capture in-kernel errors).

Some tests cannot synchronize immediately after a kernel call/ This change splits HIP_CHECK_LAUNCH into two macros: - HIP_CHECK_LAUNCH - does not call hipDeviceSynchronize (leaving detection of in-kernel errors up to the caller) - HIP_CHECK_LAUNCH_SYNC - does call hipDeviceSynchronize (catches in-kernel errors) It also adds a few hipGetLastError calls to clear the internally tracked HIP error for new algorithms that have been added.

umfranzw · 2025-03-07T21:42:38Z

Hi all, I've done my best to implement the changes that we discussed above - please feel free to let me know what you think.
The main modifications I've made are:

I've added some new macros to help check for errors on kernel launch in the tests and benchmarks
I've added calls to hipGetLastError at the beginning of all device level public API functions, so that internally tracked HIP error is cleared with each call.

On the second point, I am planning to also add error-clearing calls to block-level and warp-level functions (so the internally tracked HIP error is cleared on all public API calls), but I wanted to open this up for review in advance, since I'll be on leave for a few weeks.

Naraenda

Found a copy paste error, but other than that looks good. Thanks!

Naraenda · 2025-03-10T08:09:34Z

rocprim/include/rocprim/device/device_binary_search.hpp

+    // Clear any existing error
+    (void) hipGetLastError();
+


Duplicate line

Suggested change

// Clear any existing error

(void) hipGetLastError();

umfranzw requested review from Naraenda and NguyenNhuDi February 11, 2025 21:56

umfranzw requested review from stanleytsang-amd, RobsonRLemos and lawruble13 as code owners February 11, 2025 21:56

Naraenda reviewed Feb 13, 2025

View reviewed changes

umfranzw added 3 commits March 6, 2025 16:49

umfranzw marked this pull request as draft March 7, 2025 13:47

umfranzw force-pushed the get_last_error_update branch from 2075591 to d887b69 Compare March 7, 2025 20:02

umfranzw marked this pull request as ready for review March 7, 2025 21:37

Naraenda approved these changes Mar 10, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add hipGetLastError calls to clear existing errors #686

Add hipGetLastError calls to clear existing errors #686

umfranzw commented Feb 11, 2025

Snektron commented Feb 12, 2025 •

edited

Loading

Naraenda left a comment

umfranzw commented Feb 13, 2025

umfranzw commented Feb 13, 2025

umfranzw commented Mar 7, 2025

Naraenda left a comment

Naraenda Mar 10, 2025

Add hipGetLastError calls to clear existing errors #686

Are you sure you want to change the base?

Add hipGetLastError calls to clear existing errors #686

Conversation

umfranzw commented Feb 11, 2025

Snektron commented Feb 12, 2025 • edited Loading

Naraenda left a comment

Choose a reason for hiding this comment

umfranzw commented Feb 13, 2025

umfranzw commented Feb 13, 2025

umfranzw commented Mar 7, 2025

Naraenda left a comment

Choose a reason for hiding this comment

Naraenda Mar 10, 2025

Choose a reason for hiding this comment

Snektron commented Feb 12, 2025 •

edited

Loading