-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hipGetLastError calls to clear existing errors #686
base: develop
Are you sure you want to change the base?
Conversation
How does this change in HIP interact with the regular error code that is returned from other functions? Say in this scenario: hipMalloc(very_large_number); // returns hipErrorOutOfMemory
// hipGetLastError(); // would return hipErrorOutOfMemory
kernel<<<..., 9999>>>(...);
hipGetLastError(); // Does this return hipErrorOutOfMemory or hipErrorTooManyThreads? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the benchmarks and tests: wouldn't it be better to modify the HIP_CHECK(...)
-macro to invoke hipGetLastError()
?
Regarding the changes in the implementation. It would make more sense to clear the error in body of the public API, so from the code perspective: whenever a public API is called, the last error gets cleared. This is different than clearing before kernel call, but should also be (IMHO) more consistent behaviour: always reset the last error, instead of only clearing whenever rocPRIM invokes a kernel call. I think this would also be a bit more maintainable.
Adding a call to clear the error in HIP_CHECK for benchmarks and tests sounds like a good idea - I'll make the changes. |
With the new behaviour, hipGetLastError always reports the most recent error that occurred in a HIP API call. So in your example, it would report hipErrorTooManyThreads. To the best of my understanding, it always reports the same error code that's returned from a HIP API call. |
Future HIP versions will change the behaviour of hipGetLastError slightly. Currently, the function returns any error that occured in the last HIP API call in the current host thread. In other words, the error it reports is cleared with each HIP API call. In the future, the function will return any error that occurred in any HIP API call in the current host thread, since the last time that hipGetLastError was called. In other words, the error it reports will be cleared only on a call to hipGetLastError. A number of rocPRIM tests and benchmarks currently rely on the old behaviour of hipGetLastError. In order to make sure that they continue to work with the future changes, we need to call hipGetLastError before the test/benchmark code is run, so that any previous errors that may have occurred (eg. a call to hipMalloc that failed due to insufficient memory - which happens on some architectures for large test input sizes) get cleared before the test/benchmark calls hipGetLastError. This change: - modifies the HIP_CHECK macro so that it clears hipGetLastError before and after the HIP API call it wraps. It now checks for two types of errors: error returned from the wrapped function call, and errors reported by hipGetLastError after the wrapped call completes. - adds a HIP_CHECK_LAUNCH macro that can be used to wrap kernel calls. It clears any internally recorded HIP error before and after the kernel is invoked. Tests will fail if the hipGetLastError call invoked after the kernel returns an error code. - modifies the HIP_CHECK_MEMORY macro to clear an existing error before the memory allocation call it wraps. If the memory allocation call returns hipErrorOutOfMemory, then hipGetLastError is called again (afterwards) to clear the error. - modifies a few test files so that they use hipLaunchKernelGGL instead of the triple chevron syntax for launching kernels. The triple chevron syntax cannot be wrapped in a call to the HIP_LAUNCH_KERNEL marco.
The behaviour of hipGetLastError will change in the future. With the changes, the error it records will only be cleared on each call to hipGetLastError. Call hipGetLastError at the beginning of public device API functions, since they may call hipGetLastError internally, and we don't want that call to report an error that happened before the function was invoked.
Modify the HIP_CHECK and HIP_CHECK_LAUNCH macros so they more clearly capture returned, pre-launch and post-launch errors. These changes are based on the HIP documentation and example at https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_runtime_api/error_handling.html. For HIP_CHECK, clear any pre-existing error, then capture any error returned by the expression being checked, and any HIP error returned by hipGetLastError. For HIP_CHECK_LAUNCH, clear any pre-existing error, then launch the kernel. Then capture any error returned by hipGetLastError (this will capture pre-launch issues like kernel argument problems), and capture any error returned by hipStreamSynchronize (this will capture in-kernel errors).
Some tests cannot synchronize immediately after a kernel call/ This change splits HIP_CHECK_LAUNCH into two macros: - HIP_CHECK_LAUNCH - does not call hipDeviceSynchronize (leaving detection of in-kernel errors up to the caller) - HIP_CHECK_LAUNCH_SYNC - does call hipDeviceSynchronize (catches in-kernel errors) It also adds a few hipGetLastError calls to clear the internally tracked HIP error for new algorithms that have been added.
2075591
to
d887b69
Compare
Hi all, I've done my best to implement the changes that we discussed above - please feel free to let me know what you think.
On the second point, I am planning to also add error-clearing calls to block-level and warp-level functions (so the internally tracked HIP error is cleared on all public API calls), but I wanted to open this up for review in advance, since I'll be on leave for a few weeks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found a copy paste error, but other than that looks good. Thanks!
// Clear any existing error | ||
(void) hipGetLastError(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate line
// Clear any existing error | |
(void) hipGetLastError(); |
The behaviour of hipGetLastError will be changing in an upcoming HIP release. Currently, the error that's reported is cleared on each HIP API call. This means that hipGetLastError reports any error that occurred during the last HIP API call (in the current host thread).
Moving forward, the error that's reported will only be cleared on each call to hipGetLastError. This means that hipGetLastError will report any error that has occurred since the last call to hipGetError (in the current host thread).
Some of our tests rely on observing a return value of hipErrorOutOfMemory from hipMalloc when an allocation is too large for a given GPU architecture's memory system. While this will still work with the future behaviour, it will cause subsequent calls to hipGetError to also report this error.
This change fixes these tests by calling hipGetLastError before sections of code we want to detect errors in, so that any previously recorded error is cleared. This ensures that when we call hipGetLastError again after the code sections of interest complete, it only reports errors from within the sections of interest.
This change also adds error-clearing hipGetLastError calls to other locations (besides the tests mentioned above) where hipGetLastError is called. This is mainly to guard against user code that makes HIP API calls which set an error between rocPRIM calls.
More specifically, this change addes code to clear errors before: