While running the pre-CI tests on the branch daisyden/fsdp_test, I found that some cases in test_fsdp_core.py fail randomly, for example:
test_transformer_no_grad_mixed_precision_True_xpu
test_transformer_no_grad_mixed_precision_False_xpu
2025-03-13T07:14:07.6789182Z =================================== FAILURES ===================================
2025-03-13T07:14:07.6789816Z _______ TestNoGradXPU.test_transformer_no_grad_mixed_precision_False_xpu _______
2025-03-13T07:14:07.6790144Z Traceback (most recent call last):
2025-03-13T07:14:07.6790791Z File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 605, in wrapper
2025-03-13T07:14:07.6791247Z self._join_processes(fn)
2025-03-13T07:14:07.6791695Z File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 845, in _join_processes
2025-03-13T07:14:07.6792170Z self._check_return_codes(elapsed_time)
2025-03-13T07:14:07.6792640Z File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 902, in _check_return_codes
2025-03-13T07:14:07.6793183Z self.assertEqual(
2025-03-13T07:14:07.6793588Z File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4094, in assertEqual
2025-03-13T07:14:07.6794079Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-03-13T07:14:07.6794347Z AssertionError: Scalars are not equal!
2025-03-13T07:14:07.6794486Z
2025-03-13T07:14:07.6794570Z Expected 0 but got -11.
2025-03-13T07:14:07.6794758Z Absolute difference: 11
2025-03-13T07:14:07.6794932Z Relative difference: inf
2025-03-13T07:14:07.6795197Z Expect process 1 exit code to match Process 0 exit code of 0, but got -11
2025-03-13T07:14:07.6795725Z - generated xml file: /home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/distributed/fsdp/test_fsdp_core.py.xml -
576 class MultiProcessTestCase(TestCase):
577 MAIN_PROCESS_RANK = -1
578 # This exit code is used to indicate that the test code had an error and
579 # exited abnormally. There are certain tests that might use sys.exit() to
580 # simulate failures and in those cases, we can't have an exit code of 0,
581 # but we still want to ensure we didn't run into any other errors.
582 TEST_ERROR_EXIT_CODE = 10
583
584 # do not early terminate for distributed tests.
585 def _should_stop_test_suite(self) -> bool:
586 return False
587
588 # Many test cases init a process group but do not destroy it. This property
589 # determines whether this base test class should call
590 # `destroy_process_group` on behalf of the test. Its value is customizable
591 # by derived TestCase's but it is a pan-TestCase value (cannot be customized
592 # for each test).
593 @property
594 def destroy_pg_upon_exit(self) -> bool:
595 return True
596
597 @property
598 def world_size(self) -> int:
599 return DEFAULT_WORLD_SIZE
600
601 def join_or_run(self, fn):
602 @wraps(fn)
603 def wrapper(self):
604 if self.rank == self.MAIN_PROCESS_RANK:
**605 self._join_processes(fn)**
606 else:
607 fn()
608
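For context on the `-11` in the log above: on POSIX systems, a negative exit code reported for a child process generally means the process was terminated by a signal with that number, and signal 11 on Linux is SIGSEGV. The following is a minimal sketch of decoding such a code; `describe_exit_code` is a hypothetical helper name, not part of PyTorch's test utilities.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Hypothetical helper: turn a child-process exit code into readable text."""
    if code == 0:
        return "exited normally"
    if code < 0:
        # Negative codes follow the POSIX convention: -N means "killed by signal N".
        signum = -code
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"


# The failing rank in the log above reported -11.
print(describe_exit_code(-11))
```

If this reading is right, the `AssertionError` raised in `_check_return_codes` is only the parent process noticing that rank 1 crashed; the actual fault would be in the child process itself.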
Versions
Torch c208f217917929a9f780a81a8c7f788b4c03ee05
Did you see a UR error, or notice high memory pressure in the task manager? I strongly suspect that this random failure is caused by the driver.
@daisyden Could you please provide your reproduction steps? The failure log you attached is not the first point of failure. Please check the earlier logs to find the real failure.
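One possible way to collect reproduction steps for an intermittent crash like this is to rerun the affected tests in a loop until one run fails, with Python's `faulthandler` enabled so that a crashing process dumps its Python traceback. The sketch below assumes the tests are run through pytest from the torch-xpu-ops checkout; the helper names `build_repro_command` and `rerun_until_failure`, and the retry count, are illustrative assumptions.

```python
import os
import subprocess
import sys


def build_repro_command(test_file: str, keyword: str) -> list:
    """Build a pytest command line that selects tests by keyword (hypothetical helper)."""
    # -k filters tests by name, -x stops at the first failure, -v prints test names.
    return [sys.executable, "-m", "pytest", test_file, "-k", keyword, "-x", "-v"]


def rerun_until_failure(cmd: list, max_runs: int = 20):
    """Run cmd repeatedly; return (attempt, returncode) for the first failing run, else None."""
    env = dict(os.environ)
    # PYTHONFAULTHANDLER=1 is inherited by child processes, so a crashing
    # worker can dump its Python traceback to stderr.
    env["PYTHONFAULTHANDLER"] = "1"
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            return attempt, result.returncode
    return None


# Example usage (not executed here):
# cmd = build_repro_command("test/distributed/fsdp/test_fsdp_core.py",
#                           "test_transformer_no_grad")
# print(rerun_until_failure(cmd))
```

The output of the first failing run, read from the top rather than from the final assertion, should show where the failure really starts.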