Some test_fsdp_core.py cases got random failures in _join_processes(fn) #1475

Open
daisyden opened this issue Mar 17, 2025 · 2 comments

daisyden (Contributor) commented Mar 17, 2025

🐛 Describe the bug

When running the pre-CI tests for the branch daisyden/fsdp_test, I found that some test_fsdp_core.py cases fail randomly, for example:
test_transformer_no_grad_mixed_precision_True_xpu
test_transformer_no_grad_mixed_precision_False_xpu

2025-03-13T07:14:07.6789182Z =================================== FAILURES ===================================
2025-03-13T07:14:07.6789816Z _______ TestNoGradXPU.test_transformer_no_grad_mixed_precision_False_xpu _______
2025-03-13T07:14:07.6790144Z Traceback (most recent call last):
2025-03-13T07:14:07.6790791Z   File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 605, in wrapper
2025-03-13T07:14:07.6791247Z     self._join_processes(fn)
2025-03-13T07:14:07.6791695Z   File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 845, in _join_processes
2025-03-13T07:14:07.6792170Z     self._check_return_codes(elapsed_time)
2025-03-13T07:14:07.6792640Z   File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 902, in _check_return_codes
2025-03-13T07:14:07.6793183Z     self.assertEqual(
2025-03-13T07:14:07.6793588Z   File "/home/sdp/miniforge3/envs/xpu_op_/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4094, in assertEqual
2025-03-13T07:14:07.6794079Z     raise error_metas.pop()[0].to_error(  # type: ignore[index]
2025-03-13T07:14:07.6794347Z AssertionError: Scalars are not equal!
2025-03-13T07:14:07.6794486Z 
2025-03-13T07:14:07.6794570Z Expected 0 but got -11.
2025-03-13T07:14:07.6794758Z Absolute difference: 11
2025-03-13T07:14:07.6794932Z Relative difference: inf
2025-03-13T07:14:07.6795197Z Expect process 1 exit code to match Process 0 exit code of 0, but got -11
2025-03-13T07:14:07.6795725Z - generated xml file: /home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/distributed/fsdp/test_fsdp_core.py.xml -
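
For context, an exit code of -11 means the child process was killed by signal 11 (SIGSEGV): Python's multiprocessing reports a child terminated by signal N with exitcode == -N. A minimal, self-contained sketch (not the actual test harness, just an illustration) of how a segfaulting child shows up as -11 to the parent that joins it:

import ctypes
import multiprocessing as mp

def crash():
    # Dereferencing a null pointer makes the child die with a real SIGSEGV.
    ctypes.string_at(0)

if __name__ == "__main__":
    p = mp.Process(target=crash)
    p.start()
    p.join()
    # multiprocessing reports "killed by signal N" as exitcode == -N,
    # so a segfaulting rank shows up as -11 instead of 0.
    print(p.exitcode)  # -11 on Linux

The corresponding harness code around line 605 of common_distributed.py: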

 576 class MultiProcessTestCase(TestCase):
 577     MAIN_PROCESS_RANK = -1
 578     # This exit code is used to indicate that the test code had an error and
 579     # exited abnormally. There are certain tests that might use sys.exit() to
 580     # simulate failures and in those cases, we can't have an exit code of 0,
 581     # but we still want to ensure we didn't run into any other errors.
 582     TEST_ERROR_EXIT_CODE = 10
 583
 584     # do not early terminate for distributed tests.
 585     def _should_stop_test_suite(self) -> bool:
 586         return False
 587
 588     # Many test cases init a process group but do not destroy it.  This property
 589     # determines whether this base test class should call
 590     # `destroy_process_group` on behalf of the test. Its value is customizable
 591     # by derived TestCase's but it is a pan-TestCase value (cannot be customized
 592     # for each test).
 593     @property
 594     def destroy_pg_upon_exit(self) -> bool:
 595         return True
 596
 597     @property
 598     def world_size(self) -> int:
 599         return DEFAULT_WORLD_SIZE
 600
 601     def join_or_run(self, fn):
 602         @wraps(fn)
 603         def wrapper(self):
 604             if self.rank == self.MAIN_PROCESS_RANK:
 605                 self._join_processes(fn)
 606             else:
 607                 fn()
 608
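
Per the traceback, the main process (rank == MAIN_PROCESS_RANK) takes the _join_processes(fn) branch, waits for the spawned rank processes, and _check_return_codes then asserts that every rank's exit code matches rank 0's. A simplified sketch of that check, inferred from the error message rather than copied from the PyTorch source (check_return_codes_sketch is a hypothetical name):

def check_return_codes_sketch(processes):
    # Sketch only: the real logic lives in common_distributed.py's _check_return_codes.
    first_code = processes[0].exitcode
    for i, p in enumerate(processes):
        # A negative exit code is the negated signal number, e.g. -11 == SIGSEGV,
        # which produces exactly the "Expected 0 but got -11" assertion above.
        assert p.exitcode == first_code, (
            f"Expect process {i} exit code to match Process 0 "
            f"exit code of {first_code}, but got {p.exitcode}"
        )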

Versions

Torch c208f217917929a9f780a81a8c7f788b4c03ee05

Stonepia (Contributor) commented Mar 18, 2025

Did you see a UR error, or did you observe high memory pressure in the task manager? I strongly suspect this random issue is caused by the driver.
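
For reference, a rough sketch for logging device memory around the failing tests, assuming this build exposes torch.xpu.is_available() and torch.xpu.memory_allocated():

import torch

def log_xpu_memory(tag):
    # Rough helper for spotting memory pressure around a test run;
    # assumes torch.xpu memory introspection is available in this build.
    if torch.xpu.is_available():
        allocated_mib = torch.xpu.memory_allocated() / (1024 ** 2)
        print(f"[{tag}] XPU memory allocated: {allocated_mib:.1f} MiB")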

chuanqi129 added this to the PT2.8 milestone Mar 19, 2025
zhangxiaoli73 commented

@daisyden Could you please provide your reproduction steps? The failure log you attached is not the first point of failure; please check the earlier logs to find the real failure.
