Testcase he3_svp_asd-dmrg fails with MPICH 4.0.1 #242

Open
mbanck opened this issue Apr 3, 2022 · 5 comments

Comments

@mbanck

mbanck commented Apr 3, 2022

First reported here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006788

If I revert MPICH to 3.4.1, the test case runs fine. With MPICH 4.0.1, it fails with:

 ===== Starting sweeps =====

  o convergence threshold: 1.0000e-08
  iter state         sweep average     sweep range      dE average
  ERROR: EXCEPTION RAISED:  dsyev/pdsyevd failed in Matrix

@mbanck
Author

mbanck commented Apr 3, 2022

Hrm, I've now found #240, which is related. There you mentioned that this test case is bound to fail, but it sounds like that is due to numerical noise, not a dsyev/pdsyevd failure? Is this a separate issue?

The Ubuntu testsuite results in that other issue are not very helpful; they just say FAILED. I've now changed the testsuite script to dump the last 50 lines of output for failed test cases.
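
A minimal sketch of that kind of change, not the actual Debian testsuite script: it assumes a failing test can be recognised by a non-zero exit status, which may not match how the real script decides on FAILED. The helper names `tail` and `run_test` are hypothetical.

```python
import subprocess


def tail(text, n=50):
    """Return the last n lines of text as a single string."""
    lines = text.splitlines()
    return "\n".join(lines[-n:] if n > 0 else [])


def run_test(cmd, n=50):
    """Run cmd; if it exits non-zero, print the last n lines of its output.

    Assumption: failure is signalled by the exit status. stdout and stderr
    are merged so the tail shows messages in the order they were emitted.
    """
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    if proc.returncode != 0:
        print("FAILED: " + " ".join(cmd))
        print(tail(proc.stdout, n))
    return proc.returncode == 0
```

With something like this in place, a failing run would show the exception text from the sweep instead of only the word FAILED.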

@shiozaki
Member

shiozaki commented Jun 3, 2022

Hi Michael, sorry for the very late response. This test is supposed to converge to incorrect results, but it is not supposed to throw errors. I am not exactly sure what is going on without reproducing it myself. Thanks for reporting.

@mbanck
Author

mbanck commented Jun 3, 2022

It seems to be flaky: the test ran fine again on the next upload.

Not sure whether this can be tracked down definitively. I downgraded the corresponding Debian bug, but that's not really an option for GitHub.

I'll run another test build overnight and see what the current status is on my personal box.

@AdrianBunk

New information from the Debian bug:

it seems related to the host that runs the test. I.e. the test fails on our beefy amd64 host (ci-worker13) with 64 cores and 256GB RAM, but seems to pass on the others.

The error on s390x is the same by the way (that has 10 cores and 32GB RAM).

@mbanck
Author

mbanck commented Nov 27, 2022

So two things seem to work around this:

  1. downgrading mpich from 4.0.x to 3.x
  2. setting BAGEL_NUM_THREAD=4 (it fails with 8 or 16)
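
To check the thread-count dependence more systematically, something like the following sketch could rerun the test case with several `BAGEL_NUM_THREAD` values. It is only an illustration under assumptions: the helper names `env_with_threads` and `sweep` are hypothetical, and the command line for running the test case (shown only in a comment) is a guess that has to be adapted to the local build. A run counts as failed if it exits non-zero or prints the error string quoted at the top of this issue.

```python
import os
import subprocess


def env_with_threads(n, base=None):
    """Return a copy of the environment with BAGEL_NUM_THREAD set to n."""
    env = dict(os.environ if base is None else base)
    env["BAGEL_NUM_THREAD"] = str(n)
    return env


def sweep(cmd, thread_counts=(4, 8, 16)):
    """Run cmd once per thread count; map each count to True if it passed."""
    results = {}
    for n in thread_counts:
        proc = subprocess.run(
            cmd,
            env=env_with_threads(n),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        # Treat either a non-zero exit or the reported error text as a failure.
        failed = proc.returncode != 0 or "dsyev/pdsyevd failed" in proc.stdout
        results[n] = not failed
    return results


# Hypothetical invocation; the binary name and input path are assumptions:
# print(sweep(["BAGEL", "test/he3_svp_asd-dmrg.json"]))
```

Since the failure looked flaky earlier, a single pass per thread count proves little; repeating the sweep several times would give a better picture.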
