
OpenMPI runtime tuning (rankfile) #184

Open
VishalKJ opened this issue Oct 10, 2019 · 5 comments

@VishalKJ

Dear developers,
I am observing quite different timings for a sample 'casscf+xmspt2' input when running BAGEL in parallel using just BAGEL <input_file> versus mpirun -np 1 BAGEL <input_file>. The node I am running on has two sockets with 14 cores per socket and hyperthreading enabled (56 logical CPUs reported by lscpu). With either way of running, the output reports:

  • process grid (1, 1) will be used
  • using 56 threads per process

But when running without mpirun (i.e. just BAGEL) the timings of {MOLECULE, CASSCF, SMITH} are {0.29, 9.88, 41.77}, while if the program is run as mpirun -np 1 BAGEL the timings are {1.65, 35.14, 38.81}. This increase/variability in the MOLECULE and especially the CASSCF times is consistent across multiple runs. Is this expected behaviour? In addition, what is the correct way to run BAGEL for maximum parallel performance?

BAGEL compiled with
GCC-8.3.1/MKL/OPENMPI-4.0.1 CFLAGS=-DNDEBUG -O3 -mavx2 with boost_1.71.0

@VishalKJ
Author

In addition, when I monitor the usage with htop, in the case of plain BAGEL many cores show activity, while in the case of mpirun -np 1 BAGEL only one core appears active.
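
(Side note for readers seeing the same symptom; these diagnostic commands are not from the original thread. taskset is a standard Linux utility and --report-bindings is a standard OpenMPI mpirun option.)

taskset -cp <pid_of_BAGEL>                            # prints the CPU affinity of an already running BAGEL process
mpirun -np 1 --report-bindings BAGEL inputfile.json   # asks OpenMPI to print its own binding decision for each rank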

@shiozaki
Member

shiozaki commented Oct 10, 2019 via email

@VishalKJ
Author

VishalKJ commented Oct 10, 2019

Thanks for your reply, Dr. Shiozaki. We managed to resolve the issue ourselves; I am documenting it here so that future readers benefit.

If the program is run as 'mpirun -np 1 BAGEL', OpenMPI reserves only one core for the MPI process. That single core then gets overbooked with BAGEL_NUM_THREADS threads. The problem can be alleviated by using a rankfile, which specifies how to book slots for MPI processes. For example, suppose I want to run just one MPI process and use the hyperthreading functionality so that all 56 hardware threads are exploited.
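
(Not from the original post, but a quick way to confirm this default behaviour: --report-bindings is a standard OpenMPI mpirun option that prints which cores each rank is bound to.)

mpirun -np 1 --report-bindings BAGEL inputfile.json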

numactl -H gives me the layout:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 65436 MB
node 0 free: 42567 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
We can read this output as follows: on socket 0, CPUs 0-13 are the separate physical cores and 28-41 are their hyperthreaded siblings. This means threads (0,28) share a core, as do (1,29), and so on.
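
(As a cross-check, not part of the original post: the sibling mapping can also be read with lscpu or from sysfs; both are standard on Linux.)

lscpu --extended=CPU,CORE,SOCKET                                 # one row per logical CPU, showing which core/socket it belongs to
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # on this layout should print 0,28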

So we build our rankfile as follows:
cat rankfile_mpi1
rank 0=hostname slot=0-27

In this rankfile we have booked all the physical cores on both sockets, so the MPI process now has access to all of them. If we now set BAGEL_NUM_THREADS/MKL_NUM_THREADS=56, 56 threads are launched for this MPI process, fully taking advantage of all hyperthreaded cores. We can run this with:
mpirun -np 1 -rf rankfile_mpi1 BAGEL inputfile.json
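
(A minimal sketch of the full invocation, assuming a bash shell; the exact export commands are not shown above.)

export BAGEL_NUM_THREADS=56
export MKL_NUM_THREADS=56
mpirun -np 1 -rf rankfile_mpi1 BAGEL inputfile.json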

@VishalKJ
Author

VishalKJ commented Oct 10, 2019

To run two MPI processes, one on each socket/NUMA node, the corresponding rankfile is:
cat rankfile_mpi2
rank 0=argo2 slot=0-13
rank 1=argo2 slot=14-27

Run it with:
mpirun -np 2 -rf rankfile_mpi2 BAGEL inputfile.json
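
(The per-rank thread counts for this two-rank layout are not stated above; the following is a sketch that assumes each rank should use its 14 physical cores plus their hyperthread siblings, i.e. 28 threads per rank. mpirun's -x option, which exports environment variables to the ranks, is standard OpenMPI.)

export BAGEL_NUM_THREADS=28
export MKL_NUM_THREADS=28
mpirun -np 2 -rf rankfile_mpi2 -x BAGEL_NUM_THREADS -x MKL_NUM_THREADS BAGEL inputfile.json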

@shiozaki
Member

Thanks - good to know that worked out for you. Will leave this open so others may see it.

@shiozaki shiozaki changed the title BAGEL vs "mpirun -np 1 BAGEL" OpenMPI runtime tuning (rankfile) Oct 10, 2019