
OpenMPI runtime tuning (rankfile) #184

Open
VishalKJ opened this issue Oct 10, 2019 · 5 comments

@VishalKJ

Dear developers,
I am observing quite different timings for a sample 'casscf+xmspt2' input when running BAGEL in parallel using just BAGEL <input_file> versus mpirun -np 1 BAGEL <input_file>. The node I am running on has two sockets with 14 cores per socket and hyperthreading enabled (56 logical CPUs reported by lscpu). With either way of running, the output reports:

  • process grid (1, 1) will be used
  • using 56 threads per process

But when running without mpirun (i.e. just BAGEL) the timings of {MOLECULE, CASSCF, SMITH} are {0.29, 9.88, 41.77}, while if the program is run as mpirun -np 1 BAGEL the timings are {1.65, 35.14, 38.81}. This increase/variability in the MOLECULE and especially the CASSCF times is consistent across multiple runs. Is this expected behaviour? In addition, what is the correct way to run BAGEL for maximum parallel performance?

BAGEL compiled with
GCC-8.3.1/MKL/OPENMPI-4.0.1 CFLAGS=-DNDEBUG -O3 -mavx2 with boost_1.71.0

@VishalKJ
Author

In addition, when I monitor the usage with htop, in the case of plain BAGEL many cores show activity, while in the case of mpirun -np 1 BAGEL only one core appears active.
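
(Side note for readers seeing the same symptom; these diagnostic commands are not from the original thread. taskset is a standard Linux utility and --report-bindings is a standard OpenMPI mpirun option.)

taskset -cp <pid_of_BAGEL>                            # prints the CPU affinity of an already running BAGEL process
mpirun -np 1 --report-bindings BAGEL inputfile.json   # asks OpenMPI to print its own binding decision for each rank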

@shiozaki
Member

shiozaki commented Oct 10, 2019 via email

@VishalKJ
Author

VishalKJ commented Oct 10, 2019

Thanks for your reply, Dr. Shiozaki. We managed to resolve the issue ourselves; I am documenting it here so that future readers benefit.

If the program is run as 'mpirun -np 1 BAGEL', OpenMPI reserves only one core for the MPI process. That single core then gets overbooked with BAGEL_NUM_THREADS threads. The problem can be alleviated by using a rankfile, which specifies how to book slots for MPI processes. For example, suppose I want to run just one MPI process and use the hyperthreading functionality so that all 56 hardware threads are exploited.
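
(Not from the original post, but a quick way to confirm this default behaviour: --report-bindings is a standard OpenMPI mpirun option that prints which cores each rank is bound to.)

mpirun -np 1 --report-bindings BAGEL inputfile.json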

numactl -H gives me the layout:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 65436 MB
node 0 free: 42567 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
We can read this output as follows: on socket 0, CPUs 0-13 are the separate physical cores and 28-41 are their hyperthreaded siblings. This means threads (0,28) share a core, as do (1,29), and so on.
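
(As a cross-check, not part of the original post: the sibling mapping can also be read with lscpu or from sysfs; both are standard on Linux.)

lscpu --extended=CPU,CORE,SOCKET                                 # one row per logical CPU, showing which core/socket it belongs to
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # on this layout should print 0,28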

So we build our rankfile as follows:
cat rankfile_mpi1
rank 0=hostname slot=0-27

In this rankfile we have booked all the physical cores on both sockets, so the MPI process now has access to all of them. If we now set BAGEL_NUM_THREADS/MKL_NUM_THREADS=56, 56 threads are launched for this MPI process, fully taking advantage of all hyperthreaded cores. We can run this with:
mpirun -np 1 -rf rankfile_mpi1 BAGEL inputfile.json
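
(A minimal sketch of the full invocation, assuming a bash shell; the exact export commands are not shown above.)

export BAGEL_NUM_THREADS=56
export MKL_NUM_THREADS=56
mpirun -np 1 -rf rankfile_mpi1 BAGEL inputfile.json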

@VishalKJ
Author

VishalKJ commented Oct 10, 2019

To run two MPI processes, one on each socket/NUMA node, the corresponding rankfile is:
cat rankfile_mpi2
rank 0=argo2 slot=0-13
rank 1=argo2 slot=14-27

Run it with:
mpirun -np 2 -rf rankfile_mpi2 BAGEL inputfile.json
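
(The per-rank thread counts for this two-rank layout are not stated above; the following is a sketch that assumes each rank should use its 14 physical cores plus their hyperthread siblings, i.e. 28 threads per rank. mpirun's -x option, which exports environment variables to the ranks, is standard OpenMPI.)

export BAGEL_NUM_THREADS=28
export MKL_NUM_THREADS=28
mpirun -np 2 -rf rankfile_mpi2 -x BAGEL_NUM_THREADS -x MKL_NUM_THREADS BAGEL inputfile.json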

@shiozaki
Member

Thanks - good to know that worked out for you. Will leave this open so others may see it.

@shiozaki shiozaki changed the title BAGEL vs "mpirun -np 1 BAGEL" OpenMPI runtime tuning (rankfile) Oct 10, 2019