Topic:

Tight Integration of the MPICH2 library into SGE.

Author:

Reuti, reuti__at__staff.uni-marburg.de; Philipps-University of Marburg, Germany

Version:

1.1 -- 2008-11-25 Updated release, comments and corrections are welcome
1.1.1 -- 2009-03-10 Updated a wrong spelling of $SGE_TASK_ID in the supplied scripts (thx to Kenneth for mentioning this)

Contents:

Prerequisites
Introduction to the MPICH2 family
Tight Integration of the mpd startup method
Tight Integration of the daemonless smpd startup method
Tight Integration of the daemon-based smpd startup method
Tight Integration of the gforker startup method
Nodes with more than one network interface
References and Documents

Prerequisites
Configuration of SGE with qconf or the GUI

You should already know how to change settings in SGE, such as setting up and changing a queue definition or the entries in a PE configuration. Additional information about queues and parallel environments can be found in the SGE man pages "queue_conf" and "sge_pe" (make sure the SGE man pages are included in your $MANPATH).

Target platform

This Howto targets MPICH2 version 1.0.8 and SGE 6.2 on Linux. Most likely it will work in the same way under other operating systems, although some of the commands will then need slight modifications. It will not work this way with MPICH2 version 1.0, as some things were only adjusted in 1.0.1 and later which allow an easy Tight Integration.

MPICH2

MPICH2 is a library from the Argonne National Laboratory (http://www.anl.gov) implementing the MPI-2 standard. Before you start with the integration of MPICH2 into SGE, you should already be familiar with the operation of MPICH2 outside of SGE and know how to compile a parallel program using MPICH2.

Included setups and scripts

The supplied archive in [1] contains the necessary scripts for the mpd and smpd startup methods (for the gforker method only the example shell script is included, as this startup method needs no scripts to start and stop any daemon). It contains scripts and programs similar to those in the PVM and MPICH integration packages distributed with SGE. To install it for common usage in the whole cluster, you may want to untar it in $SGE_ROOT, which will create the new directories $SGE_ROOT/mpich2_mpd, $SGE_ROOT/mpich2_smpd, $SGE_ROOT/mpich2_smpd_rsh and $SGE_ROOT/mpich2_gforker.

A short program is provided in [2], which will allow you to observe the correct distribution of the spawned tasks.

Queue configuration

The supplied job scripts are to be executed under the bash shell. As the default setting in SGE after installation is to use the csh shell, you might need to either change two entries in the queue definition to read:

$ qconf -sq all.q
...
shell /bin/bash
...
shell_start_mode unix_behavior
...
(please see "man queue_conf" for details about this setting), or submit the (parallel) jobs with the additional argument:
$ qsub -S /bin/bash ...
Please note that under Linux /bin/sh is often a link to /bin/bash, so the setting can be abbreviated this way.
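If you prefer to adjust these queue entries from the command line instead of an editor session, one possible way (the queue name all.q is just an example) is:

$ qconf -mattr queue shell /bin/bash all.q
$ qconf -mattr queue shell_start_mode unix_behavior all.q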



Introduction to the MPICH2 family
This new MPICH2 implementation of the MPI-2 standard was created to supersede the widely used MPICH(1) implementation. Besides implementing the MPI-2 standard, another goal was a faster startup. To give the user greater flexibility, there are (for now) three startup methods implemented: mpd, smpd and gforker.

Be aware that for each startup method and chosen way to compile MPICH2, you will get a separate set of mpirun and/or mpiexec commands. They are not interchangeable! Hence, once you have installed mpd and compiled a program to run in the ring, you can't switch to smpd simply by using a different mpirun or mpiexec. Instead you have to recompile (or at least relink) your program with the libraries belonging to that specific startup method. This means that you have to plan your $PATH carefully during compilation and execution of the parallel program to get correct behavior. Not doing so will result in strange error messages which do not point directly to the cause of the trouble. After compiling your application software, it may be advisable not to rely on the $PATH set in your interactive shell at submission time, but to set it explicitly in the script submitted to SGE, as we will do in this Howto for demonstration purposes. Also note that the preferred startup command in MPICH2 is mpiexec, not mpirun.
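To check which flavor is currently selected, you can for example prepend the desired installation to your $PATH and verify it (the path is the one used later in this Howto and just an example):

$ export PATH=/home/reuti/local/mpich2-1.0.8/mpd/bin:$PATH
$ which mpicc mpiexec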



Tight Integration of the mpd startup method
First we discuss the integration of the preferred startup method in MPICH2, called mpd. You can compile MPICH2 after you have configured it, perhaps with an alternative path for the installation of the parallel library:
$ ./configure --prefix=/home/reuti/local/mpich2-1.0.8/mpd

After the usual make and make install we can compile the short program which is supplied in [2] with:

$ mpicc -o mpihello mpihello.c
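Before integrating anything into SGE, it may be worth a quick interactive test of the mpd startup method outside of SGE (this assumes a ~/.mpd.conf with a secret word is already set up as described in the MPICH2 installation guide):

$ mpd &
$ mpdtrace -l
$ mpiexec -n 2 ./mpihello
$ mpdallexit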

Similar to the PVM integration, we need a small helping program to start the daemons as a task on the slave nodes using the qrsh command. In some way, this start_mpich2 can be seen as a generic program that extends SGE with the ability to run a qrsh command in the background, and it can easily be modified for similar startup methods.

If you installed the whole package like suggested in $SGE_ROOT/mpich2_mpd, set the working directory to $SGE_ROOT/mpich2_mpd/src and compile the included program with:

$ ./aimk
$ ./install.sh

The installation process will put the helping program mpich2_mpd into a newly created directory $SGE_ROOT/mpich2_mpd/bin, which is the default location where the included script startmpich2.sh looks for this program. This helper program must be compiled for every platform you have in the cluster on which you want to run this startup method. A parallel environment for this startup method may look like:

$ qconf -sp mpich2_mpd
pe_name mpich2_mpd
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile \
/home/reuti/local/mpich2-1.0.8/mpd
stop_proc_args /usr/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
/home/reuti/local/mpich2-1.0.8/mpd
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE

Remember to attach this PE to a cluster queue of your choice and to adjust the path to your MPICH2 installation. As the chain of Python modules used in this startup method will create additional processes and process groups, it's essential to include a special switch in your cluster configuration, which will kill the processes at the end of a job or after an issued qdel by identifying the associated processes by an additional group id attached to all spawned processes on a slave node (by default only the process group of the first process started by qrsh_starter is killed, i.e. just this complete process group including its children):

$ qconf -sconf
...
execd_params ENABLE_ADDGRP_KILL=TRUE
...
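Attaching the PE to a queue and opening the global configuration for this change can also be done from the command line, for example (the queue name all.q is just an example; qconf -mconf opens the global configuration in an editor):

$ qconf -aattr queue pe_list mpich2_mpd all.q
$ qconf -mconf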

Having done so, we can now submit a job with the usual sequence:

$ qsub -pe mpich2_mpd 4 mpich2_mpd.sh

This will first start a local mpd process on the master node of the parallel job. Even this process is started by a local qrsh inside startmpich2.sh, so that its sge_shepherd stays alive. After it has launched successfully, its port is queried and the accompanying daemons on the slave nodes of the parallel job are started with this information.
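For reference, the essential content of the demonstration script mpich2_mpd.sh boils down to setting the matching path and calling mpiexec with the machine file prepared in $TMPDIR; a minimal sketch (the actual supplied script may differ in details):

#!/bin/sh
# select the mpd-compiled MPICH2 (path is the example used above)
export PATH=/home/reuti/local/mpich2-1.0.8/mpd/bin:$PATH
mpiexec -machinefile $TMPDIR/machines -n $NSLOTS $HOME/mpihello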

The mpich2_mpd.sh will generate a *.po$JOB_ID file like:

$ cat mpich2.sh.po628
-catch_rsh /var/spool/sge/pc15381/active_jobs/628.1/pe_hostfile /home/reuti/local/mpich2-1.0.8/mpd
pc15381:2
pc15370:2
startmpich2.sh: check for local mpd daemon (1 of 10)
/usr/sge/bin/lx24-x86/qrsh -inherit -V pc15381 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
startmpich2.sh: check for local mpd daemon (2 of 10)
startmpich2.sh: check for mpd daemons (1 of 10)
/usr/sge/bin/lx24-x86/qrsh -inherit -V pc15370 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: got all 2 of 2 nodes
-catch_rsh /home/reuti/local/mpich2-1.0.8/mpd

The first check only looks for the local mpd daemon, i.e. whether it can be contacted with mpdtrace -l, as we need this information to instruct the other daemons to use the port which was selected by the first started mpd. The following loop starts the daemons on all remaining slave nodes and waits until all are up and running. After startmpich2.sh has finished, the mpd processes are available, and the user program started by one (or several) mpiexec in the job script will spawn all its processes onto the already running mpds. On the master node of the parallel job the following processes can be discovered:

$ ssh pc15381 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
22110 1 22110 /usr/sge/bin/lx24-x86/sge_execd
31712 22110 31712 \_ sge_shepherd-628 -bg
31775 31712 31775 | \_ /bin/sh /var/spool/sge/pc15381/job_scripts/628
31776 31775 31775 | \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpiexec -machinefile /tmp/628.1.all.q/mac
31744 22110 31744 \_ sge_shepherd-628 -bg
31745 31744 31745 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/628.1/1.pc15381
31755 31745 31755 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31777 31755 31777 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31780 31777 31780 | \_ /home/reuti/mpihello
31778 31755 31778 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31779 31778 31779 \_ /home/reuti/mpihello
31736 1 31713 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15381 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31759 1 31713 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15370 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -

The distribution of user processes is according to the granted slot allocation. The two other processes can be found on the other node:

$ ssh pc15370 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
15848 1 15848 /usr/sge/bin/lx24-x86/sge_execd
3146 15848 3146 \_ sge_shepherd-628 -bg
3148 3146 3148 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15370/active_jobs/628.1/1.pc15370
3156 3148 3156 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
3157 3156 3157 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
3159 3157 3159 | \_ /home/reuti/mpihello
3158 3156 3158 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
3160 3158 3160 \_ /home/reuti/mpihello

On the master node of the parallel job, the console of the mpd ring will be created in /tmp, as this is unfortunately hard-coded in several places in the MPICH2 source (it might change in version 1.1 of MPICH2 to honor the $TMPDIR which is already set by SGE). To distinguish between several consoles of the same user on a node, the environment variable MPD_CON_EXT is set so that the job number and task id of SGE are reflected in the name of the console file.



Tight Integration of the daemonless smpd startup method
To compile MPICH2 for a smpd-based startup, it must first be configured (after a make distclean, in case you just walked through the mpd startup method before):
$ ./configure --prefix=/home/reuti/local/mpich2-1.0.8/smpd --with-pm=smpd

and to get a Tight Integration we need a PE like the following (passing -catch_rsh to the start script of the PE):

$ qconf -sp mpich2_smpd_rsh
pe_name mpich2_smpd_rsh
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/mpich2_smpd_rsh/startmpich2.sh -catch_rsh \
$pe_hostfile
stop_proc_args /usr/sge/mpich2_smpd_rsh/stopmpich2.sh
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
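The smpd startup authenticates with a passphrase stored in the file ~/.smpd. A minimal sketch to create this file, assuming the commonly used phrase=<secret> syntax (the secret word itself is of course just a placeholder), could be:

$ echo "phrase=<some_secret_word>" > ~/.smpd
$ chmod 600 ~/.smpd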

Please look up in the MPICH2 documentation the details of creating such a .smpd file with a "phrase" in it. After submitting the job in exactly the same way as before (but this time using the script mpich2_smpd_rsh.sh in the qsub command):

$ qsub -pe mpich2_smpd_rsh 4 mpich2_smpd_rsh.sh

you should see a distribution on the master node of your parallel job like:

$ ssh pc15381 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
22110 1 22110 /usr/sge/bin/lx24-x86/sge_execd
31930 22110 31930 \_ sge_shepherd-630 -bg
31955 31930 31955 | \_ /bin/sh /var/spool/sge/pc15381/job_scripts/630
31956 31955 31955 | \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/630.1.all.q/machines /home/reuti/mpihello
31957 31956 31955 | \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/630.1.all.q/machines /home/reuti/mpihello
31958 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15381 env PMI_RANK=0 PMI_SIZE=4 PMI_KVS=359B9A86
31959 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15381 env PMI_RANK=1 PMI_SIZE=4 PMI_KVS=359B9A86
31960 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15370 env PMI_RANK=2 PMI_SIZE=4 PMI_KVS=359B9A86
31961 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15370 env PMI_RANK=3 PMI_SIZE=4 PMI_KVS=359B9A86
31986 22110 31986 \_ sge_shepherd-630 -bg
31987 31986 31987 | \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/630.1/1.pc15381
32004 31987 32004 | \_ /home/reuti/mpihello
31991 22110 31991 \_ sge_shepherd-630 -bg
31992 31991 31992 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/630.1/2.pc15381
32010 31992 32010 \_ /home/reuti/mpihello

The important thing is that the started script, including the mpiexec and the mpihello processes, is under full SGE control.

(Side note: the default remote shell command compiled into MPICH2 this way is ssh -x. You may replace this by changing the default value ssh -x to a plain rsh in the routine mpiexec_rsh() in $MPICH2_ROOT/src/pm/smpd/mpiexec_rsh.c of the MPICH2 source, or change it at each execution of your application program by setting the environment variable "MPIEXEC_RSH=rsh; export MPIEXEC_RSH" to get access to the rsh-wrapper, like in the original MPICH implementation.)
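For reference, the essential content of the demonstration script mpich2_smpd_rsh.sh boils down to a call like the following sketch (paths follow the examples above; the actual supplied script may differ in details):

#!/bin/sh
# daemonless smpd startup; the rsh calls are caught by the wrappers of the PE's -catch_rsh setup
export PATH=/home/reuti/local/mpich2-1.0.8/smpd/bin:$PATH
# only needed if you rely on the rsh-wrapper instead of the compiled-in ssh -x (see side note above)
export MPIEXEC_RSH=rsh
mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines $HOME/mpihello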



Tight Integration of the daemon-based smpd startup method
Like for the mpd startup method, we will need a small helping program. As different parameters have to be used, this program is not identical to the one used in the tight mpd integration.

If you installed the whole package like suggested in $SGE_ROOT/mpich2_smpd, set the working directory to $SGE_ROOT/mpich2_smpd/src and compile the included program with:

$ ./aimk
$ ./install.sh

The installation process will put the helping program mpich2_smpd into a newly created directory $SGE_ROOT/mpich2_smpd/bin, which is the default location where the included script startmpich2.sh looks for this program. A parallel environment for this startup method may look like:

$ qconf -sp mpich2_smpd
pe_name mpich2_smpd
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/mpich2_smpd/startmpich2.sh -catch_rsh $pe_hostfile \
/home/reuti/local/mpich2-1.0.8/smpd
stop_proc_args /usr/sge/mpich2_smpd/stopmpich2.sh -catch_rsh \
/home/reuti/local/mpich2-1.0.8/smpd
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE

If we start the daemons on our own, we have to select a free port. Although it may not be safe in all cluster setups, the formula included in startmpich2.sh, stopmpich2.sh and the demonstration submit script mpich2_smpd.sh uses "$JOB_ID MOD 5000 + 20000" for the port. Depending on the job turnaround in your cluster, you may modify it in all locations where it's defined. To force the smpds not to fork themselves into daemon land, they are started with the additional parameter "-d 0". According to the MPICH2 team, this will not have any speed impact (because the level of debugging is set to 0), but only prevent the daemons from forking. Having set this up properly, we can submit the demonstration job:

$ qsub -pe mpich2_smpd 4 mpich2_smpd.sh

and observe the distributed tasks on the nodes, after checking which nodes were selected:

$ qstat -g t
job-ID prior name user state submit/start at queue master ja-task-ID
------------------------------------------------------------------------------------------------------------------
643 0.55500 mpich2_smp reuti r 11/25/2008 13:11:37 all.q@pc15370.Chemie.Uni-Marbu SLAVE
all.q@pc15370.Chemie.Uni-Marbu SLAVE
643 0.55500 mpich2_smp reuti r 11/25/2008 13:11:37 all.q@pc15381.Chemie.Uni-Marbu MASTER
all.q@pc15381.Chemie.Uni-Marbu SLAVE
all.q@pc15381.Chemie.Uni-Marbu SLAVE

On the head node of the MPICH2 job, a process distribution like the following can be observed:

$ ssh pc15381 ps -e f --cols=120
PID TTY STAT TIME COMMAND
...
22110 ? Sl 1:09 /usr/sge/bin/lx24-x86/sge_execd
2446 ? S 0:00 \_ sge_shepherd-643 -bg
2518 ? Ss 0:00 | \_ /bin/sh /var/spool/sge/pc15381/job_scripts/643
2519 ? S 0:00 | \_ mpiexec -n 4 -machinefile /tmp/643.1.all.q/machines -port 20643 /home/reuti/mpihe
2485 ? Sl 0:00 \_ sge_shepherd-643 -bg
2486 ? Ss 0:00 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/643.1/1.pc1
2495 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
2520 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
2521 ? R 0:12 \_ /home/reuti/mpihello
2522 ? R 0:11 \_ /home/reuti/mpihello
2477 ? Sl 0:00 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15381 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -
2497 ? Sl 0:00 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15370 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -

On the slave node only the daemon and the attached processes are shown:

$ ssh pc15370 ps -e f --cols=120
PID TTY STAT TIME COMMAND
...
15848 ? Sl 2:06 /usr/sge/bin/lx24-x86/sge_execd
23121 ? Sl 0:00 \_ sge_shepherd-643 -bg
23122 ? Ss 0:00 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15370/active_jobs/643.1/1.pc1
23130 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
23131 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
23132 ? R 0:32 \_ /home/reuti/mpihello
23133 ? R 0:32 \_ /home/reuti/mpihello

The qrsh commands forked off by startmpich2.sh (and the start_mpich2 program) are no longer bound to the starting script given in start_proc_args, but they neither consume any CPU time nor need to be shut down during a qdel (they are just waiting for the shutdown of the spawned daemons on the slave nodes). What is important is that the working tasks of mpihello are bound to the process chain, so that the accounting will be correct and a controlled shutdown of the daemons is possible. To give the user some feedback about the started tasks, the *.po$JOB_ID file will contain the check of the started MPICH2 universe:

$ cat mpich2_smpd.sh.po643
-catch_rsh /var/spool/sge/pc15381/active_jobs/643.1/pe_hostfile /home/reuti/local/mpich2-1.0.8/smpd
pc15381
pc15381
pc15370
pc15370
/usr/sge/bin/lx24-x86/qrsh -inherit pc15381 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
/usr/sge/bin/lx24-x86/qrsh -inherit pc15370 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
startmpich2.sh: check for smpd daemons (1 of 10)
startmpich2.sh: found running smpd on pc15381
startmpich2.sh: found running smpd on pc15370
startmpich2.sh: got all 2 of 2 nodes
-catch_rsh /home/reuti/local/mpich2-1.0.8/smpd
shutdown smpd on pc15370
shutdown smpd on pc15381

If all is running fine, you may comment out these lines to shorten the output a little bit and avoid any confusion for the user. Depending on your personal taste, you may put the definition of your MPICH2 path in a file like .bashrc, which will be sourced during a non-interactive login.

Note: it is mandatory that your job script includes a line "export SMPD_OPTION_NO_DYNAMIC_HOSTS=1" besides the port identification. Otherwise, the node where the job script is running will be added to your ~/.smpd, which will prevent a proper shutdown, although this environment variable is already set during the start and stop of the daemons in the appropriate scripts of the PE. The option -V is also used in the accompanying scripts for this Howto.
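Putting these pieces together, the relevant lines of the demonstration script mpich2_smpd.sh might look like the following sketch (paths and the port formula follow the examples above; the actual supplied script may differ in details):

#!/bin/sh
# daemon-based smpd startup
export PATH=/home/reuti/local/mpich2-1.0.8/smpd/bin:$PATH
# same port formula as used in startmpich2.sh and stopmpich2.sh
port=$(($JOB_ID % 5000 + 20000))
# mandatory, see the note above
export SMPD_OPTION_NO_DYNAMIC_HOSTS=1
mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port $HOME/mpihello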


Tight Integration of the gforker startup method

Finally we discuss the integration of a startup method which is limited to one machine and hence needs no network communication at all. The command line to compile MPICH2 this way is:
$ ./configure --prefix=/home/reuti/local/mpich2-1.0.8/gforker --with-pm=gforker

After the usual make and make install we can compile the short program which is supplied in [2] with:

$ mpicc -o mpihello mpihello.c

Although we will run on one machine only, we will use a parallel environment (PE) in SGE, to conform with the SGE convention of requesting more than one slot by requesting a parallel environment in the submit command. This PE may look like:

$ qconf -sp mpich2_gforker
pe_name mpich2_gforker
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE

Remember to add this PE to a cluster queue of your choice.
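The demonstration script mpich2_gforker.sh essentially only needs to set the matching path and start mpiexec; a minimal sketch (the actual supplied script may differ in details):

#!/bin/sh
# gforker startup: all processes are forked on the local machine
export PATH=/home/reuti/local/mpich2-1.0.8/gforker/bin:$PATH
mpiexec -n $NSLOTS $HOME/mpihello

The job can then be submitted with: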

$ qsub -pe mpich2_gforker 4 mpich2_gforker.sh

And with:

$ ssh pc15370 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
15848 1 15848 /usr/sge/bin/lx24-x86/sge_execd
7445 15848 7445 \_ sge_shepherd-647 -bg
7447 7445 7447 \_ /bin/sh /var/spool/sge/pc15370/job_scripts/647
7448 7447 7447 \_ mpiexec -n 4 /home/reuti/mpihello
7449 7448 7447 \_ /home/reuti/mpihello
7450 7448 7447 \_ /home/reuti/mpihello
7451 7448 7447 \_ /home/reuti/mpihello
7452 7448 7447 \_ /home/reuti/mpihello

we can see that we already get the proper startup and Tight Integration of all started processes.



Nodes with more than one network interface
With version 1.0.8 of MPICH2 it's possible to direct the network communication to a dedicated interface. For this to work, you have to adjust the generated machine file, i.e. the file $TMPDIR/machines created by the script defined in start_proc_args, to include the interface name after the number of slots, e.g.:

node01:2 ifhn=node01-grid
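One possible way to append such an interface name to every line of the generated file (this is not part of the supplied scripts, and the "-grid" suffix is just an example for the naming scheme of the dedicated interface) would be:

$ sed -i -e 's/^\([^:]*\)\(.*\)$/\1\2 ifhn=\1-grid/' $TMPDIR/machines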


References and Documents
SGE-MPICH2 Integration

[1] Archive with all the scripts used in this Howto: mpich2-62.tgz. It should be installed in your $SGE_ROOT.

[2] Archive with a small MPICH2 program to check the correct distribution of all the tasks: mpihello.tgz.

MPICH2

The latest version of MPICH2 and build instructions can be downloaded from http://www.mcs.anl.gov/research/projects/mpich2/.

MPI documentation in general and tutorials

For some general introduction to MPI and MPI-Programming, you can study the following documents: