Topic:
Loose and tight integration of the LAM/MPI library into SGE.
Author:
Reuti, reuti__at__staff.uni-marburg.de; Philipps-University of Marburg, Germany
Version:
1.0a -- 2005-03-23 Initial release, comments and corrections are welcome
Note:
This HOWTO complements the information contained in the man pages of SGE and the Administration Guide.
Acknowledgement:
Many thanks to Sean Dilda for pointing out a bug in the lamd_wrapper and eliminating an ambiguity.
Configuration of SGE with qconf or the GUI - Compilation of LAM/MPI - Cluster setup
You should be familiar with modifying the SGE environment such as the queue definitions and parallel environment (PE) configuration. Additional information on the SGE parallel environment is available from the "sge_pe" man pages (make sure the SGE man pages are defined in your $MANPATH). You should also have access to the LAM/MPI source code, and know how to build and install it. More information can be found in the LAM/MPI Documentation (see References and Documents). For the "Loose integration using rsh" startup method, a working (passwordless) login between the nodes in the cluster is also required.
Target platform
This Howto targets LAM/MPI 7.1.1 on the Linux platform. Because of the modifications made to the source code and the startup sequence introduced here, the integrations using qrsh will not work on platforms where SGE by default uses the additional group feature for a proper shutdown.
LAM/MPI
The LAM/MPI library from Indiana University (http://www.lam-mpi.org) is an MPI implementation. In contrast to other implementations like MPICH (http://www-unix.mcs.anl.gov/mpi/mpich), it uses daemons on all nodes involved in a parallel job. One advantage of this approach is that programs issuing many mpirun commands during their execution need a "conventional" communication to the nodes only once, to start the daemons. All further communication is handled by the already running daemons, so additional mpirun calls will start up faster.
Available startup methods in SGE
For now, SGE offers no startup method which allows a program started on a node to vanish into daemon land and still be under SGE control for correct accounting and proper shutdown of all slave tasks. Therefore the loose integrations rely on a shutdown of the daemons by a conventional LAM/MPI command, while the tight integration uses a 2-stage startup, which was introduced by Christopher Duncan for earlier LAM/MPI versions [1].
Included setups and scripts
All three methods of starting the LAM/MPI daemons introduced here can be installed in parallel, as each has its own PE and script directory. This way you can play around with them and decide which one to install in your cluster. While walking through this Howto, you can install all of them in a shared directory, e.g. your home directory. For the final installation they could be put in a conventional place like $SGE_ROOT. The scripts, which are only slight modifications of the original scripts in the $SGE_ROOT/mpi directory, can be downloaded for your convenience (the inserted lines are commented) [2].
Reading of this document
To avoid explaining the complete LAM/MPI behavior in each of the possible startup methods, this document should be read from the beginning onward, even if you are only interested in the Tight integration. Some of the special settings of the scripts and PEs become self-explanatory once the default behavior is known.
Characteristic of the Startup Method
Using a simple (passwordless) rsh between the nodes, the daemons on the slave nodes are started up.
Since LAM/MPI 7.1.1 is SGE-aware, it will create a job-specific directory of the form "lam-$USER@$HOSTNAME-sge-$JOB_ID-$SGE_TASK_ID" on the nodes of the job. The directories are therefore unique for each job and do not interfere if two (or more) LAM/MPI jobs are running on the same node. They are created in $TMPDIR if it is set, and in /tmp otherwise. As SGE sets $TMPDIR in this case only on the headnode of the job, the directory goes into the $TMPDIR created and removed by SGE there, and into /tmp on all the slave nodes. The removal of these directories on the slave nodes takes place in the stop_proc_args script of the PE, when the lamhalt command is executed. In most cases this will work as intended, whether the job finishes by a normal end of the program or is stopped with qdel. Don't change $TMPDIR to a different location in your job script, because mpirun can't execute correctly then, as it won't find any information about the current setup. If you must change it for any reason, it must also be changed in the start_proc_args and stop_proc_args scripts of the PE.
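The directory name described above can be reproduced in a job script, for example to log where LAM/MPI keeps its session data. This is a minimal sketch: the function name lam_session_dir is hypothetical and not part of LAM/MPI or SGE, and it only follows the naming pattern and the $TMPDIR-then-/tmp lookup stated in this section.

```sh
#!/bin/sh
# Hypothetical helper: print the session directory LAM/MPI 7.1.1 is expected
# to use for the current SGE job, following the pattern given in the text.
lam_session_dir() {
    # Fall back to /tmp when $TMPDIR is unset or empty
    base=${TMPDIR:-/tmp}
    printf '%s/lam-%s@%s-sge-%s-%s\n' \
        "$base" "$USER" "$HOSTNAME" "$JOB_ID" "$SGE_TASK_ID"
}
```

Called inside a job script, lam_session_dir prints one path; on a slave node reached by plain rsh, where $TMPDIR is not set, the result would start with /tmp.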
As SGE provides no environment on the slave nodes, this startup behaves much like an interactive startup. The path to the LAM/MPI binaries must therefore also be defined on the slave nodes, in a file which is sourced during a non-interactive login.
Hostfile a.k.a. Machinefile Format
Although LAM/MPI supports a machinefile format where each node is suffixed by the number of slots to be used on this machine (e.g. "node00 cpu=2"), the default format of the generated machinefile is sufficient: each machine is listed in this file as many times as there are slots to be used on it.
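As an illustration of the two machinefile formats, the following sketch converts an SGE $pe_hostfile (one line per host: host name, slot count, queue, processor range) into either form. The function names are hypothetical; the scripts supplied with SGE and with this Howto already perform the first conversion, so this is only meant to show what the result looks like.

```sh
#!/bin/sh
# Hypothetical helpers, not the code from startlam.sh.
# $1 is the path of the pe_hostfile in both functions.

# Default format: repeat each host once per granted slot.
hostfile_repeated() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Alternative LAM/MPI format: one line per host with a cpu=N suffix.
hostfile_cpu_suffix() {
    awk '{ print $1 " cpu=" $2 }' "$1"
}
```

Both functions write to standard output, so the result can be redirected into the machinefile passed to lamboot.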
Setup of the PE in SGE
We will name this PE "lam_loose_rsh" and set the following entries:
    $ qconf -sp lam_loose_rsh
    pe_name           lam_loose_rsh
    queue_list        para21 para22
    slots             4
    user_lists        NONE
    xuser_lists       NONE
    start_proc_args   /home/reuti/lam_loose_rsh/startlam.sh $pe_hostfile
    stop_proc_args    /home/reuti/lam_loose_rsh/stoplam.sh
    allocation_rule   $round_robin
    control_slaves    FALSE
    job_is_first_task TRUE

This setup is for SGE 5.3, where the queues to be used are defined in the PE. For SGE 6.x, the PE must be specified in the cluster queue, in which it should be used in the entry "pe_list".
Sample job execution
Most likely you already have an MPI program based on LAM/MPI which runs successfully in interactive mode. Otherwise you will find a sample program in the archive supplied with this Howto, which will run until a qdel of the job, so that the process distribution to the nodes can easily be checked. Let's take this job script:
    $ cat tester_mpi.sh
    #!/bin/sh
    export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
    mpirun C /home/reuti/mpihello

Specifying the option "C" to mpirun will use all of the slots granted to this job, which is most likely what you want all of the time. If all went right, we can see the job running, and also check the involved nodes for the running jobs:
    $ qsub -pe lam_loose_rsh 2 tester_mpi.sh
    your job 10389 ("tester_mpi.sh") has been submitted
    $ qstat
    job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID
    ---------------------------------------------------------------------------------------------
      10389     0 tester_mpi reuti        r     02/20/2005 15:25:31 para21     SLAVE
      10389     0 tester_mpi reuti        r     02/20/2005 15:25:31 para22     MASTER
    $ rsh node21 ps -e f -o pid,ppid,pgrp,command --cols=80
      PID  PPID  PGRP COMMAND
    ...
     8386     1  8386 /home/reuti/local/lam-7.1.1/bin/lamd -H 192.168.151.23 -P 4288
     8387  8386  8386  \_ /home/reuti/mpihello
    $ rsh node22 ps -e f -o pid,ppid,pgrp,command --cols=80
      PID  PPID  PGRP COMMAND
    ...
    24576     1 24576 /home/reuti/local/lam-7.1.1/bin/lamd -H 192.168.151.23 -P 4288
    24581 24576 24576  \_ /home/reuti/mpihello

Using this setup, the issued qdel, which will execute the lamhalt in the stop_proc_args script of the PE, is the only way to remove the running processes and created directories on the nodes (except the directory on the headnode of the job, which is removed by SGE as usual).
Another, more sophisticated solution, given here in "pseudo code", is to run "rsh <node> 'ps some_jobid | grep process_leader | (cut process_group) * -1 | xargs kill -9 --'" in a loop over all involved machines. As we are interested in a Tight integration, we will not dig further into this; it is up to the reader to implement it.
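The pseudo code can be made a little more concrete. The sketch below only covers the selection step: it reads a process listing in the pid,ppid,pgrp,command layout used in the example above and prints the negated process group ID of every lamd that leads its own process group. The function name is hypothetical, and unlike the pseudo code it does not filter by job, so on a node shared by several LAM/MPI jobs it would select all of them; treat it as a starting point only.

```sh
#!/bin/sh
# Hypothetical helper: read "PID PPID PGRP COMMAND" lines on stdin and print
# "-PGRP" for each lamd that is a process group leader (PID equals PGRP).
lamd_group_targets() {
    awk '$1 == $3 && $4 ~ /(^|\/)lamd$/ { print "-" $3 }'
}

# Possible use on one node (assumption: executed there, e.g. through rsh):
#   ps -e -o pid,ppid,pgrp,command | lamd_group_targets | xargs kill -9 --
```

A negative number passed to kill after "--" addresses the whole process group, so the MPI processes started by that lamd would be removed together with it.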
Pitfall during startup of the Daemons
The daemon on the nodes is started by the LAM/MPI program hboot, which forks a child process, which in turn is replaced by the lamd. While this works without any modification on the headnode of the job, where hboot is started locally without rsh, we face a problem on the slave nodes, where hboot is started through qrsh: when the parent task ends, SGE will discover this and kill the whole process group of the job. This would also kill the just started lamd, as it is still in the same process group as hboot was before.
Modification
The solution to this problem is the command setsid, which will enforce a new processgroup for the child task. This has to be added in line 317 of hboot.c (which can be found in your lam-7.1.1 sourcecode installation in the directory tools/hboot), after the child forks. The program section:
    else if (pid == 0) {            /* child */
        if (ao_taken(ad, "s")) {

should become:

    else if (pid == 0) {            /* child */
        setsid();
        if (ao_taken(ad, "s")) {

Then the executable hboot must be built again with the conventional make inside the lam-7.1.1 source code installation, and the resulting binary hboot copied by hand to the place where the binaries were installed, e.g. ~/local/lam-7.1.1/bin.
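The effect of setsid() can be observed from the shell with the setsid utility from util-linux, which performs the same system call before running a command. The sketch below is Linux-specific (it reads the process group from /proc) and the function names are hypothetical; it only demonstrates that the started process leaves the caller's process group, which is what lets lamd survive the kill of hboot's group.

```sh
#!/bin/sh
# Field 5 of /proc/<pid>/stat is the process group ID (see proc(5)).
# awk reads its own entry; it inherits the group of whoever started it.
pgrp_inherited() {
    awk '{ print $5 }' /proc/self/stat
}

# Same query, but started through setsid(1): new session, new process group.
pgrp_after_setsid() {
    setsid sh -c "awk '{ print \$5 }' /proc/self/stat"
}
```

Calling both functions from the same shell should print two different numbers.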
No warranty
As usual, there is no guarantee that this will work successfully on all Linux distributions and versions. Use it at your own risk.
Different behavior on the slave nodes compared to "Loose integration using rsh"
Because we are now using qrsh to start the daemons, no rsh is necessary between the nodes. It is still set up here only to have easy access to the nodes. The qrsh used will now also create a temporary directory on the slave nodes and set $TMPDIR accordingly. Unfortunately, this directory is deleted when the job (i.e. the executed hboot) ends, so it can't be used to store the information LAM/MPI creates about the daemon setup. As $TMPDIR is the only environment variable which can't be exported to the nodes by using the "-v" or "-V" option of qrsh, we have to use another way. Otherwise you will see the job starting as intended, and the lamhalt will also look successful, but after a qdel the processes will still be running on the slave nodes, as LAM/MPI stores the PIDs of the processes which need to be killed in its special directory.
The supplied rsh-wrapper in the mpi subdirectory in $SGE_ROOT already has the option to supply a prefix command, which should be executed before the command on the slave node. This can be set in the start_proc_args script of the PE:
    export RCMD_PREFIX="export TMPDIR=/tmp;"

This way, the directories created by LAM/MPI are in the same place as with the plain rsh. Make sure that you have also changed the path to the actual location of your rsh-wrapper in the supplied script startlam.sh of the archive.
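To see what such a prefix does, the following sketch runs a command with the prefix prepended. It does not reproduce the rsh-wrapper from $SGE_ROOT/mpi: the function run_with_prefix is hypothetical, and a local "sh -c" stands in for the shell on the slave node, on the assumption that prefix and command arrive there as one string.

```sh
#!/bin/sh
# Stand-in for the remote side: a local "sh -c" receives the prefix and the
# command as one string (assumption made for illustration only).
run_with_prefix() {
    sh -c "$RCMD_PREFIX $*"
}

RCMD_PREFIX="export TMPDIR=/tmp;"
export RCMD_PREFIX
```

With this setting, run_with_prefix 'echo $TMPDIR' prints /tmp regardless of the $TMPDIR of the calling shell, because the export in the prefix is executed first.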
Setup of the PE in SGE
We will name this PE "lam_loose_qrsh" and set the following entries:
    $ qconf -sp lam_loose_qrsh
    pe_name           lam_loose_qrsh
    queue_list        para21 para22
    slots             4
    user_lists        NONE
    xuser_lists       NONE
    start_proc_args   /home/reuti/lam_loose_qrsh/startlam.sh -catch_rsh $pe_hostfile
    stop_proc_args    /home/reuti/lam_loose_qrsh/stoplam.sh
    allocation_rule   $round_robin
    control_slaves    TRUE
    job_is_first_task TRUE

To allow the execution of the qrsh commands, the control_slaves entry must be set to TRUE in this case. No qrsh is necessary on the headnode of the job, so we can safely set job_is_first_task to TRUE.
Behavior on platforms with process control by the Additional Group Feature
Although the started lamd would be in a new process group, the reliable shutdown of all the processes using the additional group feature (see: man setgroups) by SGE would also kill the just started daemon, as it can't escape from this additional group.
Disadvantages and Advantages of the Loose Integration
Whichever of the two loose integrations you choose, they share most of their characteristics, as the only difference is the usage of qrsh instead of the plain rsh:
- - wrong accounting
- - no processes/directories controlled by SGE, removal relies on the lamhalt command
If all went okay, the result in case of a normal program stop or an abort with qdel is:
- + no semaphores or shared memory segments left behind
- + processes and directories removed on the slave nodes
As the output is exactly the same as with the "Loose Integration using rsh", we skip it here, because it will not present any new information or facts. The scripts for this startup method are of course also included in the supplied script archive to this Howto.
2-stage Startup for Tight Integration
For now, there is no direct possibility in SGE to start a program in daemon-mode on a node, which is still under control of SGE for correct accounting and removal of the processes.
The original idea by Christopher Duncan in [1] was to first make a conventional qrsh to the slave node, and on this slave node a local-qrsh to start the final daemon. This is possible with LAM/MPI, as the daemon does not push itself into the background (as daemonizing programs mostly do); instead, the starting hboot does the forking. At this step, it is possible to insert the local-qrsh call and achieve the desired effect.
Additional changes to LAM/MPI
For this to work, it is necessary to use the already patched hboot, like in the chapter "Loose Integration using qrsh". Another change is now necessary to insert the local-qrsh. Therefore the following steps must be executed:
- move to your binary installation of LAM/MPI, e.g. ~/local/lam-7.1.1/bin
- rename lamd to lamd_binary
- create a file lamd_wrapper with the following content:
    #!/bin/sh
    #
    # Modified startup of lamd to get it working in a controlled
    # setup by SGE. To be used for interactive usage like usual.
    #
    LAMD_BINARY=lamd_binary
    if [ "$ENVIRONMENT" = "BATCH" -a "$PE" = "lam_tight_qrsh" ] ; then
        COUNTER=5
        while (( COUNTER-- > 0 )) ; do
            if (( `ps --no-headers -o ppid -p $$` != 1 )) ; then
                :
            else
                :
                exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $HOSTNAME $LAMD_BINARY "$@"
            fi
        done
    else
        exec $LAMD_BINARY "$@"
    fi
    #
    # If we are lucky, we never get here.
    #
    exit -1

- make the lamd_wrapper executable, i.e.: chmod +x lamd_wrapper
- make a symbolic link from lamd to lamd_wrapper, i.e.: ln -s lamd_wrapper lamd
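The installation steps listed above can be collected in one small function, sketched here under the assumption that the wrapper script has already been saved to a file. The function name install_lamd_wrapper and its arguments are hypothetical; the guard against a second run is an addition that the list does not mention.

```sh
#!/bin/sh
# Hypothetical helper that performs the steps listed above.
#   $1: LAM/MPI bin directory, e.g. ~/local/lam-7.1.1/bin
#   $2: path to the saved lamd_wrapper script
install_lamd_wrapper() {
    bindir=$1
    wrapper=$2
    # Refuse to run twice, otherwise the real daemon would be overwritten
    if [ -e "$bindir/lamd_binary" ]; then
        echo "lamd_binary already exists in $bindir" >&2
        return 1
    fi
    mv "$bindir/lamd" "$bindir/lamd_binary" || return 1
    cp "$wrapper" "$bindir/lamd_wrapper" || return 1
    chmod +x "$bindir/lamd_wrapper" || return 1
    # Relative link, so the bin directory can be moved as a whole
    (cd "$bindir" && ln -s lamd_wrapper lamd)
}
```

A call could look like: install_lamd_wrapper ~/local/lam-7.1.1/bin ./lamd_wrapper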
With this script, the installed LAM/MPI can still be used interactively or with the loose integration scripts. So nothing must be changed back, to get the initial behavior.
Setup of the PE in SGE
A new PE "lam_tight_qrsh" will reflect the changes for this tight integration:
    $ qconf -sp lam_tight_qrsh
    pe_name           lam_tight_qrsh
    queue_list        para21 para22
    slots             4
    user_lists        NONE
    xuser_lists       NONE
    start_proc_args   /home/reuti/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
    stop_proc_args    /home/reuti/lam_tight_qrsh/stoplam.sh
    allocation_rule   $round_robin
    control_slaves    TRUE
    job_is_first_task FALSE

As we now make a qrsh also on the headnode of the parallel job, the job_is_first_task must be set to FALSE. It may happen, that we get only one slot on this machine.
Program flow of the wrapper script
SGE will control the number of qrsh calls made to each node, so the idea is to wait a short time until the parent PID of this lamd_wrapper has become 1. This will happen when the starting hboot has already left the machine.
[Side note: In case there are some race conditions and SGE still prevents the second local-qrsh from being made (although the parent PID has already become 1), the allocation_rule could be changed to just 2 (and so be limited to dual-CPU machines); this would give exactly the behavior of Christopher Duncan's initial Perl script. In this case there is no need to wait at all, and the entry job_is_first_task could be changed to TRUE (at least one local qrsh is allowed in this case).]
On slow systems, the number in the wait loop can be increased to a higher value, although even on already loaded machines it was never observed that the waiting loop in the lamd_wrapper was actually executed. The counter is only in the script to avoid an endless loop in case anything goes wrong.
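The bounded wait described here can be sketched as a small retry helper. This is not the code of lamd_wrapper: the names retry_until and parent_is_init are hypothetical, the pause between tries is an assumption, and the parent lookup reads /proc, so it is Linux-specific.

```sh
#!/bin/sh
# Succeeds when the parent of process $1 is init (PID 1).
parent_is_init() {
    [ "$(awk '/^PPid:/ { print $2 }' "/proc/$1/status")" = "1" ]
}

# retry_until TRIES DELAY COMMAND [ARGS...]
# Run COMMAND up to TRIES times, sleeping DELAY seconds after each failure.
# Returns 0 on the first success, 1 when all tries are used up.
retry_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# In a wrapper this could read (5 tries, one second apart):
#   retry_until 5 1 parent_is_init "$$" || exit 1
```

Because the number of tries is fixed, the helper always returns, which matches the purpose of the counter in the wrapper script.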
Again, the startlam.sh script needs to be adjusted, so that the actual location of the rsh wrapper is used here.
Sample execution
Using the same commands as in the first example, the results show the tight integration of LAM/MPI:
    $ qsub -pe lam_tight_qrsh 2 tester_mpi.sh
    your job 10395 ("tester_mpi.sh") has been submitted
    $ qstat
    job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID
    ---------------------------------------------------------------------------------------------
      10395     0 tester_mpi reuti        r     02/20/2005 18:39:25 para21     MASTER
                0 tester_mpi reuti        r     02/20/2005 18:39:25 para21     SLAVE
      10395     0 tester_mpi reuti        r     02/20/2005 18:39:25 para22     SLAVE
    $ rsh node21 ps -e f -o pid,ppid,pgrp,command --cols=80
      PID  PPID  PGRP COMMAND
    ...
      789     1   789 /usr/sge/bin/glinux/sge_execd
     8945   789  8945  \_ sge_shepherd-10395 -bg
     8993  8945  8993  |   \_ /bin/sh /var/spool/sge/node21/job_scripts/10395
     8994  8993  8993  |       \_ mpirun C /home/reuti/mpihello
     8972   789  8972  \_ sge_shepherd-10395 -bg
     8973  8972  8973      \_ /usr/sge/utilbin/glinux/rshd -l
     8975  8973  8975          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
     8976  8975  8976              \_ lamd_binary -H 192.168.151.22 -P 55513 -n 0 -o
     8995  8976  8976                  \_ /home/reuti/mpihello
     8971     1  8971 /usr/sge/bin/glinux/qrsh -V -inherit -nostdin node21 lamd_bina
     8974  8971  8971  \_ /usr/sge/utilbin/glinux/rsh -n -p 55517 node21 exec '/usr/
    $ rsh node22 ps -e f -o pid,ppid,pgrp,command --cols=80
      PID  PPID  PGRP COMMAND
    ...
      789     1   789 /usr/sge/bin/glinux/sge_execd
    25180   789 25180  \_ sge_shepherd-10395 -bg
    25181 25180 25181      \_ /usr/sge/utilbin/glinux/rshd -l
    25183 25181 25183          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
    25184 25183 25184              \_ lamd_binary -H 192.168.151.22 -P 55513 -n 1 -o
    25185 25184 25184                  \_ /home/reuti/mpihello
    25179     1 25179 /usr/sge/bin/glinux/qrsh -V -inherit -nostdin node22 lamd_bina
    25182 25179 25179  \_ /usr/sge/utilbin/glinux/rsh -n -p 49142 node22 exec '/usr/

The escaped processes don't execute any time or resource consuming program, and can just stay there out of control of SGE. The real executing lamd and child processes are under SGE control, which was the goal to be achieved by this setup.
These processes are killed when the job finishes on its own or when a qdel is issued. In the latter case, a proper shutdown using lamhalt is not possible, since lamd was already shut down and the directory containing the LAM/MPI-specific information was already removed by SGE.
Advantages and Disadvantages of the Tight Integration
With this setup, the advantages and disadvantages are swapped compared to the loose integrations. The advantages are:
- + correct accounting
- + processes/directories controlled by SGE, definitely removed at the end of the job
The drawback is in this case:
- - semaphores or shared memory segments may be left behind, if the job was aborted with qdel
Depending on the usage of the nodes, there may be a need to make some cleanup of the semaphores or shared memory segments from time to time, or in the stop_proc_args of the PE, in case there are often jobs killed by qdel.
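One way such a cleanup could look is sketched below. The helper only selects the IDs from the output of ipcs; its name is hypothetical, and the removal commands in the comment rest on the assumption that no other job of the same user is running on the node, since their resources would be removed as well.

```sh
#!/bin/sh
# Hypothetical helper: read "ipcs -s" or "ipcs -m" output on stdin and print
# the IDs (second column) of all entries owned by user $1 (third column).
ipc_ids_of() {
    awk -v owner="$1" '$3 == owner && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Possible use on each node of the job (see the assumption above):
#   ipcs -s | ipc_ids_of "$USER" | while read id; do ipcrm -s "$id"; done
#   ipcs -m | ipc_ids_of "$USER" | while read id; do ipcrm -m "$id"; done
```

Keeping the selection separate from the removal makes it easy to inspect the list of IDs before anything is deleted.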
In case of the shutdown of the processes by the additional group feature, the other way round may be successful: usage of the "Tight integration with qrsh" and waiting at the end of hboot, until the lamd is up and running, instead of the inserted setsid. Although the starting local-qrsh will be removed in this case (and so not appear in a process listing like the above one), the started lamd will continue to operate and service for the program execution.
SGE-LAM Integration
[1] Latest version of the script sge-lam by Christopher Duncan. Because it can only be found in the mailing list archives of SGE and LAM/MPI, it's attached here for reference (sge-lam). As there was a small bug inside, this version was patched to remove any "-n" for the remote-qrsh call; this was taken from a former version of this script, where it was included.
[2] Archive with all the scripts used in this Howto: sge-lam-integration-scripts.tgz.
LAM/MPI
The implementation specific documentation can be found at the LAM/MPI website on the page "Documentation" (http://www.lam-mpi.org/using/docs/).
MPI documentation in general and tutorials
For some general introduction to MPI and MPI-Programming, you can study the following documents:
- http://www.mpi-forum.org/docs
- http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
- http://www-unix.mcs.anl.gov/mpi/usingmpi/index.html
- http://www-unix.mcs.anl.gov/mpi/usingmpi2
- ftp://math.usfca.edu/pub/MPI/mpi.guide.ps
- http://www.science.uva.nl/research/scs/edu/pscs/guide.pdf
- http://www.science.uva.nl/research/scs/edu/distr/guide_to_the_practical_work.pdf