Topic:
Loose and tight integration of the PVM library into SGE.
Author:
Reuti, reuti__at__staff.uni-marburg.de; Philipps-University of Marburg, Germany
Version:
1.0 -- 2005-03-26 Initial release, comments and corrections are welcome
Note:
This HOWTO complements the information contained in the $SGE_ROOT/pvm directory of the Grid Engine distribution.
The files originally included for the PVM integration offer only a loose integration of PVM into SGE. This means that the daemons which build up the PVM are not under the control of SGE, and that the files created by PVM for its internal management on the slave nodes will not be deleted correctly in case of a job abort. The accounting will also be incorrect, as the PVM daemons started via rsh on the slave nodes are not related to SGE in any way.

With Tight Integration, on the other hand, you get all of this, and in addition there is no need to have rsh enabled between the nodes, as all startups of the daemons on the slave nodes are handled by the built-in qrsh of SGE.
Configuration of SGE with qconf or the GUI
You should already know how to change settings in SGE, e.g. how to set up and change a queue definition or the entries in the PE configuration. Additional information about queues and parallel environments can be found in the SGE man pages "queue_conf" and "sge_pe" (make sure the SGE man pages are included in your $MANPATH).
Target platform
This Howto targets PVM version 3.4.5 on SGE 6.0. Some new platforms were added to the build step which creates the supporting start_pvm and stop_pvm programs. The other three included programs are only examples and are not needed for the integration. For users of SGE 5.3, a backport is also available.
PVM
PVM (Parallel Virtual Machine) is a framework from the Oak Ridge National Laboratory (http://www.csm.ornl.gov/pvm/pvm_home.html) that provides an interface for parallel programs following the MPMD (multiple program multiple data) paradigm. Before you start with the integration of PVM into SGE, you should already be familiar with the operation of PVM outside of SGE, like starting the daemons and requesting information about the running virtual machine from the PVM console. Although this is not strictly necessary for the integration into SGE, it will ease the understanding of the applied configuration (and the detection of failures in case something goes wrong during operation). No patch or modification to the original PVM distribution is necessary.
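If you have never worked with PVM interactively, a short session outside of SGE could look like the following (node08 is just an example host; the commands after the "pvm>" prompt are standard PVM console commands):

$ pvm             # starts a master pvmd on this host and enters the PVM console
pvm> add node08   # extend the virtual machine by the host node08
pvm> conf         # list the hosts of the current virtual machine
pvm> ps -a        # list the tasks running in the virtual machine
pvm> halt         # shut down all pvmds and leave the console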
Included setups and scripts
The archive supplied in [1] will supersede the provided $SGE_ROOT/pvm. It contains modified scripts and programs of the original PVM integration package in SGE, which now also enable a Tight Integration. You may want to save the original $SGE_ROOT/pvm before you untar this package in the same location. In case you are using (and will continue to use) the Loose Integration, you can still switch to this new package, as by default it mimics the original behavior and your already existing configurations should continue to work without any change.
Another short-running program is provided in [2], which will allow you to observe the correct distribution of the spawned tasks and the removal of the files in /tmp.
As there is only one distribution of $SGE_ROOT/pvm now, the necessary steps to install this new package are the same as for the original integration package. After you have set your working directory to $SGE_ROOT and untarred the supplied archive in this location as the admin user of SGE, you will have to compile some helper programs. To do this, just change to $SGE_ROOT/pvm/src and set some necessary environment variables like:

$ export PVM_ROOT=/home/reuti/pvm3
$ export PVM_ARCH=LINUX
$ export SGE_ROOT=/usr/sge

Of course you have to adjust these locations to your own installation. Be aware that the naming convention of PVM_ARCH differs from the architecture name SGE detects. On a Macintosh this leads e.g. to the situation that SGE refers to it as "darwin", while PVM uses "DARWIN" (for Linux it may be "lx24-x86" for SGE and "LINUX" for PVM). With these settings you can compile the programs by just issuing ./aimk in $SGE_ROOT/pvm/src. You should now have a directory (e.g. lx24-x86) containing the compiled programs. The script ./install.sh, executed in the same location, will copy these files to the correct target in $SGE_ROOT/pvm/bin/<arch>, where startpvm.sh and stoppvm.sh expect them. The two programs start_pvm and stop_pvm (besides the three examples) help the startpvm.sh and stoppvm.sh scripts to set up the PVM in a proper way: start_pvm forks a child process which starts the PVM, while the parent watches for a correct startup and only returns successfully if the requested PVM could be established; stop_pvm just sends a halt command to the PVM.
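With the environment variables from above in place, the whole build boils down to:

$ cd $SGE_ROOT/pvm/src
$ ./aimk          # builds start_pvm, stop_pvm and the three example programs
$ ./install.sh    # copies them to $SGE_ROOT/pvm/bin/<arch>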
Using the command line qconf or the qmon GUI, a sample setting for a PVM PE may look like:
$ qconf -sp pvm
pe_name           pvm
slots             46
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/pvm/startpvm.sh $pe_hostfile $host /home/reuti/pvm3
stop_proc_args    /usr/sge/pvm/stoppvm.sh $pe_hostfile $host
allocation_rule   1
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min

This setup is for SGE 6.x, where the PE must be listed in the "pe_list" entry of each cluster queue in which it should be used. For SGE 5.3 the queues to be used are defined in the PE itself. To have a working example which runs for a few seconds, we can compile hello.c and hello_other.c in your home directory (taken from the PVM User's Guide; you can find them in the supplied pvm_hello.tgz archive) with:
$ gcc -o hello hello.c -I/home/reuti/pvm3/include -L/home/reuti/pvm3/lib/LINUX -lpvm3
$ gcc -o hello_other hello_other.c -I/home/reuti/pvm3/include -L/home/reuti/pvm3/lib/LINUX -lpvm3

Prepared with this, we can submit a simple PVM job and observe the behavior on the nodes:
$ qsub -pe pvm 2 tester_loose.sh

The start script of the PE will start the daemons on the (in this case two) nodes, and the started program will spawn two additional processes on each pvmd (besides the main routine). You may implement such a logic when the main process hello only collects the results of the hello_other tasks and does not do any active work on its own.
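The job script itself has very little to do, since the PE's startpvm.sh already set up the virtual machine before the job script runs. Purely as an illustration (the actual tester_loose.sh supplied in [1] may look different), it could be as simple as:

#!/bin/sh
#$ -S /bin/sh
# Sketch of a job script for the Loose Integration; the real
# tester_loose.sh in [1] may differ. The pvmds were already started
# by startpvm.sh, so only the master program has to be run here.
$HOME/hello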
On the head node of the job you will see a process structure like this:

$ rsh node16 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  787     1   787 /usr/sge/bin/glinux/sge_commd
  789     1   789 /usr/sge/bin/glinux/sge_execd
 2999   789  2999  \_ sge_shepherd-11233 -bg
 5340   789  5340  \_ sge_shepherd-11262 -bg
 5359  5340  5359      \_ /bin/sh /var/spool/sge/node16/job_scripts/11262
 5360  5359  5359          \_ ./hello
 5350     1  5341 /home/reuti/pvm3/lib/LINUX/pvmd3 /tmp/11262.1.para16/hostfile
 5361  5350  5341  \_ /home/reuti/hello_other

On the slave node, only the pvmd (which has escaped from SGE's control) and the started hello_other can be found:
$ rsh node08 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  787     1   787 /usr/sge/bin/glinux/sge_commd
  789     1   789 /usr/sge/bin/glinux/sge_execd
25950     1 25940 /home/reuti/pvm3/lib/LINUX/pvmd3 -s -d0x0 -nnode08 1 c0a89711:8110 4080 2 c0a
25951 25950 25940  \_ /home/reuti/hello_other

All files will go in this case to the default directory for PVM: /tmp.
$ rsh node16 ls -lh /tmp
total 68K
drwxr-xr-x   2 reuti  users  4.0K Mar 25 23:21 11262.1.para16
drwx------   2 root   root    16K Apr 23  2004 lost+found
-rw-------   1 reuti  users    19 Mar 25 23:21 pvmd.502
-rw-------   1 reuti  users   127 Mar 25 23:21 pvml.502
srwxr-xr-x   1 reuti  users     0 Mar 25 23:21 pvmtmp005350.0

$ rsh node08 ls -lh /tmp
total 64K
drwx------   2 root   root    16K Apr 23  2004 lost+found
-rw-------   1 reuti  users    19 Mar 25 23:21 pvmd.502
-rw-------   1 reuti  users   126 Mar 25 23:21 pvml.502
srwxr-xr-x   1 reuti  users     0 Mar 25 23:21 pvmtmp025940.0

In case of a proper shutdown of the job and the PVM, all these files but pvml.502 (where the number reflects your user ID) will be deleted again. Unless you enable the commented-out PVM_VMID in the start/stop scripts of this PE (and also use it in your job script by setting "PVM_VMID=$JOB_ID; export PVM_VMID"), you are limited to one PVM per user per machine. Setting PVM_VMID to the SGE supplied $JOB_ID will append this value to all files generated by PVM and thus make them unique for each job.
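With PVM_VMID enabled in startpvm.sh and stoppvm.sh, the only change to the job script are the lines quoted above, placed before the master program is started (again just a sketch):

# only needed when PVM_VMID is also enabled in startpvm.sh/stoppvm.sh:
PVM_VMID=$JOB_ID
export PVM_VMID
$HOME/hello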
In case of a job abort via qdel, the started daemons will shut down properly, but the started hello_other programs may continue to run.
To achieve a Tight Integration, some changes were necessary to the already supplied Loose Integration scripts:
- The first started pvmd (which is also responsible for starting the other pvmds on the slave nodes) has to be started by a local qrsh. This is done by a change to start_pvm.c to distinguish between the two startup modes. In case of a Tight Integration, it will now assemble a call to rsh instead of only exec'ing the pvmd directly.
- The rsh-wrapper from $SGE_ROOT/mpi was taken as a starting point and modified to redirect the rsh call in start_pvm to a qrsh. In addition, the temporary directory for PVM has to be set to the temporary directory created by SGE. This is done by prefixing the call to pvmd with "env PVM_TMP=\$TMPDIR", so that PVM_TMP is evaluated on the target node and not on the source node.
- Any built-in rsh command in PVM is overridden by setting "PVM_RSH=rsh; export PVM_RSH" already in startpvm.sh, so that the SGE rsh-wrapper is honored.
- The pvmd started first via qrsh will in turn call qrsh again to start the pvmds on the slave nodes. Because the default behavior is to fork into daemon land, the rsh-wrapper appends the option "-f" to the pvmd command if it discovers that the rsh call is not a local one. The rsh-wrapper thus behaves in two different ways (see the sketch after this list).
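To make the two behaviors more concrete, here is a much-simplified sketch of that dispatch logic, assuming the wrapper is called as "rsh <host> <command ...>". The real rsh-wrapper shipped in [1] handles more details (option parsing, PATH handling) and should be consulted for the actual implementation:

#!/bin/sh
# Simplified sketch of the modified rsh-wrapper; not the real script.
host=$1
shift
if [ "$host" = "`hostname`" ]; then
    # local call (the first pvmd): redirect to a local qrsh and let
    # PVM_TMP be evaluated on the target node (note the escaped $)
    exec qrsh -V -inherit "$host" env PVM_TMP=\$TMPDIR "$@"
else
    # call to a slave node: additionally append "-f" to the pvmd
    # command line, as described above
    exec qrsh -V -inherit "$host" env PVM_TMP=\$TMPDIR "$@" -f
fi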
Having already set up everything for the Loose Integration, you can switch to the Tight Integration by simply changing the definition of your PE: introduce -catch_rsh in the start and stop procedures of the PE and reverse the settings of control_slaves and job_is_first_task to TRUE and FALSE:

$ qconf -sp pvm
pe_name           pvm
slots             46
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/pvm/startpvm.sh -catch_rsh $pe_hostfile $host /home/reuti/pvm3
stop_proc_args    /usr/sge/pvm/stoppvm.sh -catch_rsh $pe_hostfile $host
allocation_rule   1
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

The only thing left to do is to set PVM_TMP in your job script like in tester_tight.sh, as this is the location where your executing program will look for information about the current PVM:
export PVM_TMP=$TMPDIR

After submitting the job in exactly the same way as before (but this time using the script tester_tight.sh in the qsub command):
$ qsub -pe pvm 2 tester_tight.sh

you should see a distribution on the head node of your job like:
$ rsh node18 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  788     1   788 /usr/sge/bin/glinux/sge_commd
  791     1   791 /usr/sge/bin/glinux/sge_execd
 7147   791  7147  \_ sge_shepherd-11238 -bg
 7183  7147  7183  |   \_ /bin/sh /var/spool/sge/node18/job_scripts/11238
 7184  7183  7183  |       \_ ./hello
 7161   791  7161  \_ sge_shepherd-11238 -bg
 7162  7161  7162      \_ /usr/sge/utilbin/glinux/rshd -l
 7164  7162  7164          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sge/node18/active_jobs
 7166  7164  7166              \_ /home/reuti/pvm3/lib/LINUX/pvmd3 /tmp/11238.1.para18/hostfile
 7185  7166  7166                  \_ /home/reuti/hello_other
 7158     1  7148 /usr/sge/bin/glinux/qrsh -V -inherit node18 env PVM_TMP=$TMPDIR /home/reuti/pvm3/l
 7163  7158  7148  \_ /usr/sge/utilbin/glinux/rsh -p 46806 node18 exec '/usr/sge/utilbin/glinux/qrsh
 7165  7163  7148      \_ [rsh <defunct>]
 7173     1  7166 /usr/sge/bin/glinux/qrsh -V -inherit node20 env PVM_TMP=$TMPDIR $PVM_ROOT/lib/pvmd
 7181  7173  7166  \_ /usr/sge/utilbin/glinux/rsh -p 53007 node20 exec '/usr/sge/utilbin/glinux/qrsh
 7182  7181  7166      \_ [rsh <defunct>]

The important thing is that the daemon and the started programs hello and hello_other are under full SGE control. Also, the files which usually go to /tmp are instead placed in the private directory created by SGE for this job:
$ rsh node18 ls -h /tmp/11238.1.para18
hostfile  pid.1.node18  pvmd.502  pvml.502  pvmtmp007166.0  qrsh_client_cache  rsh

The same can be observed on the (in this case only one) slave node:
$ rsh node20 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  787     1   787 /usr/sge/bin/glinux/sge_commd
  790     1   790 /usr/sge/bin/glinux/sge_execd
15397   790 15397  \_ sge_shepherd-11238 -bg
15398 15397 15398      \_ /usr/sge/utilbin/glinux/rshd -l
15399 15398 15399          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sge/node20/active_jobs
15400 15399 15400              \_ /home/reuti/pvm3/lib/LINUX/pvmd3 -s -d0x0 -nnode20 1 c0a89713:814c
15401 15400 15400                  \_ /home/reuti/hello_other

$ rsh node20 ls -h /tmp/11238.1.para20
pid.1.node20  pvmd.502  pvml.502  pvmtmp015400.0

Regardless of whether the job terminates with the intended program end or is aborted with qdel: all processes and intermediate files will be cleanly removed. And we don't have to bother with PVM_VMID, as all files are already separated into different temporary directories.
If your cluster has two (or more) network interfaces in the nodes and you are not using the primary interface for PVM communication, you also have to add -catch_hostname to start_proc_args to catch the call to hostname, as hostname is used by the rsh-wrapper to distinguish between a local qrsh call and one to a slave node. In this hostname-wrapper you must change the response of the original hostname call so that it returns the name of the internal interface, which is also the name used in the SGE supplied $pe_hostfile. If you want the PVM communication on a completely different interface than the one SGE is aware of, you may have a look at $SGE_ROOT/mpi/startmpi.sh and map the names supplied in $pe_hostfile in $SGE_ROOT/pvm/startpvm.sh in a similar way (besides the necessary -catch_hostname).
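Purely as an illustration, assuming the internal interface of every node is reachable under the plain host name with an added "-pvm" suffix (a naming scheme made up for this example), such a hostname-wrapper could look like:

#!/bin/sh
# Hypothetical hostname-wrapper: return the name of the internal
# interface instead of the primary one. The "-pvm" suffix is only an
# assumption for this example; adjust it to your own naming scheme,
# so that the output matches the names in the SGE $pe_hostfile.
echo "`/bin/hostname | cut -d. -f1`-pvm"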
As SGE is granting the nodes/slots to your job, you shouldn't use pvm_addhosts(), pvm_delhosts() or similar calls in your program, as this would violate the SGE scheduling of jobs in the cluster. PVM uses a scheduler of its own internally to distribute the tasks initiated by pvm_spawn() to the nodes (unless you specify the target node directly in pvm_spawn() yourself [according to the SGE granted nodes/slots, of course]). The built-in default is just a round-robin scheme across all granted nodes. So a fixed allocation_rule of 1 (or 2 for dual CPU machines) is the only safe setting, although you may still spawn too many tasks onto the nodes if your request to the SGE PE doesn't match the internal logic of your program.
To allow a more flexible allocation of the nodes in SGE, a scheduler for PVM is in preparation which will honor the SGE supplied allocation of slots per node. There will still be only one daemon running per node, but the replacement scheduler will be responsible for the correct distribution of the started processes in case of unevenly granted slots on the nodes.
SGE-PVM Integration
[1] Archive with all the scripts used in this Howto: pvm60.tgz [for older installations using SGE 5.3: pvm53.tgz].
[2] Archive with a small PVM program from the PVM User's Guide: pvm_hello.tgz.
PVM
The latest version of PVM and build instructions can be downloaded from http://www.netlib.org/pvm3.
PVM documentation in general and tutorials
For a general introduction to PVM and PVM programming, you can study the following documents: