Tight PVM Integration in Grid Engine

Topic:

Loose and tight integration of the PVM library into SGE.

Author:

Reuti, reuti__at__staff.uni-marburg.de; Philipps-University of Marburg, Germany

Version:

1.0 -- 2005-03-26 Initial release, comments and corrections are welcome

Contents:

Advantages of a Tight Integration
Prerequisites
Loose Integration
Additions for Tight Integration
Tight Integration
Nodes with more than one network interface
Restrictions and Future Work
References and Documents

Note:

This HOWTO complements the information contained in the $SGE_ROOT/pvm directory of the Grid Engine distribution.

Advantages of a Tight Integration

The files originally included for the PVM integration offer only a loose integration of PVM into SGE. This means that the daemons which build up the PVM are not under the control of SGE, and that the files created by PVM on the slave nodes for its internal management are not reliably deleted when a job is aborted. The accounting will also be incorrect, because the PVM daemons started via rsh on the slave nodes are not related to SGE in any way.
With a Tight Integration, on the other hand, all of these problems are solved. In addition, rsh no longer has to be enabled between the nodes, because all daemons on the slave nodes are started by the built-in qrsh of SGE.

Prerequisites

Configuration of SGE with qconf or the GUI
You should already know how to change settings in SGE, such as setting up and changing a queue definition or the entries in the PE configuration. Additional information about queues and parallel environments can be found in the SGE man pages "queue_conf" and "sge_pe" (make sure the SGE man pages are included in your $MANPATH).

Target platform

This Howto targets PVM version 3.4.5 on SGE 6.0. Some new platforms were added to the build step that creates the supporting start_pvm and stop_pvm programs. The other three included programs are only examples and are not needed for the integration. For users of SGE 5.3, a backport is also available.

PVM

The PVM (Parallel Virtual Machine) is a framework from the Oak Ridge National Laboratory (http://www.csm.ornl.gov/pvm/pvm_home.html). It provides an interface for parallel programs which allows the MPMD (multiple program multiple data) paradigm. Before you start with the integration of PVM into SGE, you should already be familiar with the operation of PVM outside SGE, such as starting the daemons and requesting information about the started virtual machine from the pvm-console. Although this is not strictly necessary for the integration into SGE, it will make it easier to understand the applied configurations (and to track down failures if something goes wrong). No patch or modification of the original PVM distribution is necessary.

Included setups and scripts

The supplied archive in [1] supersedes the provided $SGE_ROOT/pvm. It contains modified versions of the scripts and programs of the original PVM integration package in SGE, which now also enable the Tight Integration. You may therefore want to save the original $SGE_ROOT/pvm before you untar this package in the same location. If you are using (and will continue to use) the Loose Integration, you can still use this new package: by default it mimics the original behavior, and your existing configurations should continue to work without any change.

Another short-running program is provided in [2], which allows you to observe the correct distribution of the spawned tasks and the removal of files in /tmp.

Loose Integration

As there is now only one distribution of $SGE_ROOT/pvm, the steps to install this new package are the same as for the original integration package. After changing your working directory to $SGE_ROOT and untarring the supplied archive there as the SGE admin user, you have to compile some helper programs. To do this, change to $SGE_ROOT/pvm/src and set the necessary environment variables, for example:
$ export PVM_ROOT=/home/reuti/pvm3
$ export PVM_ARCH=LINUX
$ export SGE_ROOT=/usr/sge
Of course, you have to adjust these locations to your own installation. Be aware that PVM_ARCH follows a different naming convention than the architecture detected by SGE. On a Macintosh, for example, SGE refers to the architecture as "darwin" and PVM as "DARWIN" (for Linux it may be "lx24-x86" for SGE and "LINUX" for PVM). With these settings you can compile the programs by simply issuing ./aimk in $SGE_ROOT/pvm/src. You should now have a new directory (e.g. lx24-x86) containing the compiled programs. The script ./install.sh, executed in the same location, copies these files to the correct target in $SGE_ROOT/pvm/bin/<arch>, where startpvm.sh and stoppvm.sh expect them. The two programs start_pvm and stop_pvm (the other three are the examples) help the startpvm.sh and stoppvm.sh scripts to set up the PVM properly. The start_pvm program forks a child process which starts the PVM, while the parent watches for the correct startup and only returns successfully if the requested PVM could be established. The stop_pvm program just sends a halt command to the PVM.
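
The difference between the two naming conventions can be captured in a small helper. The following is only a sketch: the function name sge_to_pvm_arch is made up for this example, and it knows just the two pairs mentioned above. Any further pair has to be added by you after checking the names used by your SGE and PVM installations.

```shell
# Map an SGE architecture string to the corresponding PVM_ARCH name.
# Only the two pairs named in this Howto are listed; every additional
# pair is an assumption you have to verify for your own platform.
sge_to_pvm_arch() {
    case "$1" in
        lx24-x86) echo "LINUX" ;;
        darwin)   echo "DARWIN" ;;
        *)        echo "unknown SGE architecture: $1" >&2
                  return 1 ;;
    esac
}

# Possible use before calling ./aimk (left commented out, as it needs
# a real SGE installation):
# PVM_ARCH=$(sge_to_pvm_arch "$("$SGE_ROOT/util/arch")") && export PVM_ARCH
```

Failing on an unknown architecture, instead of guessing a name, avoids compiling against the wrong PVM library directory.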

Using the command line qconf or the qmon GUI, a sample setting for a PVM PE may look like:
$ qconf -sp pvm
pe_name           pvm
slots             46
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/pvm/startpvm.sh $pe_hostfile $host /home/reuti/pvm3
stop_proc_args    /usr/sge/pvm/stoppvm.sh $pe_hostfile $host
allocation_rule   1
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min
This setup is for SGE 6.x, where the PE must be listed in the "pe_list" entry of each cluster queue in which it should be used. For SGE 5.3, the queues to be used are defined in the PE itself. To have a working example which runs for a few seconds, we can compile hello.c and hello_other.c in your home directory (they are taken from the PVM User's Guide; you can find them in the supplied pvm_hello.tgz archive) with:
$ gcc -o hello hello.c -I/home/reuti/pvm3/include -L/home/reuti/pvm3/lib/LINUX -lpvm3
$ gcc -o hello_other hello_other.c -I/home/reuti/pvm3/include -L/home/reuti/pvm3/lib/LINUX -lpvm3
With these prepared, we can submit a simple PVM job and observe the behavior on the nodes:
$ qsub -pe pvm 2 tester_loose.sh
The start script of the PE will start the daemons on the (in this case two) nodes, and the started program will spawn two additional processes on each pvmd (besides the main routine). You may implement such logic if the main process hello only collects the results of the hello_other tasks and does no active work of its own.
$ rsh node16 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  787     1   787 /usr/sge/bin/glinux/sge_commd
  789     1   789 /usr/sge/bin/glinux/sge_execd
 2999   789  2999  \_ sge_shepherd-11233 -bg
 5340   789  5340  \_ sge_shepherd-11262 -bg
 5359  5340  5359      \_ /bin/sh /var/spool/sge/node16/job_scripts/11262
 5360  5359  5359          \_ ./hello
 5350     1  5341 /home/reuti/pvm3/lib/LINUX/pvmd3 /tmp/11262.1.para16/hostfile
 5361  5350  5341  \_ /home/reuti/hello_other
On the slave node, only the escaped pvmd and the started hello_other can be found:
$ rsh node08 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  787     1   787 /usr/sge/bin/glinux/sge_commd
  789     1   789 /usr/sge/bin/glinux/sge_execd
25950     1 25940 /home/reuti/pvm3/lib/LINUX/pvmd3 -s -d0x0 -nnode08 1 c0a89711:8110 4080 2 c0a
25951 25950 25940  \_ /home/reuti/hello_other
In this case, all files go to the default directory for PVM: /tmp.
$ rsh node16 ls -lh /tmp               
total 68K
drwxr-xr-x    2 reuti    users        4.0K Mar 25 23:21 11262.1.para16
drwx------    2 root     root          16K Apr 23  2004 lost+found
-rw-------    1 reuti    users          19 Mar 25 23:21 pvmd.502
-rw-------    1 reuti    users         127 Mar 25 23:21 pvml.502
srwxr-xr-x    1 reuti    users           0 Mar 25 23:21 pvmtmp005350.0
$ rsh node08 ls -lh /tmp               
total 64K
drwx------    2 root     root          16K Apr 23  2004 lost+found
-rw-------    1 reuti    users          19 Mar 25 23:21 pvmd.502
-rw-------    1 reuti    users         126 Mar 25 23:21 pvml.502
srwxr-xr-x    1 reuti    users           0 Mar 25 23:21 pvmtmp025940.0
In case of a proper shutdown of the job and the PVM, all files except pvml.502 (where the number reflects your user ID) are deleted again. Unless you enable the commented-out PVM_VMID in the start/stop script of this PE (and also use it in your job script by setting "PVM_VMID=$JOB_ID; export PVM_VMID"), you are limited to one PVM per user per machine. Setting PVM_VMID to the SGE-supplied $JOB_ID appends this value to all files generated by PVM and thus makes them unique for each job.
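
In a job script, the PVM_VMID setting can be wrapped in a small check, so that the script stops early when it is started outside of SGE. This is only a sketch; the function name set_pvm_vmid is made up for this example and is not part of the supplied package.

```shell
# Set PVM_VMID from the SGE job ID, so that the PVM files of concurrent
# jobs of the same user on one machine do not collide (Loose Integration).
set_pvm_vmid() {
    if [ -z "${JOB_ID:-}" ]; then
        echo "JOB_ID is not set; not running under SGE?" >&2
        return 1
    fi
    PVM_VMID=$JOB_ID
    export PVM_VMID
}

# Possible use at the top of a job script:
# set_pvm_vmid || exit 1
```

Remember that the same PVM_VMID must also be enabled in the start/stop script of the PE, otherwise the job script and the daemons will look at different files.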

If the job is aborted via qdel, the started daemons will shut down properly, but the started hello_other programs may continue to run.

Additions for Tight Integration

To achieve a Tight Integration, some changes were necessary to the already supplied Loose Integration scripts:

The first started pvmd (which is also responsible for starting the other pvmds on the slave nodes) has to be started by a local qrsh. This is done by a change to start_pvm.c to distinguish between the two startup modes. In case of a Tight Integration, it will now assemble a call to rsh instead of only exec'ing the pvmd directly.

The rsh-wrapper from $SGE_ROOT/mpi was taken as a starting point and modified to direct the rsh call in start_pvm to a qrsh. In addition, the temporary directory for PVM has to be set to the temporary directory created by SGE. This is done by prefixing the call to pvmd with "env PVM_TMP=\$TMPDIR". Because the dollar sign is escaped, $TMPDIR is only expanded on the target node, so PVM_TMP is set to the value valid on the target node and not to the one of the source node.

Any built-in rsh command in PVM is replaced by setting "PVM_RSH=rsh; export PVM_RSH" in startpvm.sh, so that PVM picks up the SGE rsh-wrapper.

The first pvmd, started via qrsh, will now in turn call qrsh again to start the pvmds on the slave nodes. Because the default behavior of pvmd is to fork into daemon land, the rsh-wrapper appends the option "-f" to the pvmd command if it detects that the rsh call is not a local one. So the rsh-wrapper in fact behaves in two different ways.
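
The two behaviors of the rsh-wrapper can be illustrated with the following sketch. It is not the wrapper from the supplied archive: the function name wrap_rsh and its argument order are made up for this example, it only prints the command line instead of executing it, and the exact position of "-f" in the real wrapper may differ. The "qrsh -V -inherit" form is the one visible in the process listings of the next section.

```shell
# Print (not execute) the qrsh command line for an intercepted rsh call.
# Usage: wrap_rsh <local_host> <target_host> <command> [args...]
wrap_rsh() {
    wr_local=$1
    wr_target=$2
    shift 2
    if [ "$wr_target" = "$wr_local" ]; then
        # Local call: this is the first pvmd, started by start_pvm.
        printf '%s\n' "qrsh -V -inherit $wr_target env PVM_TMP=\$TMPDIR $*"
    else
        # Call to a slave node: per the text above, -f is appended here.
        printf '%s\n' "qrsh -V -inherit $wr_target env PVM_TMP=\$TMPDIR $* -f"
    fi
}

# Example (prints one line, nothing is started):
# wrap_rsh node18 node20 /home/reuti/pvm3/lib/pvmd -s
```

Note that $TMPDIR is printed literally, as it must only be expanded on the target node.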

Tight Integration

Having already set up everything for the Loose Integration, you can switch to the Tight Integration by simply changing the definition of your PE: add -catch_rsh to the start and stop procedures of the PE, and swap the values of control_slaves and job_is_first_task (to TRUE and FALSE, respectively):
$ qconf -sp pvm
pe_name           pvm
slots             46
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/pvm/startpvm.sh -catch_rsh $pe_hostfile $host /home/reuti/pvm3
stop_proc_args    /usr/sge/pvm/stoppvm.sh -catch_rsh $pe_hostfile $host
allocation_rule   1
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
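
If you prefer to script this change instead of editing the PE by hand, the three modifications can be expressed as a small filter. This is only a sketch: the function name loose_to_tight is made up for this example, and it assumes that the input is a loose definition laid out like the one shown in the Loose Integration section. It is meant to be applied once; applying it twice would insert -catch_rsh a second time.

```shell
# Read a loose PE definition on stdin, write the tight variant to stdout.
loose_to_tight() {
    sed -e '/^start_proc_args/ s|startpvm\.sh |startpvm.sh -catch_rsh |' \
        -e '/^stop_proc_args/ s|stoppvm\.sh |stoppvm.sh -catch_rsh |' \
        -e 's|^\(control_slaves[[:space:]]*\)FALSE|\1TRUE|' \
        -e 's|^\(job_is_first_task[[:space:]]*\)TRUE|\1FALSE|'
}

# Possible use (left commented out, as it changes your SGE configuration;
# the file name is just an example):
# qconf -sp pvm | loose_to_tight > /tmp/pvm_tight.pe
# qconf -Mp /tmp/pvm_tight.pe
```

Check the generated file before loading it with qconf -Mp, as the filter does not validate the rest of the definition.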
The only thing left to do is to set PVM_TMP in your job script, as is done in tester_tight.sh, because this is the location where your executing program will look for information about the current PVM:

export PVM_TMP=$TMPDIR
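
A slightly more defensive variant for your own job scripts is sketched below. It is not taken from tester_tight.sh; the function name set_pvm_tmp is made up for this example. It refuses to continue if $TMPDIR is missing, since the program would otherwise not find the PVM started for this job.

```shell
# Point PVM_TMP at the job-private directory created by SGE and stop
# if that directory is not available.
set_pvm_tmp() {
    if [ -z "${TMPDIR:-}" ] || [ ! -d "$TMPDIR" ]; then
        echo "TMPDIR is unset or not a directory" >&2
        return 1
    fi
    PVM_TMP=$TMPDIR
    export PVM_TMP
}

# Possible use at the top of a job script:
# set_pvm_tmp || exit 1
```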

After submitting the job in exactly the same way as before (but this time using the script tester_tight.sh in the qsub command):
$ qsub -pe pvm 2 tester_tight.sh
you should see a process distribution on the head node of your job like:
$ rsh node18 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  788     1   788 /usr/sge/bin/glinux/sge_commd
  791     1   791 /usr/sge/bin/glinux/sge_execd
 7147   791  7147  \_ sge_shepherd-11238 -bg
 7183  7147  7183  |   \_ /bin/sh /var/spool/sge/node18/job_scripts/11238
 7184  7183  7183  |       \_ ./hello
 7161   791  7161  \_ sge_shepherd-11238 -bg
 7162  7161  7162      \_ /usr/sge/utilbin/glinux/rshd -l
 7164  7162  7164          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sge/node18/active_jobs
 7166  7164  7166              \_ /home/reuti/pvm3/lib/LINUX/pvmd3 /tmp/11238.1.para18/hostfile
 7185  7166  7166                  \_ /home/reuti/hello_other
 7158     1  7148 /usr/sge/bin/glinux/qrsh -V -inherit node18 env PVM_TMP=$TMPDIR /home/reuti/pvm3/l
 7163  7158  7148  \_ /usr/sge/utilbin/glinux/rsh -p 46806 node18 exec '/usr/sge/utilbin/glinux/qrsh
 7165  7163  7148      \_ [rsh <defunct>]
 7173     1  7166 /usr/sge/bin/glinux/qrsh -V -inherit node20 env PVM_TMP=$TMPDIR $PVM_ROOT/lib/pvmd
 7181  7173  7166  \_ /usr/sge/utilbin/glinux/rsh -p 53007 node20 exec '/usr/sge/utilbin/glinux/qrsh
 7182  7181  7166      \_ [rsh <defunct>]
The important thing is that the daemon and the started programs hello and hello_other are under full SGE control. Also, the files otherwise created in /tmp are now placed in the private directory created by SGE for this job:
$ rsh node18 ls -h /tmp/11238.1.para18                  
hostfile
pid.1.node18
pvmd.502
pvml.502
pvmtmp007166.0
qrsh_client_cache
rsh
The same can be observed on the slave node (in this case only one):
$ rsh node20 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  787     1   787 /usr/sge/bin/glinux/sge_commd
  790     1   790 /usr/sge/bin/glinux/sge_execd
15397   790 15397  \_ sge_shepherd-11238 -bg
15398 15397 15398      \_ /usr/sge/utilbin/glinux/rshd -l
15399 15398 15399          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sge/node20/active_jobs
15400 15399 15400              \_ /home/reuti/pvm3/lib/LINUX/pvmd3 -s -d0x0 -nnode20 1 c0a89713:814c
15401 15400 15400                  \_ /home/reuti/hello_other
$ rsh node20 ls -h /tmp/11238.1.para20
pid.1.node20
pvmd.502
pvml.502
pvmtmp015400.0
Regardless of whether the job terminates at the intended program end or is aborted with qdel, all processes and intermediate files will be cleanly removed. We also don't have to use PVM_VMID, as all files are already separated into different temporary directories.

Nodes with more than one network interface

If the nodes of your cluster have two (or more) network interfaces and you are not using the primary interface for PVM communication, you also have to add -catch_hostname to the start_proc_args to catch the call to hostname, as the result of this call is used by the rsh-wrapper to distinguish between a local qrsh call and one to the slave nodes. In this hostname-wrapper, you must change the response of the original hostname call to give the name of the internal interface, which is also the one used in the SGE-supplied $pe_hostfile. If you want the PVM communication to use a completely different interface than the one SGE is aware of, you may have a look at $SGE_ROOT/mpi/startmpi.sh and map the names supplied in $pe_hostfile in $SGE_ROOT/pvm/startpvm.sh in a similar way (besides the necessary -catch_hostname).
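
What the changed response of the hostname-wrapper could look like is sketched below. The naming scheme is purely hypothetical: it assumes that the internal interface of "node18" is known as "node18-int". The function name internal_hostname and the variable INTERNAL_SUFFIX are made up for this example, so adjust them to the names used in your $pe_hostfile.

```shell
# Hypothetical naming scheme: internal interface = short hostname + suffix.
INTERNAL_SUFFIX=${INTERNAL_SUFFIX:--int}

# Translate a hostname into the name of the internal interface.
internal_hostname() {
    # Strip a domain part, if any, then append the suffix.
    ih_short=${1%%.*}
    echo "${ih_short}${INTERNAL_SUFFIX}"
}

# Possible use inside a hostname-wrapper:
# internal_hostname "$(/bin/hostname)"
```

Whatever scheme you use, the result has to match the names in $pe_hostfile exactly, as the comparison between local and remote calls relies on it.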

Restrictions and Future Work

As SGE grants the nodes/slots to your job, you shouldn't use pvm_addhosts(), pvm_delhosts() or similar calls in your program, as this would violate the SGE scheduling of jobs to the cluster.
PVM internally uses a scheduler of its own to distribute the tasks initiated by pvm_spawn() to the nodes (unless you specify the target node directly in pvm_spawn() yourself, according to the nodes/slots granted by SGE of course). The built-in default is just a round-robin scheme across all the granted nodes. So a fixed allocation rule of 1 (or of 2 for dual-CPU machines) is the only safe setting, although you may still spawn too many tasks on the nodes if your request to the SGE PE doesn't match the internal logic of your program.
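
The effect of a plain round-robin placement can be estimated before submitting a job. The following sketch only counts tasks per host; it does not talk to PVM at all. The function name round_robin is made up for this example, and the assumption that the placement starts with the first host in the list is mine.

```shell
# Usage: round_robin <ntasks> <host> [host...]
# Prints "<host> <count>" for a plain round-robin placement of the tasks.
round_robin() {
    rr_ntasks=$1
    shift
    rr_nhosts=$#
    rr_i=0
    for rr_host in "$@"; do
        # Every host gets ntasks/nhosts tasks; the first
        # (ntasks mod nhosts) hosts get one more.
        rr_count=$(( rr_ntasks / rr_nhosts ))
        if [ "$rr_i" -lt $(( rr_ntasks % rr_nhosts )) ]; then
            rr_count=$(( rr_count + 1 ))
        fi
        echo "$rr_host $rr_count"
        rr_i=$(( rr_i + 1 ))
    done
}

# Example: five spawned tasks on two granted nodes
# round_robin 5 node18 node20
```

With an allocation rule of 1, every count above 1 means that the node runs more tasks than SGE granted slots for.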

To allow a more flexible allocation of the nodes in SGE, a scheduler for PVM is in preparation, which will honor the allocation of slots per node supplied by SGE. There will still be only one daemon running per node, but the replacement scheduler will be responsible for the correct distribution of the started processes in case of unevenly granted slots on the nodes.
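
Until then, the slot allocation can be read directly from the $pe_hostfile, where each line starts with the hostname followed by the number of granted slots (see the "sge_pe" man page). The sketch below is not the scheduler mentioned above. It only produces one line per granted slot, which a program specifying the target node in pvm_spawn() could use as input. The function name expand_slots is made up for this example.

```shell
# Print each host of a pe_hostfile once per granted slot.
# Line format: <hostname> <slots> <queue> <processor range>
expand_slots() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Possible use inside a job script:
# expand_slots "$PE_HOSTFILE"
```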

References and Documents

SGE-PVM Integration
[1] Archive with all the scripts used in this Howto: pvm60.tgz [for older installations using SGE 5.3: pvm53.tgz].
[2] Archive with a small PVM program from the PVM User's Guide: pvm_hello.tgz.

PVM

The latest version of PVM and its build instructions can be downloaded from http://www.netlib.org/pvm3.

PVM documentation in general and tutorials

For a general introduction to PVM and PVM programming, you can study the following documents:

http://www.netlib.org/pvm3/book/pvm-book.ps

http://www.netlib.org/pvm3/refcard.ps