This document describes how to install Grid Engine on machines with multiple network interfaces (multi-homed hosts). A special case, using the Solaris Operating Environment IP Multipathing (IPMP) technology for IP failover, is described in a separate HOWTO.
Suppose we have two Ethernet interfaces (say hme0 and hme1) on each machine. One interface (hme0) carries general traffic, NFS file sharing and so on, and the other (hme1) is dedicated to Grid Engine communications. We would like to set up Grid Engine so that it communicates only on the dedicated Grid Engine network. In this example, there is a Grid Engine master node (sun-master) and three exec hosts (sun-1, sun-2, sun-3).
In /etc, there will be a file called hostname.hme0 populated with the hostname (e.g. sun-1) and another file called hostname.hme1 populated with the Grid Engine interface name (e.g. sun-1-grid). The /etc/hosts file should, of course, contain entries for the SGE interface as well as the standard interface:
    #
    # Grid Engine Network
    #
    192.168.7.2     sun-master-grid
    192.168.7.3     sun-1-grid
    192.168.7.11    sun-2-grid
    192.168.7.12    sun-3-grid
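Before installing, it can be useful to confirm that every host has its Grid Engine entry in the hosts file. The following is a minimal sketch, not part of Grid Engine: the function name check_grid_entries is hypothetical, and it assumes a POSIX-style shell and an awk that accepts -v (on older Solaris releases, nawk may be needed instead).

```shell
# Hypothetical helper: report hosts that lack a "<name>-grid" entry in a hosts file.
# Usage: check_grid_entries HOSTS_FILE host1 [host2 ...]
check_grid_entries() {
    hosts_file=$1
    shift
    missing=0
    for host in "$@"; do
        # Strip comments, then look for the grid name as a whole field
        # after the address column.
        if sed 's/#.*//' "$hosts_file" | awk -v name="${host}-grid" '
                { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
                END { exit !found }'; then
            :
        else
            echo "missing: ${host}-grid"
            missing=1
        fi
    done
    return $missing
}

# Example:
# check_grid_entries /etc/hosts sun-master sun-1 sun-2 sun-3
```

The function prints one line per missing entry and returns non-zero if any entry is missing, so it can gate the installation step in a script.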
When both networks are functioning correctly, install Grid Engine (SGE) on all hosts.
Under SGE_ROOT/SGE_CELL/common, create a file named host_aliases and populate it as follows:
    # cat host_aliases
    sun-master-grid sun-master
    sun-1-grid sun-1
    sun-2-grid sun-2
    sun-3-grid sun-3
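With more than a handful of hosts, the alias file can be generated rather than typed. This is a small sketch under the naming convention used in this example (interface name = hostname plus "-grid"); the function name make_host_aliases is hypothetical.

```shell
# Hypothetical helper: print one host_aliases line per host, in the form
# "<host>-grid <host>", following the naming convention of this example.
make_host_aliases() {
    for host in "$@"; do
        printf '%s-grid %s\n' "$host" "$host"
    done
}

# Example (assumes SGE_ROOT and SGE_CELL are set for your site):
# make_host_aliases sun-master sun-1 sun-2 sun-3 \
#     > "$SGE_ROOT/$SGE_CELL/common/host_aliases"
```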
Check that SGE can resolve all the hostnames correctly:
    # cd /gridware/sge/utilbin/solaris64
    # ./gethostbyname -aname sun-1
    sun-1-grid
    # ./gethostbyname -aname sun-1-grid
    sun-1-grid
    #
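The same check can be repeated for every host in a loop. The sketch below is an illustration only: check_sge_names and the RESOLVER variable are hypothetical names, the default path is this example's install location, and it assumes that gethostbyname -aname prints just the resolved name, as in the transcript above.

```shell
# Hypothetical helper: verify that each host name and its "-grid" alias both
# resolve to the "-grid" name. RESOLVER defaults to the gethostbyname utility
# at this document's example install path; override it for other layouts.
RESOLVER=${RESOLVER:-/gridware/sge/utilbin/solaris64/gethostbyname}

check_sge_names() {
    status=0
    for host in "$@"; do
        for name in "$host" "${host}-grid"; do
            # Treat a failed lookup as an empty result.
            resolved=$("$RESOLVER" -aname "$name" 2>/dev/null) || resolved=""
            if [ "$resolved" != "${host}-grid" ]; then
                echo "FAIL: $name resolved to '${resolved}', expected ${host}-grid"
                status=1
            fi
        done
    done
    return $status
}

# Example:
# check_sge_names sun-master sun-1 sun-2 sun-3
```

A non-zero return means at least one name did not map to its Grid Engine interface, which usually points at a missing or misspelled line in host_aliases.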
Shut down the exec hosts:
    # /etc/init.d/rcsge stop
    /gridware/sge/default/spool/sun-1-grid/active_jobs: No such file or directory
    Shutting down Grid Engine communication daemon
    # ps -ef |grep sge
    #
Then shut down the master as well, using the same command.
Alter the file SGE_ROOT/SGE_CELL/common/act_qmaster so that it contains the name of the master's Grid Engine interface rather than the hostname (e.g. sun-master-grid).
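One way to make this change while keeping a copy of the old value is sketched below. The function name set_act_qmaster and the ".orig" backup suffix are assumptions made for illustration; editing the file by hand works equally well.

```shell
# Hypothetical helper: back up act_qmaster and replace its contents with the
# master's Grid Engine interface name.
# Usage: set_act_qmaster COMMON_DIR MASTER_GRID_NAME
set_act_qmaster() {
    common_dir=$1
    master_name=$2
    file="$common_dir/act_qmaster"
    # Keep a copy of the previous value so the change can be reverted.
    if [ -f "$file" ]; then
        cp "$file" "$file.orig" || return 1
    fi
    echo "$master_name" > "$file"
}

# Example (assumes SGE_ROOT and SGE_CELL are set for your site):
# set_act_qmaster "$SGE_ROOT/$SGE_CELL/common" sun-master-grid
```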
Start up the master node now:
    # /etc/init.d/rcsge
    starting sge_qmaster
    starting program: /gridware/sge/bin/solaris64/sge_commd
    using service "sge_commd"
    bound to port 536
    Reading in complexes:
    Complex "host".
    Complex "queue".
    Reading in execution hosts.
    Reading in administrative hosts.
    Reading in submit hosts.
    Reading in usersets:
    Userset "defaultdepartment".
    Userset "deadlineusers".
    Reading in queues:
    Queue "sun-1.q".
    Queue "sun-2.q".
    Queue "sun-3.q".
    Reading in parallel environments:
    PE "make".
    Reading in scheduler configuration
    cant load sharetree (cant open file sharetree: No such file or directory), starting up with empty sharetree
    starting sge_schedd
    #
Now start up the SGE exec hosts:
    # /etc/init.d/rcsge start
    starting sge_execd
    starting program: /gridware/sge/bin/solaris64/sge_commd
    using service "sge_commd"
    bound to port 536
    #
Snoop the network to check that the correct interfaces are being used:
    # qsub -q sun-1 test.sh
    # snoop -V -d hme1 sun-1-grid
    sun-master-grid -> sun-1-grid TCP D=46883 S=536 Syn Ack=2694354350 \
        Seq=2537161622 Len=0 Win=49640 Options=<mss 1460,nop,nop,sackOK>
    sun-1-grid -> sun-master-grid TCP D=536 S=46883 Ack=2537161623 \
        Seq=2694354350 Len=0 Win=49640

Trademarks
Sun and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.