The default installation of Grid Engine assumes that the $SGE_ROOT
directory is on a shared filesystem accessible by all hosts in the
cluster. For a large cluster, this could entail significant NFS
traffic. There are various ways to reduce this traffic, including a
way to eliminate entirely the requirement that Grid Engine operate
using shared files. However, each alternative comes with a
corresponding loss of convenience and, in some cases, of functionality.
This HOWTO explains how to implement the different alternatives.
Note: each configuration below moves a different part of the following file structure out of NFS sharing:

SGE_ROOT
    executables, including bin, util, utilbin
    SGE_CELL
        common
        spool
| Configuration | Description | Advantages | Disadvantages |
|---|---|---|---|
| default | executables, configuration files, spool directories: all shared | simple to install | potentially significant NFS traffic |
| local spool directories | executables, configuration files: shared; spool directories: local to each compute host | simple to install; easy to upgrade; significant reduction in NFS traffic | less convenient to debug (must go to the individual host to see the execd messages file) |
| local executable files | configuration files: shared; executables, spool directories: local to each compute host | near-elimination of NFS traffic (the benefit is especially noticeable when running massively parallel jobs across many nodes) | less convenient to install and upgrade (must modify files on every host); less convenient to debug |
| local configuration files | executables, configuration files, spool directories: all local to each compute host | elimination of the NFS requirement | less convenient to install and upgrade; less convenient to debug; less convenient to change some configuration parameters (must modify files on every host); loss of shadow master functionality; partial loss of qacct functionality |
The spool directory for each execd is the greatest source of NFS traffic for Grid Engine. When jobs are dispatched to an exec host, the job script gets transferred via the commd and then written to the spool directory. Each job gets its own subdirectory, into which additional information is written by both the execd and the job shepherd process. Logfiles are also written into the spool directory, for both the execd as well as the individual jobs.
By configuring local spool directories, all that traffic can be redirected to the local disk on each compute host, thus isolating it from the rest of the network as well as reducing the I/O latency. One disadvantage is that, in order to view the logfiles for a particular job, you need to log onto the system where the job ran, instead of simply looking in the shared directory. But, this would only be necessary for detailed debugging of a job problem.
The path to the spool directory is controlled by the parameter execd_spool_dir; it should be set to a directory on the local compute host which is owned by the admin user and which ideally can handle intensive reading and writing (e.g., /var/spool/sge). The execd_spool_dir parameter can be specified when running the install_qmaster script. However, the directory must already exist and be owned by the admin user, or else the script will complain and the execd will not function properly. Alternatively, the execd_spool_dir parameter can be changed later in the cluster configuration (man sge_conf); the execds need to be halted before this change can be made. Please make sure you read the sge_conf(5) man page before changing this parameter.
In the default setup, all hosts in a cluster read the binary files for daemons and commands off the shared directory. For daemons, this only occurs once, when they start up. When jobs run, other processes are invoked, such as the shepherd and the rshd (for interactive jobs). In a high-throughput cluster, or when invoking a massively-parallel job across many nodes, there is a possibility that many simultaneous NFS read accesses to these other executables could occur. To counter this, you could make all executables local to the compute hosts.
In this configuration,
rather than sharing $SGE_ROOT
over NFS to the compute
hosts, you would only share $SGE_ROOT/$SGE_CELL/common
(you would also implement local spool directories as described
above). On each compute host, you would need to install both the
"common" and the architecture-specific binary packages.
Then, you would mount the shared $SGE_ROOT/$SGE_CELL/common
directory before invoking the install_execd script. In
order to prevent confusion, make sure that the path to $SGE_ROOT
is identical on the master host and compute hosts, e.g.,
SGE_ROOT=/opt/sge
on all hosts.
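A minimal sketch of the per-host steps, assuming SGE_ROOT=/opt/sge, a cell named "default", and placeholder package file names and master hostname:

```sh
# On the compute host: unpack the common and architecture-specific packages
# locally (package file names are examples).
mkdir -p /opt/sge && cd /opt/sge
tar xzf /tmp/sge-common.tar.gz
tar xzf /tmp/sge-bin-lx24-amd64.tar.gz

# Mount only the cell's common directory from the master host.
mkdir -p /opt/sge/default/common
mount master:/opt/sge/default/common /opt/sge/default/common

# Then run the execd installation as usual.
export SGE_ROOT=/opt/sge
cd $SGE_ROOT && ./install_execd
```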
For submit and admin hosts, you could choose to either install the
executables locally, or else mount them from some shared version of
$SGE_ROOT
, since it is unlikely that NFS traffic on
these types of hosts would be a cause for concern in terms of
performance.
Although the above two setups describe ways to reduce NFS traffic
to almost nil, there might be other reasons why NFS is not desired.
For example, the only available version of NFS for your operating
environment might not be considered reliable enough for production
use. In this case, you can choose not to share the configuration
directory $SGE_ROOT/$SGE_CELL/common
, but instead have
it be local to each compute host. This would result in no files being
shared via NFS. However, because you are no longer using a common set
of files shared by all systems, there is some functionality which
requires some extra effort to use, and other functionality which no
longer works.
1) When you modify certain configuration files, the modification must
be made manually on every host in the cluster. These files are located
in the $SGE_ROOT/$SGE_CELL/common directory:
sge_request: default submit request flags (see the sge_request(5) man page)
host_aliases: hostname aliasing (see the host_aliases(5) man page)
sge_aliases: file path aliasing (see the sge_aliases(5) man page)
configuration: most of the information in this file does not need to
be used by the exec hosts. However, there are two parameters,
ignore_fqdn and default_domain, which are used by Grid Engine (and
Grid Scheduler) on all hosts. If the value of either parameter is
changed on the master host, the change also needs to be reflected on
every exec host in the cluster. Normally, these two parameters are set
once, when the master host is installed, and then never changed.
However, if you run into hostname-resolution problems, you may need to
change them and see whether that fixes the problem.
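Because these files must be kept consistent by hand, a simple distribution step helps. The following is only one possible approach; the hostnames and the value of SGE_ROOT are placeholders.

```sh
# Example only: push the hand-maintained configuration files to every exec host.
SGE_ROOT=/opt/sge
SGE_CELL=default
for host in node01 node02 node03; do
    for f in sge_request host_aliases sge_aliases configuration; do
        rsync -a $SGE_ROOT/$SGE_CELL/common/$f $host:$SGE_ROOT/$SGE_CELL/common/
    done
done
```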
2) Another consequence is that the qacct command will only
work if executed on the master host. This is because the accounting
file, where all historical information is stored, is only updated on
the master host. Because qacct will by default read information from
the file $SGE_ROOT/$SGE_CELL/common/accounting
, it will
only be accurate on the master host. qacct can be directed to read
information from any file, using the -f flag, so one alternative is
to manually copy the accounting file periodically to another system,
where the analysis can take place.
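For example (hostnames and paths are illustrative), the accounting file could be copied off the master host and the copy analyzed elsewhere with the -f flag:

```sh
# Copy the accounting file from the master host and query the copy with -f.
scp master:/opt/sge/default/common/accounting /tmp/accounting
qacct -f /tmp/accounting -o    # per-owner usage summary from the copied file
```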
3) Finally, if you do not share the $SGE_ROOT/$SGE_CELL/common
directory, you cannot use the Shadow Master facility. The
Shadow Master feature relies upon a shared filesystem to keep track
of the active master, so without NFS, Shadow Mastering does not
work.
To install with this type of setup, proceed as follows (a command sketch follows the list):
unpack/untar the Grid Engine distribution on each system (common and architecture-specific packages) to the same pathname on each system
install the master host completely
modify all the configuration files mentioned above to suit the requirements of your site
on the master host, make an archive of the directory
$SGE_ROOT/$SGE_CELL/common
on each exec host, unpack the archive created above
on each exec host, run the install_execd
script. It should automatically read in the configuration from the
directory which was unpacked.
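As an illustration of the archive and unpack steps, assuming SGE_ROOT=/opt/sge, a cell named "default", and the placeholder hostname "master":

```sh
# On the master host: archive the cell's common directory.
cd /opt/sge/default
tar czf /tmp/sge-common-config.tar.gz common

# On each exec host: fetch and unpack it to the same path, then install the execd.
scp master:/tmp/sge-common-config.tar.gz /tmp/
mkdir -p /opt/sge/default
cd /opt/sge/default && tar xzf /tmp/sge-common-config.tar.gz
cd /opt/sge && ./install_execd
```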
Even though Sun Grid Engine can function perfectly well without NFS (apart from the functionality losses noted above), there are other considerations which might lead to unexpected behavior.
Unless otherwise specified, Grid Engine runs jobs in the user's
home directory. If this is not shared, then whatever files are
created will be placed in the home directory on the host where the
job is executed. Also, any configuration given in dot-files, such
as .cshrc
and .sge_requests
, will be read
out of the home directory on the host where the job is executed.
Finally, if the home directory of the user actually does not exist on
the compute host, the job will go into an error state. You need to
make sure that for every user, and on every compute host, a home
directory is present and contains all the desired dot-file
configurations. Also, for jobs submitted with the -cwd flag, the
current working directory is recorded at submission time; when the job
executes on the compute host, the job will go into an error state
unless that exact path exists and is accessible to the user running
the job.
Obviously, without NFS there needs to be a way to stage data files in and out, and the application files (binaries, libraries, config files, databases, etc.) would also need to be either already present on each compute host or also staged in. The prolog and epilog script feature of Grid Engine provides a generic mechanism for implementing a site-specific stage-in/stage-out facility. Alternatively, these steps could be embedded into jobs scripts directly.
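As a rough sketch of this idea, a prolog script (registered via the queue's prolog parameter, see queue_conf(5)) could stage input data onto local scratch space before the job starts; a matching epilog would copy results back and clean up. The data server hostname and paths below are assumptions for illustration only.

```sh
#!/bin/sh
# Hypothetical prolog: stage input data to local scratch before the job starts.
# $JOB_ID is set by Grid Engine; "dataserver" and the paths are examples.
SCRATCH=/scratch/$JOB_ID
mkdir -p $SCRATCH
rsync -a dataserver:/projects/input/ $SCRATCH/input/
```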
If application availability and data file staging were accounted for, one could in principle run Grid Engine without NFS over a WAN. However, part of the Grid Engine built-in authentication is that the username of the user submitting a job must be recognized on the compute host where the job runs. If running across administrative domains, the username might not exist on the target exec host. In this case, some of the solutions include:
allow users to log in as a common "grid user" (the -A submit flag could be used to distinguish the actual identity of the user; see the example below)
using an SUID wrapper around the submit and administrative commands to do the same thing transparently
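For instance (a site-specific convention, shown only as an illustration), the real user's name could be carried in the account string so that it appears in the accounting records:

```sh
# Submitted while logged in as the shared "grid user"; the -A account string
# records the real user's identity for later qacct reporting (site convention).
qsub -A alice job.sh

# Later, on the master host, summarize usage recorded under that account string.
qacct -A alice
```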