Reducing and Eliminating NFS usage by Grid Engine

The default installation of Grid Engine assumes that the $SGE_ROOT directory is on a shared filesystem accessible by all hosts in the cluster. For a large cluster, this could entail significant NFS traffic. There are various ways to reduce this traffic, including a way to entirely eliminate the requirement that Grid Engine operate using shared files. However, each alternative comes with a corresponding loss of convenience and, in some cases, functionality. This HOWTO explains how to implement the different alternatives.

Levels of Grid Engine NFS dependencies

[Diagram omitted: the $SGE_ROOT directory tree; in the original document, color coding indicated which part of the file structure is moved out of NFS sharing at each level.]

Configuration: default
  Description:    executables, configuration files, spool directories: all shared
  Advantages:     simple to install; easy to upgrade; easy to debug
  Disadvantages:  potentially significant NFS traffic

Configuration: local spool directories
  Description:    executables, configuration files: shared; spool directories: local to each compute host
  Advantages:     simple to install; easy to upgrade; significant reduction in NFS traffic
  Disadvantages:  less convenient to debug (must go to the individual host to see the execd messages file)

Configuration: local executable files
  Description:    configuration files: shared; executables, spool directories: local to each compute host
  Advantages:     near-elimination of NFS traffic (the benefit is especially noticeable when running massively parallel jobs across many nodes)
  Disadvantages:  less convenient to install and upgrade (must modify files on every host); less convenient to debug

Configuration: local configuration files
  Description:    executables, configuration files, spool directories: all local to each compute host
  Advantages:     elimination of the NFS requirement
  Disadvantages:  less convenient to install and upgrade; less convenient to debug; less convenient to change some configuration parameters (must modify files on every host); loss of shadow master functionality; partial loss of qacct functionality

Local Spool Directories

The spool directory for each execd is the greatest source of NFS traffic for Grid Engine. When a job is dispatched to an exec host, the job script is transferred via the commd and then written to the spool directory. Each job gets its own subdirectory, into which additional information is written by both the execd and the job's shepherd process. Logfiles are also written into the spool directory, both for the execd and for the individual jobs.

By configuring local spool directories, all that traffic can be redirected to the local disk on each compute host, thus isolating it from the rest of the network as well as reducing the I/O latency. One disadvantage is that, in order to view the logfiles for a particular job, you need to log onto the system where the job ran, instead of simply looking in the shared directory. But, this would only be necessary for detailed debugging of a job problem.

The path to the spool directory is controlled by the parameter execd_spool_dir; it should be set to a directory on the local compute host which is owned by the admin user and which ideally can handle intensive reading/writing (e.g., /var/spool/sge). The execd_spool_dir parameter can be specified when running the install_qmaster script. However, this directory must already exist and be owned by the admin user, or else the script will complain and the execd will not function properly. Alternatively, the execd_spool_dir parameter can be changed in the cluster configuration (man sge_conf); the execds need to be shut down before this change can be made. Please make sure you read sge_conf(5) before changing this parameter.
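
As a rough sketch (the admin user name "sgeadmin" and the path /var/spool/sge below are placeholders for your own values), preparing and activating a local spool directory might look like this:

    # On each compute host: create the local spool directory and hand it
    # over to the admin user (assumed here to be "sgeadmin")
    mkdir -p /var/spool/sge
    chown -R sgeadmin /var/spool/sge

    # On an admin host: point the configuration at the new path.
    # "qconf -mconf" opens the global cluster configuration in an editor;
    # set:  execd_spool_dir  /var/spool/sge
    # (remember to shut down the execds first, as noted above)
    qconf -mconf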

Local Executables

In the default setup, all hosts in a cluster read the binary files for daemons and commands from the shared directory. For the daemons, this only occurs once, when they start up. When jobs run, other processes are invoked, such as the shepherd and the rshd (for interactive jobs). In a high-throughput cluster, or when invoking a massively parallel job across many nodes, many simultaneous NFS read accesses to these executables can occur. To counter this, you can make all executables local to the compute hosts.

In this configuration, rather than sharing $SGE_ROOT over NFS to the compute hosts, you would only share $SGE_ROOT/$SGE_CELL/common (you would also implement local spool directories as described above). On each compute host, you would need to install both the "common" and the architecture-specific binary packages. Then, you would mount the shared $SGE_ROOT/$SGE_CELL/common directory before invoking the install_execd script. In order to prevent confusion, make sure that the path to $SGE_ROOT is identical on the master host and compute hosts, e.g., SGE_ROOT=/opt/sge on all hosts.
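
A rough per-host installation sketch follows; SGE_ROOT=/opt/sge, the cell name "default", the master host name "sgemaster", and the package file names are all assumptions that will differ at your site:

    # Unpack the common and architecture-specific binary packages locally
    SGE_ROOT=/opt/sge; export SGE_ROOT
    mkdir -p $SGE_ROOT; cd $SGE_ROOT
    gzip -dc sge-common.tar.gz | tar xvpf -      # example file name
    gzip -dc sge-bin-<arch>.tar.gz | tar xvpf -  # example file name

    # Mount only the shared configuration directory from the master host
    mkdir -p $SGE_ROOT/default/common
    mount sgemaster:/opt/sge/default/common $SGE_ROOT/default/common

    # Install the exec host using the local binaries
    ./install_execd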

For submit and admin hosts, you could choose to either install the executables locally, or else mount them from some shared version of $SGE_ROOT, since it is unlikely that NFS traffic on these types of hosts would be a cause for concern in terms of performance.

Local Configuration Files

Although the above two setups describe ways to reduce NFS traffic to almost nil, there might be other reasons why NFS is not desired. For example, the only available version of NFS for your operating environment might not be considered reliable enough for production use. In this case, you can choose not to share the configuration directory $SGE_ROOT/$SGE_CELL/common, but instead have it be local to each compute host. This would result in no files being shared via NFS. However, because you are no longer using a common set of files shared by all systems, there is some functionality which requires some extra effort to use, and other functionality which no longer works.

1) When you modify certain configuration files, the modification needs to be made manually on every host in the cluster. These files are located in the $SGE_ROOT/$SGE_CELL/common directory.
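
For example, a change to the cluster-wide default request file sge_request could be pushed out with a small loop such as the sketch below; the "hostlist" file and the use of scp are assumptions, not part of Grid Engine itself:

    # On the master host: push the modified file to every exec host
    # listed (one per line) in a file named "hostlist"
    for host in `cat hostlist`; do
        scp $SGE_ROOT/$SGE_CELL/common/sge_request \
            $host:$SGE_ROOT/$SGE_CELL/common/sge_request
    done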

2) Another consequence is that the qacct command will only work if executed on the master host. This is because the accounting file, where all historical information is stored, is only updated on the master host. Because qacct will by default read information from the file $SGE_ROOT/$SGE_CELL/common/accounting, it will only be accurate on the master host. qacct can be directed to read information from any file, using the -f flag, so one alternative is to manually copy the accounting file periodically to another system, where the analysis can take place.
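
A sketch of that workaround, with the analysis host name and destination path chosen only for illustration:

    # On the master host: copy the accounting file to the analysis host
    scp $SGE_ROOT/$SGE_CELL/common/accounting analysishost:/tmp/accounting

    # On the analysis host: tell qacct to read the copied file,
    # e.g. to produce a per-owner usage summary
    qacct -f /tmp/accounting -o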

3) Finally, if you do not share the $SGE_ROOT/$SGE_CELL/common directory, you cannot use the Shadow Master facility. The Shadow Master feature relies upon a shared filesystem to keep track of the active master, so without NFS, Shadow Mastering does not work.

To install with this type of setup, proceed as follows:

  1. unpack/untar the Grid Engine distribution (common and architecture-specific packages) to the same pathname on each system

  2. install the master host completely

  3. modify all the configuration files mentioned above to suit the requirements of your site

  4. on the master host, make an archive of the directory $SGE_ROOT/$SGE_CELL/common (see the sketch after this list)

  5. on each exec host, unpack the archive created above

  6. on each exec host, run the install_execd script. It should automatically read in the configurations from the directory which was unpacked.
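
The following sketch illustrates steps 4 through 6; the archive name, location, and transfer method are placeholders:

    # Step 4, on the master host: archive the cell's common directory
    cd $SGE_ROOT/$SGE_CELL
    tar cvpf /tmp/common.tar common

    # Step 5, on each exec host (after copying /tmp/common.tar over,
    # e.g. with scp): unpack it into the same location
    cd $SGE_ROOT/$SGE_CELL
    tar xvpf /tmp/common.tar

    # Step 6, on each exec host: run the installation script
    cd $SGE_ROOT
    ./install_execd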

Other Considerations

Even though Sun Grid Engine can function perfectly well without NFS (except for the functionality noted above), there are other considerations which might lead to unexpected behavior.

Home directories

Unless otherwise specified, Grid Engine runs jobs in the user's home directory. If home directories are not shared, then whatever files a job creates will be placed in the home directory on the host where the job is executed. Also, any configuration given in dot-files, such as .cshrc and .sge_request, will be read from the home directory on the host where the job is executed. Finally, if the user's home directory does not exist at all on the compute host, the job will go into an error state. You need to make sure that for every user, and on every compute host, a home directory is present and contains all the desired dot-file configurations. Also, for jobs run with the -cwd flag, the current path is recorded, and when the job executes on the compute host, unless the exact same path is accessible to the user running the job, the job will go into an error state.
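
As a small illustration of the -cwd behavior (the path and script name below are placeholders):

    # On the submit host: the current working directory is recorded
    cd /data/project1     # this path must also exist on the compute host
    qsub -cwd myjob.sh    # otherwise the job goes into an error state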

Application and data files

Obviously, without NFS there needs to be a way to stage data files in and out, and the application files (binaries, libraries, config files, databases, etc.) would also need to be either already present on each compute host or staged in as well. The prolog and epilog script feature of Grid Engine provides a generic mechanism for implementing a site-specific stage-in/stage-out facility. Alternatively, these steps could be embedded directly into job scripts.
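
As a hedged illustration, such stage-in/stage-out scripts (the script paths below are placeholders) can be hooked in through the prolog and epilog parameters described in sge_conf(5) and queue_conf(5):

    # Cluster-wide hooks: "qconf -mconf" opens the global configuration
    # in an editor; set, for example:
    #     prolog   /opt/site/stage_in.sh
    #     epilog   /opt/site/stage_out.sh
    qconf -mconf

    # Alternatively, set prolog/epilog per queue with: qconf -mq <queue>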

User virtualization

If application availability and data file staging were accounted for, one could in principle run Grid Engine without NFS over a WAN. However, part of the Grid Engine built-in authentication is that the username of the user submitting a job must be recognized on the compute host where the job runs. If running across administrative domains, the username might not exist on the target exec host. In this case, some of the solutions include: