Common problems using Grid Engine

Last updated: Aug 26, 2003

This HOWTO covers some commonly seen problems experienced when using Grid Engine, along with appropriate solutions. The information is presented in a tabular chart, using the following scheme:


Category

Symptom

Cause

Resolution

For problems which are not explicitly mentioned here, search for a symptom in the appropriate category which matches your problem as closely as possible, and see if the resolution fixes your particular case.

Categories:




Batch Submit

My output file for my job says

"Warning: no access to tty; thus no job control in this shell..."

One or more of your login files contain an stty command. These commands are only useful if a terminal is present.

In Sun Grid Engine batch jobs, there is no terminal associated with the job. You need to either remove all stty commands from your login files, or bracket them with an 'if' statement which checks for a terminal before executing them. An example for csh is below:

/bin/csh:
stty -g >& /dev/null   # checks terminal status
if ($status == 0) then # succeeds only if a terminal is present
<place all stty commands in here>
endif
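
For Bourne-type shells (.profile), a similar guard can use tty -s; the stty command inside the guard is only an illustrative example:

```shell
# /bin/sh (.profile): run stty only when a terminal is attached.
# 'tty -s' exits 0 if standard input is a terminal, nonzero otherwise.
if tty -s; then
    # place all stty commands here, for example:
    stty erase '^H'
fi
```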

The job standard error log file says:

`tty`: Ambiguous

but there is no reference to tty in the user's shell, which is called in the job script.

shell_start_mode is by default posix_compliant; therefore, all job scripts run with the shell specified in the queue definition, not the one specified on the first line of the job script.

Use the -S flag to qsub, or change shell_start_mode to unix_behavior.
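
For example (the job script and queue names are illustrative; the qconf -mattr pattern matches the stack-size examples later in this document):

```
# Interpret the job script with a specific shell, regardless of
# shell_start_mode:
qsub -S /bin/sh myjob.sh

# Or, as the SGE administrator, change the queue's shell_start_mode
# so that the '#!' line of the job script is honored:
qconf -mattr queue shell_start_mode unix_behavior <queue_name>
```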

I can run my job script from the command line, but it fails when run via qsub.

Process limits may be being set for your job. To test this, write a test script which runs limit and limit -h. Execute it both interactively at the shell prompt and through qsub, and compare the results.

Make sure to remove any commands in shell configuration files that set limits.
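
A minimal sh version of such a test script (in csh, use 'limit' and 'limit -h' instead):

```shell
#!/bin/sh
# Print soft and hard process limits. Run this script both at an
# interactive shell prompt and via qsub, then compare the two outputs.
echo "--- soft limits ---"
ulimit -a
echo "--- hard limits ---"
ulimit -H -a
```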

qsub of a job results in the error "can't set additional group id for job" (seen in administrator or user mail, or shepherd trace file) and puts queue into error state

Possible reasons:

  1. This error can occur if the user already has 16 group IDs assigned. SGE tries to set one additional group ID and fails, because the limit is usually 16.

  2. If you are not running Grid Engine as root, the setgroups() call will fail when trying to set the unique group ID which is used to track all the spawned processes of a job.

    Corresponding solutions:

  1. Check how many group IDs are assigned to the user using 'id -a'. If the user is already at the limit (usually 16), you need to reduce this number or increase the limit in the kernel (NGROUPS_MAX).

  2. Be sure to run the Grid Engine daemons as root.
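
A quick way to check the group count, using standard commands:

```shell
# Show all IDs, then count the group IDs assigned to the current user.
# If the count has reached NGROUPS_MAX (often 16 on older systems),
# SGE cannot add its own tracking group ID.
id -a
echo "group count: $(id -G | wc -w)"
```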

Jobs work when run from command line but fail when run via qsub

Data and executables may not be accessible where needed

The job script itself must be accessible from the submit host. All data and other executables needed by the script must be accessible on the execution host; usually they are shared via NFS.

The unlimited stack size set by default by SGE may cause some applications to crash on some operating systems.

In the job script, use “ulimit” to set a stack size limit before calling the executable that crashes.
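
For example, in a Bourne-shell job script (the 8 MB value and the executable name are illustrative):

```shell
# Set the soft stack limit (in KB) before launching the executable
# that crashes; adjust the value to your application's needs.
ulimit -s 8192
ulimit -s             # verify the new soft limit
# ./my_application    # hypothetical executable name
```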

Or modify the queue to set smaller stack size:

qconf -mattr queue h_stack 8389486 <queue_name> (hard limit in bytes)
qconf -mattr queue s_stack 8389486 <queue_name> (soft limit in bytes)

Monitoring

Exec hosts report a load of 99.99; queue is in “alarm” and/or “unknown” state

There are a few things that could cause your exec hosts to report a load of 99.99:


  1. The execd is not running on the host.

  2. A default domain is incorrectly specified

  3. The qmaster host sees the exec host under a different name than the exec host sees itself.

Depending on the cause, here are the appropriate solutions:


  1. Start up the execd as root on the host by running the $SGE_ROOT/default/common/rcsge script.

  2. Run 'qconf -mconf' as the Sun Grid Engine administrator and change default_domain to none.

  3. Set IGNORE_FQDN=TRUE for qmaster_params in the cluster configuration. See also the sge_h_aliases(5) man page and the loadinfo AppNote/HOWTO for more information.
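
To compare how the qmaster and the exec host resolve the host's name (a quick check, assuming a standard resolver):

```shell
# Run this on both the qmaster and the exec host, and compare:
hostname                      # the name the host reports for itself
getent hosts "$(hostname)" || echo "(no resolver entry for this name)"
```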

Miscellaneous Error Messages

A warning is printed to <cell>/spool/<host>/messages every 30 seconds. The messages look like this:


Tue Jan 23 21:20:46 2001|execd|meta|W|local 
configuration meta not defined - using global configuration 

But there IS a file for each host in <cell>/common/local_conf/, each named with the FQDN.

Hostname resolution on your machine "meta" returns the short name, while on your master machine the FQDN of "meta" is returned.

Make sure that all of your /etc/hosts files and your NIS table are consistent in this respect. In this example, there could be a line like


168.0.0.1 meta meta.your.domain

in /etc/hosts of the host "meta" while it should be instead


168.0.0.1 meta.your.domain meta

Occasionally I see "CHECKSUM ERROR", "WRITE ERROR" or "READ ERROR" messages in the "messages" files of the daemons. Do I need to worry about these?


As long as these messages do not appear every second (they typically appear between 1 and 30 times per day), there is no need to do anything about them.

Jobs finish on a particular queue, and qmaster/messages reports:

Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1 finished on host exechost 

But then the following errors are seen on the exec host:


exechost/messages:


Wed Mar 28 10:57:15 2001|execd|exechost|E|can't find directory 
"active_jobs/490.1" for reaping job 490.1


exechost/messages:


Wed Mar 28 10:57:15 2001|execd|exechost|E|can't remove directory 
"active_jobs/490.1": opendir(active_jobs/490.1) failed: Input/output error 

The $SGE_ROOT directory, which is automounted, is being unmounted, causing the sge_execd to lose its cwd.

Use a local spool directory for your execd host. Set the parameter execd_spool_dir using qmon or qconf.

“critical error: can't connect commd”
“critical error: setup failed starting cod_schedd”

A bug on 32 bit systems: rlim_fd_max > 1024 in /etc/system

Set rlim_fd_max to < 1024. Or update to SGE 5.3p2 or higher

The actual hostname <myhostname> of the machine is listed as an alias for localhost in /etc/hosts, like this:

127.0.0.1   localhost  myhostname

Remove <myhostname> as an alias for localhost, and put <myhostname> after the real IP address in /etc/hosts.

Multiple queues cascade into error state, rendering the grid unusable.

Errors in a user's .cshrc/.profile result in all queues being set into error state.

  1. Fix errors in users' .cshrc/.profile

  2. Use the -f option in the first line of the job script (i.e. use “#!/bin/csh -f”) to bypass the user's .cshrc

Performance

Memory leak and huge memory consumption for schedd on large systems

Parameter sched_job_info=true

Set sched_job_info= false or update to release 5.3p3 or higher

Configuration

max_u_jobs doesn't work as expected.

It doesn't work exactly the same way in all versions of the product – and affects scheduling differently depending on whether the product is used in SGE or SGEEE mode.

Update to SGE 5.3p2 (or higher) which contains the latest implementation.

Qrsh/Interactive Jobs

Submitting interactive jobs with qrsh, I get the error:


% qrsh -l mem_free=1G
error: error: no suitable queues

Yet queues are available for batch jobs using qsub, and can be queried using qhost -l mem_free=1G and qstat -f -l mem_free=1G.

The message "error: no suitable queues" results from the "-w e" submit option, which is active by default for interactive jobs like qrsh (look for "-w e" in qrsh(1)). This option causes the submit command to fail if the qmaster does not know for sure that the job will be dispatchable under the current cluster configuration. The intention of this mechanism is to decline job requests in advance when they cannot be granted.

In this case 'mem_free' is configured to be a consumable resource, but you have not specified the amount of memory available at each host. The memory load values are deliberately not considered for this check because they vary, so they cannot be seen as part of the cluster configuration. To overcome this you can do one of the following:


  • omit this check entirely by overriding qrsh's default "-w e" setting, i.e. by submitting with "-w n" (this can also be put into $SGE_ROOT/<cell>/common/sge_request)

  • if you intend to manage 'mem_free' as a consumable resource, specify the 'mem_free' capacity for your hosts in 'complex_values' of host_conf(5) by using 'qconf -me <hostname>'

  • if you do not intend to manage 'mem_free' as a consumable resource, make it a non-consumable resource again in the 'consumable' column of complex(5) by using 'qconf -mc host'
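
For instance, the host configuration opened with 'qconf -me <hostname>' could declare the memory capacity as follows (hostname and value are illustrative):

```
hostname          exechost1
complex_values    mem_free=4G
```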

qrsh won't dispatch to the same node it is on. From a qsh shell:


host2 [49]% qrsh -inherit host2 hostname
error: executing task of job 1 failed: 

host2 [50]% qrsh -inherit host4 hostname
host4

gid_range is not sufficient. It should be defined as a range, not a single number, because SGEEE assigns each job on a host a distinct gid.

Adjust gid_range using 'qconf -mconf' or qmon. The suggested range is:


gid_range                 20000-20100

When I do a qrsh, I get this error:


% qrsh
error: 1: can't set additional group id for job

This error can occur if the user already has 16 group IDs assigned. SGE tries to set one additional group ID and fails, because the limit is usually 16.

Check how many group IDs are assigned to the user using 'id -a'. If the user is already at the limit (usually 16), you need to reduce this number or increase the limit in the kernel (NGROUPS_MAX).

qrsh -inherit -V does not work when used inside a parallel job:


cannot get connection to "qlogin_starter"

This problem occurs with nested qrsh calls and is due to the -V switch. The first qrsh -inherit call sets the environment variable TASK_ID (the ID of the tightly integrated task within the parallel job). The second qrsh -inherit call then uses this environment variable when registering its task, which fails because it tries to start a task with the same ID as the already running first task.

You can either:

  • unset TASK_ID before calling qrsh -inherit, or

  • not use the -V switch; use -v instead and export only the environment variables that are really needed.

qrsh does not seem to work at all:


host2$ qrsh -verbose hostname
local configuration host2 not defined - using global configuration 
waiting for interactive job to be scheduled ...
Your interactive job 88 has been successfully scheduled.
Establishing /share/gridware/utilbin/solaris64/rsh session to host exehost ...
rcmd: socket: Permission denied
/share/gridware/utilbin/solaris64/rsh exited with exit code 1
reading exit code from shepherd ... 
error: error waiting on socket for client to connect: Interrupted system call
error: error reading returncode of remote command
cleaning up after abnormal exit of /share/gridware/utilbin/solaris64/rsh
host2$ 

The permissions for the qrsh support binaries are not set properly.

Check the permissions of the following files. They are located in $SGE_ROOT/utilbin/<arch>/.

Note that rlogin and rsh need to be setuid and owned by root.

-r-s--x--x 1 root root 28856 Sep 18 06:00 rlogin*
-r-s--x--x 1 root root 19808 Sep 18 06:00 rsh*
-rwxr-xr-x 1 sgeadmin adm 128160 Sep 18 06:00 rshd*


NOTE: the $SGE_ROOT directory also needs to be NFS-mounted with the "setuid" option. If it is mounted with "nosuid" from your submit client, then qrsh (and associated commands) will not work.

Interactive jobs fail when run via qsh, without error message.

DISPLAY variable may be set incorrectly

Set DISPLAY correctly. Or, to get error messages for this situation, upgrade to release 5.3p2 or higher.
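
For example, in a Bourne shell before submitting (the display host name is illustrative):

```
DISPLAY=mydesktop:0.0
export DISPLAY
qsh
```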

Qmake

When trying to start a distributed make qmake exits with the following error message:

qrsh_starter: executing child process qmake failed: No such file or directory

Grid Engine will start an instance of qmake on the execution host. If the Grid Engine environment (especially the PATH) is not set up in the user's shell resource file (.profile/.cshrc), this qmake call will fail.

Use the -v option to export the PATH to the qmake job. A typical qmake call is

qmake -v PATH -cwd -pe make 2-10 --

When doing qmake, the error seen is:


waiting for interactive job to be scheduled ...timeout (4 s) 
expired while waiting on socket fd 5

Your "qrsh" request could not be scheduled, try again later.

ARCH variable could be set incorrectly in shell which called qmake



Set the ARCH variable to a supported value matching a host available in your cluster, or else specify the correct value at submit time, e.g.,

qmake -v ARCH=solaris64 ...

Qmon

Why am I having refresh problems when running Qmon?

Your system needs to be updated with the proper X server patch.

X server patch 105633-48 fixes the problem of icons not refreshing properly, as well as icons painting over non-qmon windows. It is important to install the patch in console mode (no X running); otherwise the patch installation will fail.

Parallel/Checkpointing

Parts of Sun HPC ClusterTools parallel jobs (the job script itself, child processes, etc.) fail to stop when terminated by the user or by the qmaster.

The user may not have supplied the necessary means (scripts) for SGE to control the distributed jobs.

Follow the complete HOWTO instructions on integration between Grid Engine and HPC ClusterTools.

Bugs in early versions of loose integration package

Update to SGE 5.3p2 (or higher) which includes latest MPI loose integration package

Parallel jobs that run with the tight integration of SGE 5.3.x and HPC CT 5 are not terminated if one of the queues has a wall clock limit set.

A bug in SGE prevented correct signal delivery to all parallel processes

SGE 5.3p4 contains the fix; for earlier 5.3.x versions, get corresponding patches from Sunsolve:

SGE: 113136-04 (pkgadd Solaris 32-bit); 113137-04 (pkgadd Solaris 64-bit); 113138-04 (pkgadd Solaris X86); 113663-02 (pkgadd common pkg); 113849-03 (tar.gz Solaris 32-bit); 113850-03 (tar.gz Solaris 64-bit); 113851-03 (tar.gz Solaris X86); 113852-04 (tar.gz Linux); 113853-02 (tar.gz common package)

SGEEE: 113139-04 (pkgadd Solaris 32-bit); 113140-04 (pkgadd Solaris 64-bit); 113636-03 (pkgadd common pkg); 113855-03 (tar.gz Solaris 32-bit); 113856-03 (tar.gz Solaris 64-bit); 113900-02 (tar.gz Linux); 113857-02 (tar.gz common package)

Parallel jobs that run with the tight integration of SGE 5.3.x and HPC CT 5 do not suspend and resume correctly.

Another bug in SGE prevented STOP and CONT signals from being correctly delivered to all processes.

You need to set the suspend/resume methods in the queues used for the parallel jobs to the appropriate scripts. These scripts can either be downloaded from the Grid Engine download site or obtained from Sun support.

Releases beyond 5.3p4 will ship with these two scripts, a README file and a parallel environment template.

Shadow Facility

After failover to shadow master, the schedd daemon remains running on the original qmaster

This is a bug in earlier versions of SGE.

Update to 5.3p2 or higher

Shadow host fails to take over mastership of the SGE cluster.

Lock file exists.

Remove the $SGE_ROOT/<cell>/spool/qmaster/lock file if the master host has crashed or can no longer function as qmaster.
NOTE: to force the shadow host to take over from another master, use the “migrate” option, i.e., “rcsge -migrate”.

Root read/write access to the $SGE_ROOT directory and its sub-directories must be possible from both the master and the shadow host.

Adjust permissions so that root has read/write access to the $SGE_ROOT directory and its sub-directories from the shadow host.

NOTE: please see the Shadow Master HOWTO for more information.