Last updated:
This HOWTO goes over some commonly seen problems experienced when using Grid Engine, together with the appropriate solutions. The information is presented as a series of entries, each giving the Symptom, its Cause, and the corresponding Resolution.
For problems that are not explicitly mentioned here, look for a symptom that matches your problem as closely as possible, and see whether the resolution also fixes your particular case.
Symptom: My output file for my job says "Warning: no access to tty; thus no job control in this shell...".
Cause: One or more of your login files contain an stty command. These commands are only useful if a terminal is present, and in Sun Grid Engine batch jobs there is no terminal associated with the job.
Resolution: Either remove all stty commands from your login files, or bracket them with an if statement that checks for a terminal before executing them. An example for /bin/csh:

    stty -g                 # checks terminal status
    if ($status == 0) then  # succeeds only if a terminal is present
        <place all stty commands in here>
    endif

Symptom: The job's standard error log file says "`tty`: Ambiguous", but there is no reference to tty in the user's shell that is called in the job script.
Cause: shell_start_mode is posix_compliant by default; therefore all job scripts run with the shell specified in the queue definition, not the one specified on the first line of the job script.
Resolution: Use the -S flag to qsub, or change shell_start_mode to unix_behavior (see the example below).

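For example, assuming a job script named myjob.sh and a queue called <queue_name> (both placeholders), either of these approaches should work; the qconf -mattr form follows the same pattern as the stack-size commands later in this HOWTO:

    # Option 1: tell qsub which shell should interpret the job script
    qsub -S /bin/csh myjob.sh

    # Option 2: make the queue honor the #! line of submitted scripts
    qconf -mattr queue shell_start_mode unix_behavior <queue_name>
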
Symptom: I can run my job script from the command line, but it fails when run via qsub.
Cause: Process limits may be being set for your job. To test this, write a test script that runs limit and limit -h, then execute it both interactively at the shell prompt and through qsub and compare the results (see the example below).
Resolution: Remove any commands in configuration files that set limits in your shell.

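A minimal csh test script along these lines (the file name test_limits.csh is arbitrary) makes the comparison straightforward:

    #!/bin/csh
    # test_limits.csh - print the soft and hard process limits this shell sees
    echo "soft limits:"
    limit
    echo "hard limits:"
    limit -h

Run it once directly from your shell and once with qsub test_limits.csh, then compare the interactive output with the job's output file.
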
Symptom: qsub of a job results in the error "can't set additional group id for job" (seen in administrator or user mail, or in the shepherd trace file) and puts the queue into an error state.
Possible reasons and corresponding solutions: see the gid_range entry and the "can't set additional group id" entry for qrsh later in this HOWTO, which cover the same error.

Symptom: Jobs work when run from the command line but fail when run via qsub.
Cause: Data and executables may not be accessible where they are needed.
Resolution: The job script itself must be accessible from the submit host; all data and other executables needed by the script must be accessible on the execution host, usually by sharing them via NFS.
Cause: The unlimited stack size set by default by SGE may cause some applications to crash on some operating systems.
Resolution: In the job script, use ulimit to set a stack size limit before calling the executable that crashes (see the example below), or modify the queue to set a smaller stack size:

    qconf -mattr queue h_stack 8389486 <queue_name>   # hard limit in bytes
    qconf -mattr queue s_stack 8389486 <queue_name>   # soft limit in bytes

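A sketch of the first approach, assuming a Bourne-shell job script and a placeholder application name (the 8192 KB value is only an example):

    #!/bin/sh
    # limit the stack size for everything started from this job script
    # (ulimit -s takes kilobytes; 8192 KB = 8 MB here, adjust as needed)
    ulimit -s 8192
    ./my_application
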
Symptom: Exec hosts report a load of 99.99; the queue is in "alarm" and/or "unknown" state.
Cause: There are a few things that can cause an exec host to report a load of 99.99; the value generally means that the qmaster is not receiving load reports from sge_execd on that host.
Resolution: Depends on the cause. Check that sge_execd is running on the host and can reach the qmaster, and that hostname resolution is consistent on both sides; the next entry describes one common hostname-resolution problem.

Symptom: A warning is printed to <cell>/spool/<host>/messages every 30 seconds. The messages look like this:

    Tue Jan 23 21:20:46 2001|execd|meta|W|local configuration meta not defined - using global configuration

But there IS a file for each host in <cell>/common/local_conf/, each named with the FQDN.
Cause: Hostname resolution on the machine "meta" returns the short name, while on the master machine "meta" resolves to the fully qualified domain name (FQDN).
Resolution: Make sure that all of your /etc/hosts files and your NIS table are consistent in this respect. In this example, /etc/hosts on the host "meta" might contain a line like

    168.0.0.1 meta meta.your.domain

while it should instead be

    168.0.0.1 meta.your.domain meta

Symptom: Occasionally I see "CHECKSUM ERROR", "WRITE ERROR" or "READ ERROR" messages in the "messages" files of the daemons. Do I need to worry about these?
Resolution: As long as these messages do not appear in a one-second interval (they can show up occasionally during normal operation), there is no need to worry about them.

Symptom: Jobs finish on a particular queue; qmaster/messages reports:

    Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1 finished on host exechost

but then the following errors are seen on the exec host in exechost/messages:

    Wed Mar 28 10:57:15 2001|execd|exechost|E|can't find directory "active_jobs/490.1" for reaping job 490.1
    Wed Mar 28 10:57:15 2001|execd|exechost|E|can't remove directory "active_jobs/490.1": opendir(active_jobs/490.1) failed: Input/output error

Cause: The $SGE_ROOT directory, which is automounted, is being unmounted, causing sge_execd to lose its current working directory.
Resolution: Use a local spool directory for your execd host. Set the parameter execd_spool_dir using qmon or qconf (see the example below).

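For example (the host name exechost and the spool path are placeholders):

    # create a local configuration for the execution host, or modify an existing one
    qconf -aconf exechost      # or: qconf -mconf exechost
    # in the editor that opens, add or change the line:
    #   execd_spool_dir  /var/spool/sge

Note that sge_execd normally has to be restarted before a new spool directory takes effect, and the directory must exist and be writable by the Grid Engine admin user.
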
Symptom: "critical error: can't connect commd" or "critical error: setup failed starting cod_schedd".
Cause: A bug on 32-bit systems when rlim_fd_max > 1024 is set in /etc/system.
Resolution: Set rlim_fd_max to a value below 1024, or update to SGE 5.3p2 or higher.
Cause: The actual hostname <myhostname> of the machine is listed as an alias of localhost in /etc/hosts, like this: 127.0.0.1 localhost myhostname
Resolution: Remove <myhostname> as an alias of localhost and list <myhostname> after its real IP address in /etc/hosts (see the example below).

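A before-and-after sketch of /etc/hosts (the IP address and names are examples only):

    # problematic: the real hostname is only an alias of the loopback address
    127.0.0.1    localhost myhostname

    # corrected: localhost stays on the loopback line, and the hostname is
    # listed with the machine's real IP address
    127.0.0.1    localhost
    192.168.1.10 myhostname.your.domain myhostname
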
Symptom: Multiple queues cascade into an error state, rendering the grid unusable.
Cause: Errors in a user's .cshrc/.profile result in all queues being set into an error state.
Resolution: Fix the errors in the user's shell startup files, then clear the error state on the affected queues (for example with qmod -c).

Symptom: Memory leak and huge memory consumption by schedd on large systems.
Cause: The scheduler configuration parameter schedd_job_info is set to true, which makes sge_schedd keep scheduling information for every job and, on large systems, consume a great deal of memory.
Resolution: Set schedd_job_info to false in the scheduler configuration (qconf -msconf).

Symptom: max_u_jobs does not work as expected.
Cause: It does not work exactly the same way in all versions of the product, and it affects scheduling differently depending on whether the product is used in SGE or SGEEE mode.
Resolution: Update to SGE 5.3p2 (or higher), which contains the latest implementation.

Symptom: Submitting interactive jobs with qrsh, I get the error:

    % qrsh -l mem_free=1G
    error: error: no suitable queues

Yet queues are available for batch jobs submitted with qsub, and can be queried using qhost -l mem_free=1G and qstat -f -l mem_free=1G.
Cause: The message "error: no suitable queues" results from the "-w e" submit option, which is active by default for interactive jobs such as qrsh (look for "-w e" in the qrsh(1) man page). This option causes the submit command to fail if the qmaster cannot guarantee, from the current cluster configuration, that the job is dispatchable. The intention of this mechanism is to decline, in advance, job requests that cannot be granted. In this case mem_free is configured as a consumable resource, but you have not specified the amount of memory available at each host. The memory load values are deliberately not considered for this check, because they vary and therefore cannot be treated as part of the cluster configuration.
Resolution: Either specify the available memory of each execution host via complex_values in the host configuration, or disable the submit-time check with the -w n option (see the example below).

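A sketch of both options (the host name exechost and the 4G value are examples):

    # Option 1: declare the memory available on the execution host so that
    # the consumable mem_free can be verified at submit time
    qconf -me exechost
    # in the editor, set e.g.:
    #   complex_values  mem_free=4G

    # Option 2: skip the submit-time verification for this request only
    qrsh -w n -l mem_free=1G
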
Symptom: qrsh won't dispatch to the same node it is on. From a qsh shell:

    host2 [49]% qrsh -inherit host2 hostname
    error: executing task of job 1 failed:
    host2 [50]% qrsh -inherit host4 hostname
    host4

Cause: gid_range is not sufficient. It should be defined as a range, not a single number, because SGEEE assigns each job on a host a distinct group id.
Resolution: Adjust gid_range using qconf -mconf or qmon. The suggested range is:

    gid_range 20000-20100

Symptom: When I do a qrsh, I get this error:

    % qrsh
    error: 1: can't set additional group id for job

Cause: This error can occur if the user already has 16 group ids assigned. SGE tries to set one additional group id and fails because the limit is usually 16.
Resolution: Check how many group ids are assigned to the user with id -a. If the user is already at the limit (usually 16), reduce the number of groups or increase the limit in the kernel.

Symptom: qrsh -inherit -V does not work when used inside a parallel job:

    cannot get connection to "qlogin_starter"

Cause: This problem occurs with nested qrsh calls and is caused by the -V switch. The first qrsh -inherit call sets the environment variable TASK_ID (the id of the tightly integrated task within the parallel job). The second qrsh -inherit call then uses this inherited variable when registering its task, which fails because it tries to start a task with the same id as the already running first task.
Resolution: Either make sure TASK_ID is not passed on to the nested call, or export only the environment variables you actually need with -v instead of using -V (see the sketch below).

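One possible workaround, sketched here for a csh environment; MYVAR, <host> and <command> are placeholders, and this assumes nothing else in the nested task relies on the inherited TASK_ID:

    # alternative 1: keep -V, but make sure TASK_ID is not passed on (csh syntax)
    unsetenv TASK_ID
    qrsh -inherit -V <host> <command>

    # alternative 2: drop -V and export only the variables that are really needed
    qrsh -inherit -v MYVAR <host> <command>
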
Symptom: qrsh does not seem to work at all:

    host2$ qrsh -verbose hostname
    local configuration host2 not defined - using global configuration
    waiting for interactive job to be scheduled ...
    Your interactive job 88 has been successfully scheduled.
    Establishing /share/gridware/utilbin/solaris64/rsh session to host exehost ...
    rcmd: socket: Permission denied
    /share/gridware/utilbin/solaris64/rsh exited with exit code 1
    reading exit code from shepherd ...
    error: error waiting on socket for client to connect: Interrupted system call
    error: error reading returncode of remote command
    cleaning up after abnormal exit of /share/gridware/utilbin/solaris64/rsh
    host2$

Cause: Permissions for qrsh are not set properly.
Resolution: Check the permissions of the following files, located in $SGE_ROOT/utilbin/. Note that rlogin and rsh need to be setuid and owned by root:

    -r-s--x--x  1 root      root    28856 Sep 18 06:00 rlogin*
    -r-s--x--x  1 root      root    19808 Sep 18 06:00 rsh*
    -rwxr-xr-x  1 sgeadmin  adm    128160 Sep 18 06:00 rshd*

NOTE: the $SGE_ROOT directory also needs to be NFS-mounted with the "setuid" option. If it is mounted with "nosuid" from your submit client, qrsh (and associated commands) will not work.

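If the ownership or modes have drifted, something along the following lines restores them; this must be run as root, and the solaris64 architecture directory matches the output shown above but may differ on your system:

    cd $SGE_ROOT/utilbin/solaris64
    chown root rlogin rsh     # rlogin and rsh must be owned by root
    chmod 4511 rlogin rsh     # setuid-root, matching -r-s--x--x above
    chmod 755 rshd            # rshd stays non-setuid (-rwxr-xr-x)
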
Symptom: Interactive jobs fail when run via qsh, without an error message.
Cause: The DISPLAY variable may be set incorrectly.
Resolution: Set DISPLAY correctly; or, to get error messages for this situation, upgrade to release 5.3p2 or higher.

Symptom: When trying to start a distributed make, qmake exits with the following error message:

    qrsh_starter: executing child process qmake failed: No such file or directory

Cause: Grid Engine starts an instance of qmake on the execution host. If the Grid Engine environment (especially the PATH) is not set up in the user's shell resource file (.profile/.cshrc), this qmake call fails.
Resolution: Use the -v option to export the PATH to the qmake job. A typical qmake call is:

    qmake -v PATH -cwd -pe make 2-10 --

Symptom: When doing qmake, the error seen is:

    waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 5
    Your "qrsh" request could not be scheduled, try again later.

Cause: The ARCH variable may be set incorrectly in the shell that called qmake.
Resolution: Set the ARCH variable to a supported value matching a host available in your cluster, or specify the correct value at submit time, e.g.:

    qmake -v ARCH=solaris64 ...

Symptom: Why am I having refresh problems when running Qmon?
Cause: Your system needs to be updated with the proper X server patch.
Resolution: X server patch 105633-48 fixes the problem of icons not refreshing properly, as well as icons painting over non-qmon windows. It is important to install the patch in console mode (no X running); otherwise the patch installation will fail.

Symptom: Parts of Sun HPC ClusterTools parallel jobs (the job script itself, child processes, etc.) fail to stop when terminated by the user or by the qmaster.
Cause: The user may not have supplied the necessary means (scripts) for SGE to control the distributed jobs.
Resolution: Follow the complete HOWTO instructions on integration between Grid Engine and HPC ClusterTools.
Cause: Bugs in early versions of the loose integration package.
Resolution: Update to SGE 5.3p2 (or higher), which includes the latest MPI loose integration package.

Symptom: Parallel jobs that run with the tight integration of SGE 5.3.x and HPC ClusterTools 5 are not terminated if one of the queues has a wall clock limit set.
Cause: A bug in SGE prevented correct signal delivery to all parallel processes.
Resolution: SGE 5.3p4 contains the fix; for earlier 5.3.x versions, get the corresponding patches from SunSolve:

SGE:
    113136-04 (pkgadd Solaris 32-bit)
    113137-04 (pkgadd Solaris 64-bit)
    113138-04 (pkgadd Solaris X86)
    113663-02 (pkgadd common pkg)
    113849-03 (tar.gz Solaris 32-bit)
    113850-03 (tar.gz Solaris 64-bit)
    113851-03 (tar.gz Solaris X86)
    113852-04 (tar.gz Linux)
    113853-02 (tar.gz common package)

SGEEE:
    113139-04 (pkgadd Solaris 32-bit)
    113140-04 (pkgadd Solaris 64-bit)
    113636-03 (pkgadd common pkg)
    113855-03 (tar.gz Solaris 32-bit)
    113856-03 (tar.gz Solaris 64-bit)
    113900-02 (tar.gz Linux)
    113857-02 (tar.gz common package)

Symptom: Parallel jobs that run with the tight integration of SGE 5.3.x and HPC ClusterTools 5 do not suspend and resume correctly.
Cause: Another bug in SGE prevented STOP and CONT signals from being correctly delivered to all processes.
Resolution: Set the suspend/resume methods in the queues used for the parallel jobs to the appropriate scripts. These scripts can either be downloaded from the Grid Engine download site or obtained from Sun support. Releases beyond 5.3p4 ship with these two scripts, a README file, and a parallel environment template.

Symptom: After failover to the shadow master, the schedd daemon remains running on the original qmaster host.
Cause: This is a bug in earlier versions of SGE.
Resolution: Update to 5.3p2 or higher.

Symptom: The shadow host fails to take over mastership of the SGE cluster.
Cause: A lock file exists.
Resolution: Remove the $SGE_ROOT/<cell>/spool/qmaster/lock file if the master host has crashed or can no longer function as qmaster.
Cause: Root read/write access to the $SGE_ROOT directory and its sub-directories is required from both the master and the shadow host.
Resolution: Adjust permissions so that root has read/write access to the $SGE_ROOT directory and its sub-directories from the shadow host. NOTE: please see the Shadow Master HOWTO.