Grid Engine Troubleshooting

Problem with pending jobs not being dispatched

Sometimes a pending job is obviously runnable, but does not get dispatched. Grid Engine can be asked for the reason:

qstat -j <jobid>

If enabled, qstat -j <jobid> lists the reasons why a job was not dispatched in the last scheduling run. This monitoring can be enabled or disabled because it can cause undesired communication overhead between Schedd and Qmaster (see 'schedd_job_info' in the Grid Engine sched_conf(5) man page). Here is a sample output:

% qstat -j 242059
scheduling info: queue "fangorn.q" dropped because it is temporarily not available
                 queue "lolek.q" dropped because it is temporarily not available
                 queue "balrog.q" dropped because it is temporarily not available
                 queue "saruman.q" dropped because it is full
                 cannot run in queue "bilbur.q" because it is not contained in its hard queue list (-q)
                 cannot run in queue "dwain.q" because it is not contained in its hard queue list (-q)
                 has no permission for host "ori"

This information is generated directly by Schedd and takes the current utilization of the cluster into account. Sometimes this is not exactly what you are interested in: for example, if all queue slots are already occupied by jobs of other users, no detailed message is generated for the job you are interested in.
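When many queues are configured, the scheduling info can run to dozens of lines. As a rough sketch, the helper below groups those lines by reason and counts them. The function name `summarize_sched_info` is made up for illustration, and the sketch assumes the "scheduling info:" section is the last part of the qstat -j output, as in the sample above.

```shell
# Summarize the "scheduling info" section of `qstat -j <jobid>` output.
# Reads the qstat output on stdin and prints "<count><TAB><reason>" lines,
# most frequent reason first. Quoted queue/host names are replaced by "*"
# so that identical reasons for different queues are counted together.
summarize_sched_info() {
    awk '
        /^scheduling info:/ { in_info = 1; sub(/^scheduling info:/, "") }
        in_info {
            sub(/^[ \t]+/, "")            # drop the indentation
            if ($0 == "") next            # skip empty lines
            gsub(/"[^"]*"/, "\"*\"")      # mask the quoted queue or host name
            count[$0]++
        }
        END { for (r in count) printf "%d\t%s\n", count[r], r }
    ' | sort -rn
}
```

Typical use would be `qstat -j 242059 | summarize_sched_info`.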

qalter -w v <jobid>

This command lists the reasons why a job is not dispatchable in principle. For this purpose a dry scheduling run is performed. What is special about this dry run is that all consumable resources (including slots) are considered to be fully available for this job. Similarly, all load values are ignored because they vary over time.
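To check several pending jobs at once, the dry run can be wrapped in a small loop. This is only a sketch: the function name `verify_pending_jobs` is hypothetical, and it assumes that `qstat -s p -u <user>` prints the numeric job ID in the first column of each job line.

```shell
# Run `qalter -w v` for every pending job of one user.
# Usage: verify_pending_jobs [user]   (defaults to the current user)
verify_pending_jobs() {
    user=${1:-$(id -un)}
    qstat -s p -u "$user" |
        awk '$1 ~ /^[0-9]+$/ { print $1 }' |   # keep job lines, skip headers
        sort -un |                             # array jobs may appear repeatedly
        while read -r jobid; do
            printf '== job %s ==\n' "$jobid"
            # </dev/null keeps qalter from consuming the loop's input
            qalter -w v "$jobid" </dev/null
        done
}
```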

Job or queue goes into error state "E"

Job or queue errors are indicated by an uppercase "E" in the qstat output. A job enters the error state when Grid Engine tried to execute it in a queue, but execution failed for a reason that is specific to the job. A queue enters the error state when Grid Engine tried to execute a job in it, but execution failed for a reason that is specific to the queue.

Grid Engine offers several ways for users and administrators to get diagnosis information in case of job execution errors. Since both the queue and the job error state result from a failed job execution, these diagnosis options apply to both types of error state:

query for job error reason (not before 6.0)

Since Grid Engine 6.0, a one-line error reason for jobs in error state is available through
```
qstat -j <jobid> | grep error
```
With Grid Engine 6.0 and later, this is the recommended first source of diagnosis information for the end user.

query for queue error reason (not before 6.0)

Since Grid Engine 6.0, a one-line error reason for queues in error state is available through
```
qstat -explain E
```
With Grid Engine 6.0 and later, this is the recommended first source of diagnosis information for administrators in case of queue errors.

user abort mail

If jobs are submitted with the submit option "-m a", an abort mail is sent to the address specified with the "-M user[@host]" option. The abort mail contains diagnosis information about job errors and is the recommended source of information for users.
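A minimal sketch of such a submission is shown below. The wrapper name `submit_with_abort_mail` is hypothetical; it only passes the two options described above to qsub, followed by the job script and its arguments.

```shell
# Submit a job so that an abort mail is sent to the given address.
# Usage: submit_with_abort_mail <mail-address> <job-script> [script args...]
submit_with_abort_mail() {
    mail_to=$1
    shift
    qsub -m a -M "$mail_to" "$@"
}
```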

qacct accounting

If no abort mail is available the user can run
```
qacct -j <jobid>
```
to get information about the job error from Grid Engine job accounting.
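The accounting record contains many fields, only a few of which matter for error diagnosis. The sketch below filters qacct output read from stdin. The function name is made up, and the field names (qname, hostname, failed, exit_status) are the ones I would expect in the record; adjust the list if your version prints different names.

```shell
# Print the fields of `qacct -j <jobid>` output that help with failure
# diagnosis, one "key=value" pair per line. Reads qacct output on stdin.
qacct_failure_fields() {
    awk '$1 == "qname" || $1 == "hostname" || $1 == "failed" || $1 == "exit_status" {
        key = $1
        sub(/^[^ \t]+[ \t]+/, "")   # strip the field name and padding
        sub(/[ \t]+$/, "")          # strip trailing whitespace
        printf "%s=%s\n", key, $0
    }'
}
```

Typical use would be `qacct -j <jobid> | qacct_failure_fields`.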

administrator abort mail

An administrator can request administrator mails about job execution problems by specifying an appropriate email address (see 'administrator_mail' in the Grid Engine sge_conf(5) man page). Administrator mails contain more detailed diagnosis information than user abort mails and are the recommended source in case of frequent job execution errors.

messages files

If no administrator mail is available, the Qmaster messages file should be investigated first. Log entries related to a certain job can be found by searching for the appropriate job ID. In the 'default' installation the Qmaster messages file is located at
Additional information can sometimes be found in the messages file of the Execd where the job was started. Use qacct -j <jobid> to figure out the host where the job was started and search in
for the jobid.
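Searching for a job ID can be sketched as follows. The function name `job_messages` is hypothetical, and the messages file locations are passed in explicitly because they depend on the installation. The `-w` option (match whole words only) is not strictly POSIX but is available in the common grep implementations; it keeps job 242059 from also matching 1242059.

```shell
# Search one or more messages files for log lines that mention a job ID.
# Usage: job_messages <jobid> <messages-file>...
job_messages() {
    jobid=$1
    shift
    grep -w -- "$jobid" "$@"
}
```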

qmaster or other Grid Engine daemons keep crashing

Tell the SGE daemons not to daemonize by setting the environment variable SGE_ND.

Start the SGE daemon from a shell. The daemon will now print debug information to standard output.

Also, you may want to run the daemon under a debugger or strace (on Linux) to identify the location of the crash (and file a bug report!).
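The steps above can be sketched as a small wrapper that sets SGE_ND for one command only and keeps a copy of the output for a bug report. The function name `run_undaemonized` is made up, and the value `1` is an assumption: the text above only says the variable must be set.

```shell
# Start a daemon in the foreground with SGE_ND set and save its output.
# Usage: run_undaemonized <logfile> <daemon-command> [args...]
run_undaemonized() {
    logfile=$1
    shift
    # SGE_ND is set only for this command; stderr is merged into stdout
    # so that both end up on the terminal and in the log file.
    SGE_ND=1 "$@" 2>&1 | tee "$logfile"
}
```

The daemon command is whatever binary you normally start, for example the sge_qmaster executable of your installation. On Linux the same wrapper can be given an strace command line instead.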

Alternatively, you can control the level of verbosity with the following shell commands:

source $SGE_ROOT/util/dl.sh (bash/sh) - dl.csh for csh/tcsh

dl level, where level = 0 (no debug output) to 10 (most detailed debug messages)

Then start the SGE daemon from the same shell. Any SGE daemon started from that session will write its debugging messages to standard output. You can also start the daemon under a debugger or use strace and other tools to debug it.
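Put together, the sequence might look like the sketch below for sh or bash. The function name `run_with_debug_level` is hypothetical. It assumes SGE_ROOT is set and that dl.sh defines the `dl` function described above.

```shell
# Set the debug level, then run a daemon from the same shell session.
# Usage: run_with_debug_level <level> <daemon-command> [args...]
run_with_debug_level() {
    level=$1
    shift
    # dl.sh must be sourced, not executed, so that `dl` can change
    # the environment of this shell.
    . "$SGE_ROOT/util/dl.sh"
    dl "$level"
    "$@"
}
```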