Sometimes a pending job is obviously runnable, but does not get dispatched. Grid Engine can be asked for the reason:
qstat -j <jobid>
If enabled, qstat -j <jobid> lists the reasons why a job was not dispatched in the last scheduling run. This monitoring can be enabled or disabled because it can cause undesired communication overhead between Schedd and Qmaster (see 'schedd_job_info' in the Grid Engine sched_conf(5) man page). Here is a sample output:
% qstat -j 242059
scheduling info: queue "fangorn.q" dropped because it is temporarily not available
                 queue "lolek.q" dropped because it is temporarily not available
                 queue "balrog.q" dropped because it is temporarily not available
                 queue "saruman.q" dropped because it is full
                 cannot run in queue "bilbur.q" because it is not contained in its hard queue list (-q)
                 cannot run in queue "dwain.q" because it is not contained in its hard queue list (-q)
                 has no permission for host "ori"
This information is generated directly by Schedd and takes the current utilization of the cluster into account. Sometimes this is not exactly what you are interested in: for example, if all queue slots are already occupied by jobs of other users, no detailed message is generated for the job you are interested in. In that case, use:
qalter -w v <jobid>
This command lists the reasons why a job is not dispatchable in principle. For this purpose a dry scheduling run is performed. What is special about this dry run is that all consumable resources (including slots) are considered fully available for the job. Similarly, all load values are ignored because they vary over time.
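When a job has a long scheduling info list, the per-queue messages printed by qstat -j <jobid> can be condensed into one count per reason. The following is a minimal sketch and not part of Grid Engine: summarize_sched_info is a hypothetical helper name, and the sketch assumes that the scheduling info block is the last part of the qstat -j output and that queue and host names appear in double quotes, as in the sample output above.

```shell
# Hypothetical helper (not a Grid Engine command): reads `qstat -j <jobid>`
# output on stdin and prints "<count><TAB><reason>" for each distinct reason.
# Assumption: everything from "scheduling info:" to the end of the input
# belongs to the scheduling info block.
summarize_sched_info() {
    awk '
        /^scheduling info:/ { in_info = 1; sub(/^scheduling info:/, "") }
        in_info {
            gsub(/"[^"]*"/, "\"*\"")   # mask queue and host names
            sub(/^[ \t]+/, "")         # trim leading whitespace
            if ($0 != "") count[$0]++  # tally identical reasons
        }
        END { for (r in count) printf "%d\t%s\n", count[r], r }
    ' | sort -rn                       # most frequent reason first
}

# Usage (requires a Grid Engine installation):
#   qstat -j 242059 | summarize_sched_info
```

Masking the quoted names is what lets messages for different queues collapse into a single line, so a reason that affects many queues stands out at the top.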
Job or queue errors are indicated by an uppercase "E" in the qstat output. A job enters the error state when Grid Engine tried to execute the job in a queue, but execution failed for a reason that is specific to the job. A queue enters the error state when Grid Engine tried to execute a job in the queue, but execution failed for a reason that is specific to the queue.
Grid Engine offers users and administrators several ways to obtain diagnostic information when job execution fails. Since both the queue error state and the job error state result from a failed job execution, these diagnostic options apply to both types of error state:
query for job error reason (not before 6.0)
Since Grid Engine 6.0, a one-line error reason is available for jobs in error state through
qstat -j <jobid> | grep error
With Grid Engine 6.0 or later, this is the recommended first source of diagnostic information for the end user.
query for queue error reason (not before 6.0)
Since Grid Engine 6.0, a one-line error reason is available for queues in error state through
qstat -explain E
With Grid Engine 6.0 or later, this is the recommended first source of diagnostic information for administrators in case of queue errors.
user abort mail
If a job is submitted with the submit option "-m a", an abort mail is sent to the address specified with the "-M user[@host]" option. The abort mail contains diagnostic information about job errors and is the recommended source of information for users.
qacct accounting
If no abort mail is available, the user can run
qacct -j <jobid>
to get information about the job error from Grid Engine job accounting.
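The accounting record printed by qacct -j is fairly long, and for a first look at a failed job a few fields are usually enough. A minimal sketch, assuming the record contains the hostname, failed and exit_status fields; job_failure_summary is a hypothetical helper name, not a Grid Engine command:

```shell
# Hypothetical helper: reads `qacct -j <jobid>` output on stdin and keeps
# only the lines for the execution host and the two failure indicators.
job_failure_summary() {
    awk '$1 == "hostname" || $1 == "failed" || $1 == "exit_status" { print }'
}

# Usage (requires a Grid Engine installation with accounting data):
#   qacct -j 242059 | job_failure_summary
```

The hostname line is also the quickest way to learn on which execution host the job was started.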
administrator abort mail
An administrator can request administrator mails about job execution problems by specifying an appropriate email address (see 'administrator_mail' in the Grid Engine sge_conf(5) man page). Administrator mails contain more detailed diagnostic information than user abort mails and are the recommended source in case of frequent job execution errors.
messages files
If no administrator mail is available, the Qmaster messages file should be investigated first. Log entries related to a particular job can be found by searching for its job ID. In a 'default' installation the Qmaster messages file is located at
$SGE_ROOT/default/spool/qmaster/messages
Additional information can sometimes be found in the messages file of the Execd on the host where the job was started. Use qacct -j <jobid> to find out which host that was, and search for the job ID in
$SGE_ROOT/default/spool/<host>/messages
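Searching the messages files for a bare job ID can also match longer numbers that merely contain it. The sketch below matches the ID only when it is not surrounded by other digits; grep_job_messages is a hypothetical helper name, and the file locations are those of the 'default' installation described above.

```shell
# Hypothetical helper: print the lines of the given messages files that
# mention the job ID as a whole number (so 242059 does not match 1242059).
# Usage: grep_job_messages <jobid> <file>...
grep_job_messages() {
    jobid=$1
    shift
    grep -E "(^|[^0-9])${jobid}([^0-9]|\$)" "$@"
}

# Usage (replace <host> with the execution host reported by qacct):
#   grep_job_messages 242059 "$SGE_ROOT/default/spool/qmaster/messages"
#   grep_job_messages 242059 "$SGE_ROOT/default/spool/<host>/messages"
```

When several files are passed at once, grep prefixes each match with the file name, which helps to tell Qmaster and Execd entries apart.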