sge_shepherd - Sun Grid Engine single job controlling agent


     sge_shepherd provides the parent process functionality for a
     single  Sun  Grid  Engine  job.  The parent functionality is
     necessary on UNIX systems to retrieve resource usage  infor-
     mation (see getrusage(2)) after a job has finished. In addi-
     tion, the sge_shepherd forwards signals to the job, such  as
     the  signals  for  suspension, enabling, termination and the
     Sun Grid Engine checkpointing signal (see sge_ckpt(1) for
     details).

     The sge_shepherd receives information about the  job  to  be
     started  from the sge_execd(8).  During the execution of the
     job it actually starts up to 5 child processes. First a pro-
     log  script  is run if this feature is enabled by the prolog
     parameter in the cluster configuration.  (See  sge_conf(5).)
     Next  a parallel environment startup procedure is run if the
     job is a parallel job. (See sge_pe(5) for more information.)
     After  that,  the  job itself is run, followed by a parallel
     environment  shutdown  procedure  for  parallel  jobs,   and
     finally  an epilog script if requested by the epilog parame-
     ter in the cluster  configuration.  The  prolog  and  epilog
     scripts  as  well  as  the  parallel environment startup and
     shutdown procedures are to  be  provided  by  the  Sun  Grid
     Engine  administrator  and  are  intended  for site-specific
     actions to be taken before and after execution of the actual
     user job.
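
     The start order described above can be sketched as a small
     Python function. This is only an illustration: the function
     name, its parameters, and the phase labels are hypothetical
     and are not part of Sun Grid Engine.

```python
def shepherd_phases(prolog=False, parallel=False, epilog=False):
    """Return the child processes, in start order, for a single job.

    prolog / epilog: whether the cluster configuration enables them.
    parallel: whether the job is a parallel job.
    All names here are illustrative assumptions.
    """
    phases = []
    if prolog:
        phases.append("prolog")
    if parallel:
        phases.append("pe_start")   # parallel environment startup
    phases.append("job")            # the job itself always runs
    if parallel:
        phases.append("pe_stop")    # parallel environment shutdown
    if epilog:
        phases.append("epilog")
    return phases
```

     With every option enabled the list has five entries, which
     matches the "up to 5 child processes" stated above.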

     After the job has finished and the  epilog  script  is  pro-
     cessed,  sge_shepherd  retrieves  resource  usage statistics
     about the job, places them in a job specific subdirectory of
     the  sge_execd(8)  spool  directory  for  reporting  through
     sge_execd(8) and finishes.
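
     Why a parent process is needed for resource usage can be
     shown with a minimal sketch: on UNIX, only the parent that
     waits for a child receives that child's usage record. The
     code below is not the sge_shepherd implementation; the
     function name run_and_collect_usage is hypothetical, and it
     uses Python's os.wait4(), which returns the same kind of
     data as getrusage(2).

```python
import os
import sys


def run_and_collect_usage(argv):
    """Run argv as a child process and return (exit_status, rusage).

    exit_status is None if the child was killed by a signal.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace this process with the requested program.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; never return into the caller.
            os._exit(127)
    # Parent: wait4() reaps the child and hands back its resource usage.
    _pid, status, usage = os.wait4(pid, 0)
    exit_status = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    return exit_status, usage
```

     For example, run_and_collect_usage([sys.executable, "-c",
     "pass"]) returns the child's exit status together with a
     record whose fields include ru_utime, ru_stime and ru_maxrss.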

     sge_shepherd also places an exit status file  in  the  spool
     directory.  This  exit  status  can  be viewed with qacct -j
     JobId  (see  qacct(1));  it  is  not  the  exit  status   of
     sge_shepherd  itself  but  of one of the methods executed by
     sge_shepherd. This exit status can  have  several  meanings,
     depending on the method in which an error occurred (if any).
     The possible  methods  are:  prolog,  parallel  start,  job,
     parallel  stop,  epilog, suspend, restart, terminate, clean,
     migrate, and checkpoint.

     The following exit values are returned:

     0      All methods: Operation was executed successfully.
     99     Job script, prolog and epilog: When FORBID_RESCHEDULE
            is  not  set  in the configuration (see sge_conf(5)),
            the job gets re-queued.  Otherwise see "Other".

     100    Job script, prolog and epilog:  When  FORBID_APPERROR
            is  not  set  in the configuration (see sge_conf(5)),
            the job gets re-queued.  Otherwise see "Other".

     Other  Job script: This  is  the  exit  status  of  the  job
            itself.  No  action  is  taken  upon this exit status
            because the meaning of this exit status is not known.
            Prolog, epilog and parallel start: The queue  is  set
            to error state and the job is re-queued.
            Parallel stop: The queue is set to error  state,  but
            the  job is not re-queued. It is assumed that the job
            itself ran successfully and only the clean up  script
            failed.
            Suspend,  restart,  terminate,  clean,  and  migrate:
            Always successful.
            Checkpoint: Success, except for kernel checkpointing,
            where it means that the checkpoint was not successful
            and did not happen (but migration will still be  done
            by Sun Grid Engine).
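
     The table above can be read as a decision procedure. The
     following Python sketch encodes it under stated assumptions:
     the function name, the method labels ("pe_start", "pe_stop")
     and the returned strings are hypothetical, and the two
     boolean flags stand in for FORBID_RESCHEDULE and
     FORBID_APPERROR from sge_conf(5). It mirrors only what this
     page states, not the actual sge_execd(8) logic.

```python
def action_for_exit_status(method, status,
                           forbid_reschedule=False,
                           forbid_apperror=False,
                           kernel_checkpoint=False):
    """Map a method's exit status to the action described in the table."""
    if status == 0:
        return "success"
    if method in ("job", "prolog", "epilog"):
        # 99 and 100 re-queue the job unless the matching flag is set.
        if status == 99 and not forbid_reschedule:
            return "job re-queued"
        if status == 100 and not forbid_apperror:
            return "job re-queued"
    # Everything below is the "Other" row of the table.
    if method == "job":
        return "no action"
    if method in ("prolog", "epilog", "pe_start"):
        return "queue set to error state, job re-queued"
    if method == "pe_stop":
        return "queue set to error state, job not re-queued"
    if method in ("suspend", "restart", "terminate", "clean", "migrate"):
        return "success"
    if method == "checkpoint":
        if kernel_checkpoint:
            return "checkpoint did not happen, migration still performed"
        return "success"
    raise ValueError("unknown method: %r" % (method,))
```

     Note that 99 and 100 fall through to the "Other" row when
     the corresponding flag is set, as the table specifies.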

     sge_shepherd should not be invoked  manually,  but  only  by
     sge_execd(8).

     sgepasswd  contains  a  list  of  user  names   and    their
     corresponding  encrypted  passwords. If available, the pass-
     word  file  will  be   used   by   sge_shepherd.  To  change
     the  contents of this file please use the sgepasswd command.
     It is not advised to change that file manually.

     <execd_spool>/job_dir/<job_id>     job specific directory

     sge_intro(1), sge_conf(5), sge_execd(8).

     See sge_intro(1) for a full statement of rights and  permis-
     sions.
