checkpoint - Sun Grid Engine checkpointing environment
     configuration file format

     Checkpointing is a facility to save the complete status of
     an executing program or job and to restore and restart from
     this so-called checkpoint at a later time if the original
     program or job was halted, e.g. through a system

     Sun Grid Engine provides various levels of checkpointing
     support (see sge_ckpt(1)). The checkpointing environment
     described here is a means to configure the different types
     of checkpointing in use for your Sun Grid Engine cluster or
     parts thereof. For that purpose you can define the
     operations to be executed when initiating a checkpoint
     generation, migrating a checkpoint to another host or
     restarting a checkpointed application, as well as the list
     of queues which are eligible for a checkpointing method.

     Supporting different operating systems may force Sun Grid
     Engine to introduce operating system dependencies into the
     checkpointing configuration file, and updates to the
     supported operating system versions may lead to frequently
     changing implementation details. Please refer to the
     <sge_root>/ckpt directory for more information.

     Please use the -ackpt, -dckpt, -mckpt or -sckpt  options  to
     the  qconf(1)  command  to manipulate checkpointing environ-
     ments from the command-line or use the corresponding qmon(1)
     dialogue for X-Windows based interactive configuration.
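
     The qconf(1) switches named above can be driven from a
     script. The following is a minimal sketch, not part of Sun
     Grid Engine: the helper names, the action keywords and the
     environment name "my_ckpt" are hypothetical, and it assumes
     that each switch takes the checkpointing environment name as
     its argument and that -ackpt, -dckpt, -mckpt and -sckpt mean
     add, delete, modify and show respectively.

```python
import subprocess

# Assumed meaning of the four qconf(1) switches listed in the text.
QCONF_CKPT_SWITCHES = {
    "add": "-ackpt",
    "delete": "-dckpt",
    "modify": "-mckpt",
    "show": "-sckpt",
}


def qconf_ckpt_argv(action, ckpt_name):
    """Build the argument vector for a qconf checkpointing call."""
    if action not in QCONF_CKPT_SWITCHES:
        raise ValueError(f"unknown action: {action!r}")
    return ["qconf", QCONF_CKPT_SWITCHES[action], ckpt_name]


def show_ckpt(ckpt_name):
    """Run 'qconf -sckpt <name>' and return its standard output.

    Requires a working Sun Grid Engine installation with qconf on PATH.
    """
    result = subprocess.run(
        qconf_ckpt_argv("show", ckpt_name),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

     For example, qconf_ckpt_argv("show", "my_ckpt") builds the
     command line for displaying the hypothetical environment
     "my_ckpt" without executing anything.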

     Note that Sun Grid Engine allows backslashes (\) to be used
     to escape newline (\newline) characters. The backslash and
     the newline are replaced with a space (" ") character before
     any
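
     The continuation rule can be illustrated with a short sketch.
     It only mimics the replacement described above and is not
     the Sun Grid Engine parser; the function name and the sample
     path are hypothetical.

```python
def join_continuations(text):
    """Replace every backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# Hypothetical two-line entry that is continued with a backslash.
sample = "ckpt_command /hypothetical/path/ckpt.sh \\\n--verbose\n"
joined = join_continuations(sample)
```

     After the call, the two physical lines of the sample form one
     logical line, with a space where the backslash and newline
     used to be.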

     The format of a checkpoint file is defined as follows:

     The name of the checkpointing environment as defined for
     ckpt_name in sge_types(1). It is used with the qsub(1) -ckpt
     switch or with the qconf(1) options mentioned above.

     The type of checkpointing to be used. Currently, the follow-
     ing types are valid:

          The   Hibernator   kernel   level   checkpointing    is

     cpr  The SGI kernel level checkpointing is used.

     cray-ckpt
          The Cray kernel level checkpointing is assumed.

          Sun Grid Engine assumes that the jobs submitted with
          reference to this checkpointing interface use a
          checkpointing library such as the one provided by the
          public domain package Condor.

          Sun Grid Engine assumes that the  jobs  submitted  with
          reference to this checkpointing interface perform their
          private checkpointing method.

          Uses all of the interface commands configured in the
          checkpointing object, as with the kernel level
          checkpointing interfaces (cpr, cray-ckpt, etc.), except
          for the restart_command (see below). The restart_command
          is not used, even if it is configured; the job script
          is invoked on restart instead.

     A command-line type command string to  be  executed  by  Sun
     Grid Engine in order to initiate a checkpoint.

     A command-line type command string to  be  executed  by  Sun
     Grid  Engine  during a migration of a checkpointing job from
     one host to another.

     A command-line type command string to be executed by Sun
     Grid Engine when restarting a previously checkpointed
     application.

     A command-line type command string to  be  executed  by  Sun
     Grid  Engine in order to cleanup after a checkpointed appli-
     cation has finished.

     A file system location to which checkpoints  of  potentially
     considerable size should be stored.

     A Unix signal to be sent to a job by Sun Grid Engine to
     initiate a checkpoint generation. The value for this field
     can be either a symbolic name from the list produced by the
     -l option of the kill(1) command or an integer that is a
     valid signal on the systems used for checkpointing.
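
     A value for this field can be checked ahead of time with a
     sketch like the following. It is an illustration only: the
     function name is hypothetical, names are accepted with or
     without a leading "SIG" as an assumption, and the check
     covers only the signals of the host running the script, not
     every system used for checkpointing.

```python
import signal


def resolve_ckpt_signal(value):
    """Return the signal number for a symbolic name or numeric string.

    Raises ValueError if the value is not a valid signal on this host.
    """
    text = str(value).strip()
    if text.isdigit():
        number = int(text)
        # valid_signals() lists the signal numbers known to this platform.
        if number in {int(s) for s in signal.valid_signals()}:
            return number
        raise ValueError(f"not a valid signal number here: {number}")
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None
```

     Both resolve_ckpt_signal("USR1") and
     resolve_ckpt_signal("SIGUSR1") return the number of SIGUSR1
     on the local host.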

     The points in time at which checkpoints are expected to be
     generated. Valid values for this parameter are composed of
     the letters s, m, x and r and any combination thereof,
     without any separating character in between. The same
     letters are allowed for the -c option of the qsub(1)
     command, which overrides the definitions in the
     checkpointing environment in use. The meaning of the letters
     is defined as follows:

     s    A job is checkpointed, aborted and if possible migrated
          if  the  corresponding sge_execd(8) is shut down on the
          job's machine.

     m    Checkpoints   are   generated   periodically   at   the
          min_cpu_interval  interval  defined  by  the queue (see
          queue_conf(5)) in which a job executes.

     x    A job is checkpointed, aborted and if possible migrated
          as  soon as the job gets suspended (manually as well as

     r    A job will be rescheduled (not checkpointed) when the
          host on which the job currently runs enters the unknown
          state and the time interval reschedule_unknown (see
          sge_conf(5)) defined in the global/local cluster
          configuration is exceeded.
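
     The rule for composing this value from the letters s, m, x
     and r can be expressed as a small check. This is a sketch
     under stated assumptions, not Sun Grid Engine code: the
     names are hypothetical, the one-line summaries paraphrase
     the list above, and rejecting an empty value is an
     assumption.

```python
# One-line summaries of the letters described in the list above.
WHEN_LETTERS = {
    "s": "checkpoint when sge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint when the job is suspended",
    "r": "reschedule when the host state is unknown",
}


def parse_when(value):
    """Validate a letter string and return the set of letters used."""
    unknown = sorted(set(value) - set(WHEN_LETTERS))
    if not value or unknown:
        raise ValueError(f"invalid value {value!r}; unknown letters: {unknown}")
    return set(value)
```

     For example, parse_when("sx") returns the set containing "s"
     and "x", while a value with a separating character such as
     "s,x" is rejected.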

     Note that the checkpointing, migration and restart procedures
     provided by default with the Sun Grid Engine distribution,
     as well as the way they are invoked in the ckpt_command,
     migr_command or restart_command parameters of any default
     checkpointing environment, should not be changed. Otherwise,
     their functionality remains the full responsibility of the
     administrator configuring the checkpointing environment. Sun
     Grid Engine will just invoke these procedures and evaluate
     their exit status. If the procedures do not perform their
     tasks properly or are not invoked in a proper fashion, the
     checkpointing mechanism may behave unexpectedly; Sun Grid
     Engine has no means to detect

     sge_intro(1), sge_ckpt(1), sge_types(1), qconf(1), qmod(1),
     qsub(1), sge_execd(8).

     See sge_intro(1) for a full statement of rights and
     permissions.
