NAME
     Sun Grid Engine Checkpointing - the Sun Grid  Engine  check-
     pointing mechanism and checkpointing support

DESCRIPTION
     Sun Grid Engine supports two levels  of  checkpointing:  the
     user  level  and  a  operating  system  provided transparent
     level. User  level  checkpointing  refers  to  applications,
     which do their own checkpointing by writing restart files at
     certain times or algorithmic steps and by properly  process-
     ing these restart files when restarted.

     Transparent checkpointing has to be provided by the  operat-
     ing system and is usually integrated in the operating system
     kernel. An example for  a  kernel  integrated  checkpointing
     facility is the Hibernator package from Softway for SGI IRIX
     platforms.

     Checkpointing jobs need to be identified  to  the  Sun  Grid
     Engine  system by using the -ckpt option of the qsub1() com-
     mand. The argument to this flag refers to a so called check-
     pointing  environment,  which  defines the attributes of the
     checkpointing method  to  be  used  (see  checkpoint5()  for
     details).   Checkpointing  environments  are  setup  by  the
     qconf1() options -ackpt,  -dckpt,  -mckpt  and  -sckpt.  The
     qsub1()  option  -c can be used to overwrite the when attri-
     bute for the referenced checkpointing environment.

     If a queue is of the type CHECKPOINTING, jobs need  to  have
     the checkpointing attribute flagged (see the -ckpt option to
     qsub1()) to be permitted to run in such a queue. As  opposed
     to  the  behavior for regular batch jobs, checkpointing jobs
     are aborted under conditions, for which batch or interactive
     jobs are suspended or even stay unaffected. These conditions
     are:

     o  Explicit suspension of the queue or job  via  qmod1()  by
        the  cluster  administration  or  a  queue owner if the x
        occasion specifier (see qsub1() -c and checkpoint5()) was
        assigned to the job.

     o  A load average value exceeding the suspend  threshold  as
        configured    for    the    corresponding   queues   (see
        queue_conf5().)

     o  Shutdown  of  the  Sun  Grid  Engine   execution   daemon
        sge_execd8() being responsible for the checkpointing job.

     After abortion, the jobs will migrate to other queues unless
     they  were  submitted  to  one specific queue by an explicit
     user request.  The migration of jobs leads to a dynamic load
     balancing.   Note:  The  abortion  of checkpointed jobs will
     free all resources (memory, swap space) which the job  occu-
     pies  at  that  time.  This  is opposed to the situation for
     suspended regular jobs, which still cover swap space.

RESTRICTIONS
     When a job migrates to a queue on another machine at present
     no files are transferred automatically to that machine. This
     means that all files which are used  throughout  the  entire
     job  including  restart files, executables and scratch files
     must be visible  or  transferred  explicitly  (e.g.  at  the
     beginning of the job script).

     There are also some practical limitations regarding  use  of
     disk space for transparently checkpointing jobs. Checkpoints
     of a  transparently  checkpointed  application  are  usually
     stored  in  a  checkpoint file or directory by the operating
     system. The file or directory contains all the  text,  data,
     and  stack space for the process, along with some additional
     control information. This means jobs which use a very  large
     virtual  address  space  will generate very large checkpoint
     files. Also the workstations on which the jobs will actually
     execute  may  have  little  free  disk space. Thus it is not
     always possible to transfer a transparent checkpointing  job
     to  a machine, even though that machine is idle. Since large
     virtual memory jobs must wait for a  machine  that  is  both
     idle,  and  has a sufficient amount of free disk space, such
     jobs may suffer long turnaround times.

SEE ALSO
     sge_intro1(,) qconf1(,) qmod1(,) qsub1(,) checkpoint5(,) Sun
     Grid  Engine Installation and Administration Sun Grid Engine
     User's Guide

COPYRIGHT
     See sge_intro1() for a full statement of rights and  permis-
     sions.

















Man(1) output converted with man2html