NAME
Sun Grid Engine Checkpointing - the Sun Grid Engine
checkpointing mechanism and checkpointing support
DESCRIPTION
Sun Grid Engine supports two levels of checkpointing: the
user level and a transparent level provided by the operating
system. User-level checkpointing refers to applications that
do their own checkpointing by writing restart files at
certain times or algorithmic steps and by properly
processing these restart files when restarted.
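
The user-level scheme described above can be sketched as
follows. This is a minimal illustration, not part of Sun Grid
Engine: the function names, the JSON restart format and the
stop_after parameter (which simulates an abort) are all
assumptions made for the example.

```python
import json
import os
import tempfile


def save_restart(path, state):
    """Write the restart file atomically so an abort never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp, path)  # atomic rename within one file system


def load_restart(path):
    """Return the saved state, or a fresh state if no restart file exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, stop_after=None):
    """Sum 0..n_steps-1, writing a restart file every few steps.

    stop_after (hypothetical) simulates the job being aborted part-way.
    """
    state = load_restart(path)
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return None  # simulated abort; work since the last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_restart(path, state)
    save_restart(path, state)
    return state["total"]
```

A job that is aborted and later restarted simply calls run()
again with the same restart file path and continues from the
last saved step.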
Transparent checkpointing has to be provided by the
operating system and is usually integrated in the operating
system kernel. An example of a kernel-integrated
checkpointing facility is the Hibernator package from
Softway for SGI IRIX platforms.
Checkpointing jobs must be identified to the Sun Grid Engine
system with the -ckpt option of the qsub(1) command. The
argument to this option refers to a so-called checkpointing
environment, which defines the attributes of the
checkpointing method to be used (see checkpoint(5) for
details). Checkpointing environments are set up with the
qconf(1) options -ackpt, -dckpt, -mckpt and -sckpt. The
qsub(1) option -c can be used to override the when
attribute of the referenced checkpointing environment.
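
As an illustration of the submission options just described,
the sketch below assembles a qsub command line. Only the
-ckpt and -c options come from this page; the helper name,
the environment name "my_ckpt" and the script name "job.sh"
are hypothetical.

```python
def build_qsub_command(script, ckpt_env, when=None):
    """Return the argument list for submitting a checkpointing job.

    ckpt_env: name of an existing checkpointing environment
              (hypothetical in this example).
    when:     optional occasion specifier passed to -c, overriding
              the environment's "when" attribute.
    """
    command = ["qsub", "-ckpt", ckpt_env]
    if when is not None:
        command += ["-c", when]
    command.append(script)
    return command
```

On a host where Sun Grid Engine is installed, the returned
list could be handed to subprocess.run(); for example,
build_qsub_command("job.sh", "my_ckpt", when="x") requests
the x occasion specifier mentioned under the abort
conditions.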
If a queue is of the type CHECKPOINTING, jobs must have the
checkpointing attribute set (see the -ckpt option to
qsub(1)) to be permitted to run in that queue. Unlike
regular batch jobs, checkpointing jobs are aborted under
conditions in which batch or interactive jobs are suspended
or even remain unaffected. These conditions are:
o Explicit suspension of the queue or job via qmod(1) by
the cluster administration or a queue owner, if the x
occasion specifier (see qsub(1) -c and checkpoint(5)) was
assigned to the job.
o A load average value exceeding the suspend threshold
configured for the corresponding queues (see
queue_conf(5)).
o Shutdown of the Sun Grid Engine execution daemon
sge_execd(8) that is responsible for the checkpointing job.
After being aborted, the jobs migrate to other queues unless
they were submitted to one specific queue by an explicit
user request. This migration of jobs results in dynamic load
balancing. Note: Aborting a checkpointing job frees all
resources (memory, swap space) that the job occupies at
that time. This is in contrast to suspended regular jobs,
which continue to occupy swap space.
RESTRICTIONS
At present, when a job migrates to a queue on another
machine, no files are transferred automatically to that
machine. This means that all files used throughout the
entire job, including restart files, executables and scratch
files, must be visible on that machine or transferred
explicitly (e.g. at the beginning of the job script).
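
One way to perform such an explicit transfer at the start of
a job is sketched below. It assumes a directory that is
visible from every execution host; the function name and the
directory arguments are hypothetical and not part of Sun
Grid Engine.

```python
import os
import shutil


def stage_files(required, shared_dir, work_dir):
    """Ensure each required file is present in work_dir.

    Files already visible in work_dir are left alone; missing ones
    are copied from shared_dir. Returns the names that could not
    be found in either place.
    """
    os.makedirs(work_dir, exist_ok=True)
    missing = []
    for name in required:
        target = os.path.join(work_dir, name)
        if os.path.exists(target):
            continue  # already visible on this machine
        source = os.path.join(shared_dir, name)
        if os.path.exists(source):
            shutil.copy2(source, target)
        else:
            missing.append(name)
    return missing
```

A job would call stage_files() before doing any work and
stop early if the returned list is not empty.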
There are also some practical limitations on the use of disk
space by transparently checkpointing jobs. Checkpoints of a
transparently checkpointed application are usually stored in
a checkpoint file or directory by the operating system. The
file or directory contains all the text, data and stack
space of the process, along with some additional control
information. This means that jobs which use a very large
virtual address space will generate very large checkpoint
files. In addition, the workstations on which the jobs
actually execute may have little free disk space. Thus it is
not always possible to transfer a transparent checkpointing
job to a machine, even though that machine is idle. Since
jobs with large virtual memory must wait for a machine that
is both idle and has a sufficient amount of free disk space,
such jobs may suffer long turnaround times.
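
The disk space constraint can be made concrete with a rough
check like the one below. It is only a sketch: the function
names are hypothetical, and the 10% allowance for control
information is an assumed figure, not one given by Sun Grid
Engine or any particular operating system.

```python
import shutil


def fits_checkpoint(free_bytes, vm_bytes, overhead=0.1):
    """True if free disk space covers the job's virtual address
    space plus an assumed overhead for control information."""
    return free_bytes >= vm_bytes * (1 + overhead)


def host_can_accept(checkpoint_dir, vm_bytes, overhead=0.1):
    """Apply the check to the file system holding checkpoint_dir."""
    free = shutil.disk_usage(checkpoint_dir).free
    return fits_checkpoint(free, vm_bytes, overhead)
```

A machine that fails this check cannot take the job even
when it is otherwise idle, which is why jobs with a large
virtual address space may wait longer.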
SEE ALSO
sge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5),
Sun Grid Engine Installation and Administration,
Sun Grid Engine User's Guide
COPYRIGHT
See sge_intro(1) for a full statement of rights and
permissions.