Relocating Jobs From a User's Workstation



Grid Engine can be configured to relocate a running job. This is useful when the execution host is a desktop system that should only run jobs while its owner is not actively working with it. The setup uses the checkpointing facility of Grid Engine to kill the job and restart it elsewhere when the user returns and moves the mouse or presses a key.

To set up this configuration, the following steps are needed:

1) Configure Grid Engine to track interactive idle time
2) Configure the checkpointing interface
3) Add the checkpoint ability to the appropriate queues
4) Set the load threshold in the queues to trigger the relocation

1) Configure Grid Engine to track interactive idle time

Please see the following application note: Tracking Interactive Idle Time

2) Configure the checkpointing interface

The checkpointing interface needs to be created. This can be done in qmon:

  • Click on the "Checkpoint Configuration" icon

  • Click "Add"

  • Name the checkpointing interface (you will be using this name when submitting the job)

  • Select the interface, USERDEFINED

  • Select the queues to which the checkpointing interface should be attached

  • Leave all other fields blank (a job with real checkpointing support would need actual entries here)

  • Select "On Job Suspend" to relocate the job after it is suspended

  • Click "Ok"

  • Click "Done" to close this dialog box
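
The same interface can probably also be defined without qmon. The following is a minimal sketch, assuming the checkpoint(5) file format and the "qconf -Ackpt" option of your Grid Engine release. The name "reloc", the NONE entries standing in for the blank qmon fields, and the "when x" value (understood to mean "On Job Suspend") are assumptions to verify against "qconf -sckpt" output on your own installation. Some older releases also expect a queue_list line, which is omitted here.

```shell
#!/bin/sh
# Sketch: write a checkpoint definition for a hypothetical interface "reloc".
# Field names are assumed from the checkpoint(5) format; compare them with
# `qconf -sckpt <name>` on your installation before relying on them.
ckpt_name="reloc"
ckpt_file="${TMPDIR:-/tmp}/${ckpt_name}.ckpt"

cat > "$ckpt_file" <<EOF
ckpt_name          $ckpt_name
interface          USERDEFINED
ckpt_command       NONE
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           /tmp
signal             NONE
when               x
EOF

# Register the definition only when the Grid Engine client tools are present.
if command -v qconf >/dev/null 2>&1; then
    qconf -Ackpt "$ckpt_file"
else
    echo "qconf not found; wrote $ckpt_file for manual review"
fi
```

Writing the definition to a file first makes it easy to review the settings before they are loaded.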

3) Add the checkpoint ability to the appropriate queues

In qmon, modify the appropriate queues to give them the checkpointing ability:

  • Click "Queue Control"

  • Select a queue and click "Modify"

  • Under "Type" on the "General Configuration" tab, check "Checkpointing"

  • Click "Ok"
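
To confirm the change from the command line, the queue configuration printed by "qconf -sq" can be inspected. The sketch below assumes a release in which the queue type appears on a "qtype" line that lists CHECKPOINTING. Newer releases may instead attach checkpoint interfaces through a separate queue attribute, so treat the parsing as illustrative. The queue name "desktop.q" is hypothetical.

```shell
#!/bin/sh
# has_ckpt_type: read a queue configuration (as printed by `qconf -sq`)
# on stdin and succeed if its qtype line lists CHECKPOINTING.
has_ckpt_type() {
    awk '$1 == "qtype" {
             for (i = 2; i <= NF; i++)
                 if (toupper($i) == "CHECKPOINTING") found = 1
         }
         END { exit found ? 0 : 1 }'
}

queue="${1:-desktop.q}"   # hypothetical queue name

# Query the queue only when the Grid Engine client tools are present.
if command -v qconf >/dev/null 2>&1; then
    if qconf -sq "$queue" | has_ckpt_type; then
        echo "$queue: checkpointing enabled"
    else
        echo "$queue: checkpointing NOT enabled"
    fi
fi
```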

4) Set the load threshold in the queues to trigger the relocation

In qmon "Queue Control", select an appropriate queue and click "Modify". On the "Load/Suspend Thresholds" tab, add an entry to the current load thresholds by clicking the heading labelled "Load" and selecting the idle time resource. Enter the desired value in the value column.
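
A command-line version of this step might look like the sketch below. It builds a new threshold list by appending an entry to the existing one and then applies it with "qconf -mattr". The queue name "desktop.q", the resource name "idle", and the value 60 are hypothetical placeholders; use the resource defined when idle time tracking was set up. The attribute name load_thresholds follows the wording of this step. The same qmon tab also holds suspend thresholds, so check which attribute matches your configuration.

```shell
#!/bin/sh
# add_threshold CURRENT NAME VALUE
# Print CURRENT with NAME=VALUE appended, comma-separated. A CURRENT of
# NONE (or an empty string) is replaced rather than extended.
add_threshold() {
    case "$1" in
        ""|NONE|none) echo "$2=$3" ;;
        *)            echo "$1,$2=$3" ;;
    esac
}

queue="desktop.q"            # hypothetical queue name
current="np_load_avg=1.75"   # assumed existing threshold setting
new=$(add_threshold "$current" "idle" "60")   # hypothetical resource and value
echo "$new"

# Apply the new list only when the Grid Engine client tools are present.
if command -v qconf >/dev/null 2>&1; then
    qconf -mattr queue load_thresholds "$new" "$queue"
fi
```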

When submitting a job that is eligible to be moved, specify the checkpointing interface. For example, if the interface created above was named "reloc", submit the job as follows:

qsub -ckpt reloc myjob.sh


The job is eligible to run in any queue that has the checkpointing ability. If the job is subsequently suspended (for example, when the queue it is running in is suspended because the user returned and reset the interactive idle time), it is killed and then requeued.
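
Because no real checkpoint is taken in this setup, a requeued job starts again from the beginning. A job script can at least detect that it has been restarted. The following is a minimal sketch of a "myjob.sh", assuming the RESTARTED and JOB_ID environment variables that Grid Engine documents for job scripts; the function name start_mode is purely illustrative.

```shell
#!/bin/sh
# myjob.sh: sketch of a job script that notices when it has been requeued.

# start_mode: report whether this run is a first start or a restart.
# RESTARTED is assumed to be 1 for a restarted job and 0 otherwise; it is
# treated as 0 when unset (for example, when run outside Grid Engine).
start_mode() {
    if [ "${RESTARTED:-0}" = "1" ]; then
        echo "restart"
    else
        echo "fresh"
    fi
}

echo "job ${JOB_ID:-unknown} on $(uname -n): $(start_mode) start"

# The real work goes here. On a restart it begins again from the top
# unless the job saves and reloads its own state.
```

A job that writes intermediate results to a shared directory could use the "restart" case to skip work that has already been completed.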