Migration of Qmaster to Another Machine

Overview

1. Check that the new master host has read/write access
2. Run the migrate script on the new master host
3. Manually migrate the qmaster host
4. Modify the shadow_masters file if necessary
 

1) Check that the new master host has read/write access

The new master host must have the same read/write access to the qmaster spool directory and the common directory as the current master host. If the administrative user is root (check the admin_user setting in the global cluster configuration), verify that root can create files in these directories under its own user name.
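
One way to carry out this check is to try to create and remove a small probe file in each directory while logged in as the administrative user on the new master host. The following is a minimal sketch only: check_rw_dir is a hypothetical helper, not part of Grid Engine, and the directories to test depend on your installation.

```shell
#!/bin/sh
# check_rw_dir is a hypothetical helper (not shipped with Grid Engine).
# It reports whether the current user can create, read and remove a file
# in the directory given as the first argument.
check_rw_dir() {
    dir=$1
    probe="$dir/.rw_probe.$$"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # The subshell keeps a failed redirection from aborting the calling shell.
    if ( : > "$probe" ) 2>/dev/null && [ -r "$probe" ] && rm -f "$probe"; then
        echo "ok: $dir"
        return 0
    fi
    echo "no read/write access: $dir"
    return 1
}
```

For example, run check_rw_dir "$SGE_ROOT/$SGE_CELL/common" and then the same check on the qmaster spool directory. The spool directory location is site specific, so take it from your own configuration rather than assuming a path.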
 

2) Run the migrate script on the new master host (see Notes 1, 2, 3)

On the new master host, run the following script as user root:

# /etc/init.d/rcsge -migrate
This will stop sge_qmaster and sge_schedd on the old master host and start them up on the new master host. The master host name listed in the file:
$SGE_ROOT/$SGE_CELL/common/act_qmaster
is automatically changed to the new master host. If qmaster is not running, warning messages are printed and there is a delay of approximately 60 seconds before qmaster is started on the new host.
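
After the script finishes, you may want to confirm that act_qmaster now names the new master host. The sketch below assumes the file holds the host name on its first line. read_act_qmaster and master_is are hypothetical helper names used for illustration only.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Grid Engine).
# Assumption: the act_qmaster file holds the master host name on its first line.

# Print the host name recorded in the act_qmaster file given as $1.
read_act_qmaster() {
    head -n 1 "$1" | tr -d '[:space:]'
}

# Succeed if the act_qmaster file ($1) names the expected host ($2).
master_is() {
    [ "$(read_act_qmaster "$1")" = "$2" ]
}
```

For example, master_is "$SGE_ROOT/$SGE_CELL/common/act_qmaster" "$new_host" && echo "migration recorded", where new_host is a variable you set to the new master host's name.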
 

3) Manually migrate the qmaster host

It is also possible to change the qmaster host manually. Stop the master and scheduler daemons on the current master host by running the following command:

# qconf -ks -km
Then edit $SGE_ROOT/$SGE_CELL/common/act_qmaster


In the act_qmaster file, replace the current host name with the name of the new master host. This name must be the same as the name returned by the gethostname utility. Run the following command on the new master host:

# $SGE_ROOT/utilbin/<arch>/gethostname
Put this name in the act_qmaster file in place of the old name.
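
If you prefer to script the edit rather than use an editor, the following sketch writes the new name into the file and keeps a backup of the previous contents. set_act_qmaster is a hypothetical helper, and the sketch assumes act_qmaster contains nothing but the host name. Pass it the name reported by gethostname on the new master host.

```shell
#!/bin/sh
# set_act_qmaster is a hypothetical helper (not part of Grid Engine).
# Assumption: act_qmaster contains only the master host name.
# $1 = path to act_qmaster, $2 = new master host name.
set_act_qmaster() {
    file=$1
    new_host=$2
    [ -n "$new_host" ] || { echo "empty host name" >&2; return 1; }
    [ -f "$file" ] || { echo "not found: $file" >&2; return 1; }
    # Keep the old contents so the change can be undone by hand.
    cp -p "$file" "$file.bak" || return 1
    printf '%s\n' "$new_host" > "$file"
}
```

For example: set_act_qmaster "$SGE_ROOT/$SGE_CELL/common/act_qmaster" "$new_host". Run it only after the old qmaster and scheduler have been stopped.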

Then run rcsge on the new master host:

# $SGE_ROOT/default/common/rcsge -qmaster
This will start up sge_qmaster and sge_schedd on the new master host.
 

4) Modify the shadow_masters file if necessary (see Note 4)

Check if the following file exists:

$SGE_ROOT/$SGE_CELL/common/shadow_masters
If it does exist you can add the new qmaster host to this file and remove the old master host, depending on your requirements. Then stop and restart the sge_shadowd daemons by issuing the following commands on the respective machines:
/etc/init.d/rcsge -shadowd stop
/etc/init.d/rcsge -shadowd start
(The location of the system-wide rcsge startup script may differ on your operating system. You can always use $SGE_ROOT/default/common/rcsge.)
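
Swapping the old master host for the new one in shadow_masters can also be scripted. The sketch below assumes the file lists one host name per line and replaces only lines that match the old name exactly. replace_shadow_host is a hypothetical helper, not a Grid Engine command.

```shell
#!/bin/sh
# replace_shadow_host is a hypothetical helper (not part of Grid Engine).
# Assumption: shadow_masters lists one host name per line.
# $1 = path to shadow_masters, $2 = old host name, $3 = new host name.
replace_shadow_host() {
    file=$1
    old=$2
    new=$3
    tmp="$file.tmp.$$"
    [ -f "$file" ] || { echo "not found: $file" >&2; return 1; }
    # Rewrite exact matches of the old name; copy all other lines unchanged.
    awk -v old="$old" -v new="$new" \
        '$0 == old { print new; next } { print }' "$file" > "$tmp" &&
        mv "$tmp" "$file"
}
```

Because the file is rewritten through a temporary copy, check its ownership and permissions afterwards, then restart the sge_shadowd daemons as described above.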

Notes

1. The migration procedure migrates to the host on which the "rcsge -migrate" command is issued. If the file primary_qmaster exists, any subsequent call of rcsge on the machine named in the primary_qmaster file will cause a migration back to that machine. To avoid this, change or delete the file:

$SGE_ROOT/$SGE_CELL/common/primary_qmaster
Existence of the primary_qmaster file does not imply that the qmaster is actually running.

2. Jobs may continue to run during the migration procedure; however, it is prudent to migrate while the grid is inactive. While the migration is taking place, SGE commands such as qsub or qstat will return an error.

3. If the current qmaster is down, there will be a delay in shutting down the scheduler until it times out waiting for contact with the qmaster.

4. The shadow_masters file has no direct effect on the migration procedure. This file exists only if one or more shadow masters have been configured. For more information on how to set up shadow masters, see the HOWTO "Setting Up A Shadow Master In Grid Engine".