1. Check that the new master host has read/write access
2. Run the migrate script on the new master host
3. Manually migrate the qmaster host
4. Modify the shadow_masters file if necessary
1) Check that the new master host has read/write access
The new master host must have read/write access to the qmaster spool directory and the common directory, just as the current master does. If the administrative user is root (check the global cluster configuration for the setting of admin_user), check that root can create files in these directories under its own user name.
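The access check above can be sketched as a small shell function. This is a minimal sketch, not part of Grid Engine: the function name can_write_dir is hypothetical, and the spool path in the commented example is only an assumed default, so substitute the qmaster spool directory configured at your site.

```shell
# Return 0 if the invoking user can create a file in directory $1
# and that file ends up owned by the invoking user.
can_write_dir() {
    dir=$1
    [ -d "$dir" ] || return 1
    # mktemp creates the probe file; failure means no write access
    probe=$(mktemp "$dir/.sge_probe.XXXXXX" 2>/dev/null) || return 1
    owned=1
    # -O is true only if the file is owned by the effective user ID
    [ -O "$probe" ] && owned=0
    rm -f "$probe"
    return $owned
}

# Example, run as the administrative user on the new master host.
# The spool path is an assumption; adjust it to your installation.
# for d in "$SGE_ROOT/$SGE_CELL/common" "$SGE_ROOT/$SGE_CELL/spool/qmaster"; do
#     if can_write_dir "$d"; then echo "OK: $d"; else echo "NO ACCESS: $d"; fi
# done
```

The ownership test is there because a file system export that remaps root to another user may still allow the file to be created, but not under root's own name.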
2) Run the migrate script on the new master host (see Notes 1, 2 and 3)
On the new master host, run the following script as user root:
# /etc/init.d/rcsge -migrate
This will stop sge_qmaster and sge_schedd on the old master host and start them up on the new master host. The master host name listed in the file:
$SGE_ROOT/$SGE_CELL/common/act_qmaster
is automatically changed to the new master host. If qmaster is not running, warning messages will be printed and there will be a delay of approx. 60 seconds until qmaster is started on the new host.
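After the script finishes, you may want to confirm that act_qmaster now names the new host. The following is a minimal sketch under the assumption that the file holds the master host name on its first line; the function name act_qmaster_is is hypothetical and not a Grid Engine command.

```shell
# Return 0 if the first line of the act_qmaster file $1 equals host name $2.
# Returns 2 if the file cannot be read.
act_qmaster_is() {
    file=$1
    expected=$2
    [ -r "$file" ] || return 2
    current=$(head -n 1 "$file")
    [ "$current" = "$expected" ]
}

# Example (the host name is a placeholder):
# act_qmaster_is "$SGE_ROOT/$SGE_CELL/common/act_qmaster" newmaster \
#     && echo "act_qmaster points to the new master"
```

The comparison is an exact string match, so pass the name in the same form that the gethostname utility reports it.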
3) Manually migrate the qmaster host
It is also possible to change the qmaster host manually. Stop the master and scheduler daemon on the current master by running the following command:
# qconf -ks -km
Then edit the file:
$SGE_ROOT/$SGE_CELL/common/act_qmaster
and replace the current host name with the new master host's name. This name must be the same as the name returned by the gethostname utility. To obtain it, run the following command on the new master host:
# $SGE_ROOT/utilbin/<arch>/gethostname
Put this name in the act_qmaster file in place of the old name.
Run rcsge on the new master host:
# $SGE_ROOT/default/common/rcsge -qmaster
This will start up sge_qmaster and sge_schedd on the new master host.
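The manual edit of act_qmaster described above could be scripted roughly as follows. This is only a sketch: the function name set_act_qmaster is hypothetical, it assumes the file contains nothing but the master host name on one line, and it should be run only after the daemons on the old master have been stopped.

```shell
# Replace the host name stored in an act_qmaster file, keeping a backup copy.
#   $1 = path to the act_qmaster file
#   $2 = new master host name, as reported by the gethostname utility
set_act_qmaster() {
    file=$1
    new_host=$2
    [ -n "$new_host" ] || return 1      # refuse to write an empty name
    [ -f "$file" ] || return 1          # the file must already exist
    cp -p "$file" "$file.bak" || return 1
    printf '%s\n' "$new_host" > "$file"
}

# Example (the host name is a placeholder):
# set_act_qmaster "$SGE_ROOT/$SGE_CELL/common/act_qmaster" newmaster
```

The backup copy act_qmaster.bak makes it easy to restore the old entry if the new master fails to start.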
4) Modify the shadow_masters file if necessary (see Note 4)
Check if the following file exists:
$SGE_ROOT/$SGE_CELL/common/shadow_masters
If it does exist, you can add the new qmaster host to this file and remove the old master host, depending on your requirements. Then stop and restart the sge_shadowd daemons by issuing the following commands on the respective machines:
# /etc/init.d/rcsge -shadowd stop
# /etc/init.d/rcsge -shadowd start
(You can always use $SGE_ROOT/default/common/rcsge in place of /etc/init.d/rcsge.)
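If you decide to swap the old master for the new one in shadow_masters, the edit can be sketched as below. The function name swap_shadow_host is hypothetical, and the sketch assumes the file lists one host name per line. Whether the old host should be removed at all depends on your requirements, as noted above.

```shell
# In the shadow_masters file $1, replace host $2 with host $3.
# If $2 is not listed, $3 is appended; $3 never appears twice.
swap_shadow_host() {
    file=$1
    old=$2
    new=$3
    [ -f "$file" ] || return 1
    tmp=$(mktemp) || return 1
    awk -v old="$old" -v new="$new" '
        $0 == old { $0 = new }                  # swap old master for new one
        $0 == new { if (seen) next; seen = 1 }  # keep a single entry for it
        { print }
        END { if (!seen) print new }            # not listed yet: append
    ' "$file" > "$tmp" && cat "$tmp" > "$file"  # cat keeps owner and mode
    status=$?
    rm -f "$tmp"
    return $status
}

# Example (host names are placeholders):
# swap_shadow_host "$SGE_ROOT/$SGE_CELL/common/shadow_masters" oldmaster newmaster
```

Writing the result back with cat rather than mv leaves the ownership and permissions of the existing file unchanged.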
Notes
1. The migration procedure migrates to the host on which the "rcsge -migrate" command is issued. If the file primary_qmaster exists, any subsequent call of rcsge on the machine named in the primary_qmaster file will cause a migration back to that machine. To avoid such a situation, this file needs to be changed or deleted:
$SGE_ROOT/$SGE_CELL/common/primary_qmaster
Existence of the primary_qmaster file does not imply that the qmaster is actually running.
2. Jobs may continue to run during the migration procedure; however, it is prudent to migrate while the grid is inactive. While the migration is taking place, SGE commands such as qsub or qstat will return an error.
3. If the current qmaster is down, there will be a delay in shutting down the scheduler until it times out waiting for contact with the qmaster.
4. The shadow_masters file has no direct effect on the migration procedure. This file will only exist if one or more shadow masters have been configured. For more information on how to set up shadow masters, see the Howto "Setting Up A Shadow Master In Grid Engine".