With the Sun HPC ClusterTools™ 5 software release, Sun CRE
(Cluster Runtime Environment) provides close integration with
several distributed resource managers. In that integration,
Sun CRE retains most of its original functions but delegates
others to the resource manager.
The Sun HPC ClusterTools 5 Software Administrator's Guide describes in detail how this integration works and how to configure it with Sun Grid Engine, which is based on the Grid Engine open source community project.
We recommend the close integration for Sun HPC ClusterTools 5 software because it provides significantly better resource monitoring, control, and accounting of Sun MPI processes through Grid Engine commands than the loose integration introduced for the Sun HPC ClusterTools 3.1 and 4 releases.
However, running Sun MPI jobs in a Grid Engine environment requires appropriate suspend and resume methods for the Grid Engine queues. These methods (scripts) deliver SIGSTOP and SIGCONT signals to Sun MPI processes when Sun MPI jobs are suspended or resumed with Grid Engine commands such as "qmod -s $sge_jid" and "qmod -us $sge_jid". They are needed because Grid Engine and HPC ClusterTools differ in how they trap and deliver signals to their child processes. The enhancement package includes the following files:
The README file in this package describes all the other files and provides technical background on this enhancement, along with instructions for configuring the suspend and resume methods.
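To make the idea of a suspend or resume method concrete, the following is a minimal Python sketch of the signal delivery step only. It is not the script shipped in the enhancement package. It assumes the process IDs of the Sun MPI processes are already known and passed in. The names `signal_job` and `ACTIONS`, and the command-line form, are hypothetical.

```python
import os
import signal
import sys

# Map the two queue actions to the signals named in the text above.
ACTIONS = {"suspend": signal.SIGSTOP, "resume": signal.SIGCONT}


def signal_job(action, pids, send=os.kill):
    """Send SIGSTOP or SIGCONT to each pid.

    Returns the list of pids that no longer exist. The `send` parameter
    defaults to os.kill and is only there so the logic can be exercised
    without touching real processes.
    """
    sig = ACTIONS[action]
    missing = []
    for pid in pids:
        try:
            send(pid, sig)
        except ProcessLookupError:
            # The process exited before the signal could be delivered.
            missing.append(pid)
    return missing


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical invocation: python job_signal.py suspend 1234 1235
    # How the pids of the MPI processes are discovered is out of scope here.
    action = sys.argv[1]
    pids = [int(arg) for arg in sys.argv[2:]]
    sys.exit(1 if signal_job(action, pids) else 0)
```

A real method also has to find every Sun MPI process that belongs to the job. That is the part affected by the signal-handling difference described above, and the package README covers it.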
A loose integration package distributed with Grid Engine 5.3
makes it possible to loosely integrate Grid Engine with Sun
HPC ClusterTools software with little effort. The package works
for the Sun HPC ClusterTools 3.1 and 4 releases.
The loose integration package is located in the $SGE_ROOT/mpi/sunhpc/loose-integration directory after Grid Engine software is installed. It includes all the necessary files and an integration script. The README file in the package gives a detailed technical description of the loose integration and a step-by-step procedure for anyone who wants to implement it manually.