JGrid 0.2.1: an RMI-based Java interface for Grid Engine

Overview

This document will attempt to explain what the JGrid package is and does, how to install it, how to configure it and how to develop with it. This software is still in an early state of development. The intention is to provide a transactional Java interface to the Sun Grid Engine product. This is accomplished by intercepting RMI communications on the wire and feeding them through the grid engine. The net result is that a Java client can make use of the grid without knowing anything about the grid. More importantly, because the JGrid package intercept live RMI calls, the Java client is also to pass live objects to the grid and receive live objects in response. What's more, because of how the serialized RMI call is deserialized, there's no JVM startup overhead; the Java process which executes the job on the execution hosts is always running.

Contents

  1. Introduction

  2. Installation

  3. Configuration

  4. Testing

  5. Development

  6. Futures

  7. Disclaimer


The JGrid package is now available via CVS. You can browser the source here.


Introduction

To begin with, this document will explain the concepts behind how JGrid works.

The Architecture

The JGrid package has three major components, the server, the native peer, and the agent.

The Server

The Server is a Java server which uses smart RMI proxies to intercept RMI calls and feed them through the grid. It uses two RMI registries: one for the interface to the client, and one for the interface to the agent.

The Native Peer

The job the server submits to the grid is to run the native peer on the execution host. The native peer is a native binary that connects to the agent and requests the Java job be run. The native peer remains running as long as the Java job runs so that SGE has a handle to use for dealing with the Java job.

The Agent

The agent is a Java server that runs on each execution host. The native peer contacts the agent via a custom protocol (JCEP: Java Compute Engine Protocol) and sends the Java job request. The agent then executes the requested job and returns the results to the server via RMI. The server can then return those results to the client.

Advantages

This architecture has several advantages. The first and foremost is that it allows a Java client to interact with the grid via a well known Java programming model, i.e. RMI, in a transactional context. Also, because RMI allows for code mobility the code to be executed on the client's behalf does not have to be known the server. As long as the class files are posted in an accessible location, the server can dynamically retrieve the binaries need to execute the client's request. The server allows clients to submit both synchronous and asynchronous jobs. The native peer saves the time and memory overhead associated with starting a new VM for every request. Because the compute client is always running, each execution host only has to start the VM once.


Installation

Before you begin, J2SE 1.4 must be installed on all the systems in the grid. First install Sun Grid Engine. Next, make sure each host has access to the JGrid.jar file, either through a shared filesystem or through individual copies. Next choose a host to be the proxy server. The proxy host is the only host through which Java clients will be able to submit jobs. On the proxy host run the server by typing:

java -cp JGrid.jar com.sun.grid.jgrid.proxy.ComputeServer

Next, make sure each execution host has access to the native peer (found in proxy.jar under com/sun/grid/skel/JCEPskeleton) and the skel.sh script. On each execution host run the agent by typing:

java -cp JGrid.jar com.sun.grid.jgrid.server.JCEPHandler -reghost hostname

This, of course, is not likely to work out of the box. In order for JGrid to function, read the next section on configuring the system.

Note that the agents must be started after the server.


Configuration

Configuring the Server

The server has a variety of command line switches to control its behavior. The most useful are: -sub, -skel, and -d.

-sub

The -sub switch allows the command used to submit a job to the queue to be set. In not used, the submission command defaults to "qsub -notify". If path information needs to be included or another command must be used, use -sub. For example:

java com.sun.grid.jgrid.proxy.ComputeServer -sub "/sge/bin/solaris64/qsub -notify"

-skel

The -skel switch allows the command to be executed on the execution host to be set. If not used, the command defaults to "skel.sh". If path information needs to be included or another command must be used, use -skel. For example:

java com.sun.grid.jgrid.proxy.ComputeServer -skel /sge/jgrid/skel.sh

-d

The -d switch allows the job file directory to be set. If not used, the job directory defaults to "ser". If another directory is desired, use -d. For example:

java com.sun.grid.jgrid.proxy.ComputeServer -d /sge/jgrid/jobs

Other Switches

For more information about command lines switches for the server type:

java com.sun.grid.jgrid.proxy.ComputeServer -help

Configuring the Native Peer

The native peer binary cannot be run directly by SGE, so a script, skel.sh, has been included which starts up the skeleton. To change the job file directory used by the skeleton, edit the script and change the WORK_DIR variable.

The native peer can also be used independent of SGE for testing purposes mostly. For more information, run JCEPskeleton (or skel.sh) with no parameters.

Configuring the Agent

The agent has a variety of command line switches to control its behavior. The most useful are: -reghost and -grace.

-reghost

The -reghost switch tells the agent on which host to find the RMI registry containing the result channel. For example:

java com.sun.grid.jgrid.server.JCEPHandler -reghost master.germany.sun.com

-grace

Since JGrid allows the submission of arbitrary jobs, it must be assumed that not all jobs will be well behaved. Since Java does not provide (anymore) a facility for killing an executing thread without killing the whole VM, canceling a job consists of asking the job nicely to stop executing. If it refuses, there's nothing that can be done about it short of stopping the VM. Additionally, the Job class is written such that a call to cancel() only "guarantees" that the job will stop "as soon as possible". Unfortunately, there really is no way to tell a job that is slow to stop from a job that has gone rogue and will never stop. To deal with this problem, the JCEPHandler uses a grace period when shutting down. The JCEPHandler will ask each running job to stop and then wait for a set amount of time before explicitly calling System.exit() to stop the VM. The amount of time to wait is set with the -grace switch. For example, to set the grace period to 30 seconds one would use:

java com.sun.grid.jgrid.server.JCEPHandler -grace 30

Other Switches

For more information about command lines switches for the agent type:

java com.sun.grid.jgrid.server.JCEPHandler -help

Configuring the VM

The server requires access to several directories and much of the network. To avoid copious security errors, a modified security policy file must be specified when running the server. An appropriately modified security file is included in the tar file. If the SGE install and the job file directory are somewhere other than in /sge, the policy file will need to be modified. To set the security file, type:

java -Djava.security.policy=/sge/java.policy com.sun.grid.jgrid.proxy.ComputeServer

Testing

To test if the system is working, do the following:

On the proxy host type:

java -Djava.security.policy=/sge/jgrid/java.policy -cp JGrid.jar:client.jar \
     com.sun.grid.jgrid.proxy.ComputeServer -sub /sge/bin/solaris64/qsub \
     -skel "/sge/jgrid/skel.sh submit" -d /sge/jgrid/jobs

Replace the path information with information that is correct for grid. When the server indicates it's ready, on the execution host type:

java -Djava.security.policy=/sge/jgrid/java.policy -cp JGrid.jar:client.jar \
     com.sun.grid.jgrid.server.JCEPHandler -reghost proxy_host

Replace proxy_host with the name of the server on which the server is running. When the server indicates it's ready, on a client type:

java -Djava.security.policy=/sge/jgrid/java.policy -cp JGrid.jar:client.jar \
     dant.test.ComputeClient proxy_host 1099

Replace proxy_host with the name of the server on which the server is running. If everything is working, the execution host should print "This is a test" in the job output file (skel.sh.o[0-9]+) along with some lifecycle messages and the client should indicate that it received the value "TEST" from the server. If everything isn't working, try rerunning the above commands with the -debug switch to see copious amounts of debug output that will hopefully tell you something useful. You may have to kill the server and agent (CTRL-C) to get them to print their output.

To test if the code mobility is working, do the following:

Make the client.jar available via http. On the proxy host type:

java -Djava.security.policy=/sge/jgrid/java.policy -cp JGrid.jar \
     com.sun.grid.jgrid.proxy.ComputeServer -sub /sge/bin/solaris64/qsub \
     -skel "/sge/jgrid/skel.sh submit" -d /sge/jgrid/jobs

Replace the path information with information that is correct for grid. When the server indicates it's ready, on the execution host type:

java -Djava.security.policy=/sge/jgrid/java.policy -cp JGrid.jar
     com.sun.grid.jgrid.server.JCEPHandler -reghost proxy_host

Replace proxy_host with the name of the server on which the server in running. When the server indicates it's ready, on a client type:

java -Djava.security.policy=/sge/jgrid/java.policy -Djava.rmi.server.codebase=url \
     -cp JGrid.jar:client.jar dant.test.ComputeClient proxy_host 1099

Replace proxy_host with the name of the server on which the server in running. Replace url with the fully qualified URL where the client.jar is stored. If everything is working, the execution host should print "This is a test" and the client should indicate that it received the value "TEST" from the server. The difference between this test and the previous test is that in the previous test the server and compute agents had client.jar included in their classpaths. In this test, instead of including client.jar in the classpath, client.jar was made available via http and exported by the client via the codebase property. The ComputeTest object that the compute engine executed came from code that was moved dynamically over the network.

Note that whenever you explicitly set the submission command with -sub, if you want your jobs to be suspendable and cancelable, you have to give the command as "/path/qsub -notify". This is required because the native peer needs time to relay the command to the job before itself being suspended or killed. Similarly, whenever setting the native peer via the -skel switch, you must always give it a parameter, normally "submit".


Development

Developing with the JGrid is easy. A developer only needs to know two interfaces: com.sun.grid.jgrid.Computable and com.sun.grid.jgrid.ComputeEngine. A complete API doc is included in the tar.

The Computable interface must be implemented by any object passed to JGrid for execution. Computable inherits from Serializable, so any Computable class must also be Serializable. The Computable interface has five methods: compute(), checkpoint(), cancel(), suspend(), and resume ().

The compute() method is called by the compute engine when the job is assigned to an execution host. If a problem occurs during execution, the Computable class can throw a ComputeException. If everything goes to plan, the compute() method returns a Serializable results object.

The cancel() method is used to stop the job. The job does not have to be stopped when the cancel() method returns. It only has to be in the process of stopping. This is currently to prevent the job from having to implement complicated synchronizations. This will change soon.

The checkpoint() method forces the job to make sure it's serializable attributes are up to date. The assumption is that a job that is checkpointable will do it's work in transient variables and store it's state in non-transient variables, preferably only after a checkpoint() call. In this way, the job's state can be keep consistent. The downside is that any work that gets done between the serialization of the job and the cancelation of the job gets lost.

The suspend() and resume() methods are used to suspend and resume jobs, respectively. The suspend() method, like cancel(), is not required to leave the job in a suspended state, only in a suspending state. Like with cancel(), this will also change soon.

cancel(), suspend(), and checkpoint() can all throw a NotInterruptableException is the job does not implement that facility. A job that cannot be canceled can only be killed by stopping the VM.

The ComputeEngine interface is implemented by the server. The ComputeEngine interface allows clients to submit jobs synchronously or asynchronously. To submit a job synchronously, the client calls the compute() method, passing it a Computable object. The method returns the Serializable results object. To submit a job asynchronously, the client calls the computeAsynch() method, passing a Computable object. The method returns the job id. The client can then call isComplete(), passing the job id, to find out if the job has completed. The client can also call getResults(), passing the job id, or get the Serializable results object if the job has completed.

To get a reference to the ComputeEngine, the client must connect to the RMI registry on the proxy host and do a lookup for "ComputeEngine".

See the API docs for more info.


Futures

I finally have a pretty clear idea of where I'm going with JGrid. The following are the major things on my to-do list, in order of execution:


Disclaimer

JGrid is a product of my creation. Sun in no way supports, acknowledges or lends legitimacy to JGrid. I take no responsibility for any losses resulting from the use of Jgrid. Please direct any questions or comments to me at:

Daniel Templeton
dan.templeton@sun.com
+49 941/3075 220
Dr-Leo-Ritter-Strasse 7, D-93049 Regensburg