Grid Engine Hadoop Integration
The Original Grid Engine Hadoop Integration
The SGE 6.2u5 release uses load sensors to monitor HDFS activity so that Grid Engine is aware of the data locality of Hadoop MapReduce clusters (there was also an earlier integration).
For details of this integration, see Configuring and Using the Hadoop Integration in the Oracle Grid Engine documentation.
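To illustrate the mechanism, here is a minimal sketch of a Grid Engine load sensor following the standard execd load-sensor protocol (answer each request on stdin with begin/end-delimited host:complex:value lines; exit on "quit"). The complex name hdfs_blocks and the way local HDFS state is probed are assumptions for illustration, not the actual 6.2u5 sensor:

    #!/bin/sh
    # Minimal Grid Engine load sensor sketch.  execd writes a line to
    # stdin to request a report; "quit" asks the sensor to exit.
    HOST=`hostname`
    while true; do
        read request
        if [ "$request" = "quit" ]; then
            exit 0
        fi
        # Placeholder probe: count block files under an assumed
        # DataNode directory (the real sensor would query HDFS).
        blocks=`ls /tmp/hadoop-datanode/current 2>/dev/null | wc -l`
        echo "begin"
        echo "$HOST:hdfs_blocks:$blocks"
        echo "end"
    done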
The On-Demand Hadoop Cluster Approach
At UC Cloud Summit 2011, a new Grid Engine Hadoop integration was presented to the public by Prakashan Korambath of UCLA. With this method, Grid Engine handles resource allocation as it would for any other Grid Engine PE (Parallel Environment) job, and the prolog, epilog, and the job itself work together to create a Hadoop cluster on demand.
- The prolog creates the configuration files needed to bring up a Hadoop cluster, using only the nodes allocated to the job (see the prolog sketch after this list).
- The job itself starts the HDFS and Hadoop services, runs the MapReduce operations, and then shuts the services down when it is done (see the job-script sketch after this list).
- The epilog cleans up the files created by the prolog.
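A minimal sketch of such a prolog, assuming Hadoop 0.20-style masters/slaves files and hypothetical core-site.xml.in / mapred-site.xml.in templates (the actual scripts ship in the integration tarball):

    #!/bin/sh
    # Prolog sketch: build per-job Hadoop config from the hosts that
    # Grid Engine allocated.  $TMPDIR and $PE_HOSTFILE are set by SGE.
    CONF_DIR=$TMPDIR/conf
    mkdir -p $CONF_DIR
    # $PE_HOSTFILE has one "host slots queue processors" line per
    # allocated node; the first field is the hostname.
    awk '{print $1}' $PE_HOSTFILE > $CONF_DIR/slaves
    # Let the first allocated host run the NameNode and JobTracker.
    MASTER=`head -1 $CONF_DIR/slaves`
    echo $MASTER > $CONF_DIR/masters
    # Point fs.default.name and mapred.job.tracker at the master
    # (conf-templates/*.in are hypothetical templates).
    sed "s/@MASTER@/$MASTER/" conf-templates/core-site.xml.in \
        > $CONF_DIR/core-site.xml
    sed "s/@MASTER@/$MASTER/" conf-templates/mapred-site.xml.in \
        > $CONF_DIR/mapred-site.xml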
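And a job-script sketch under the same assumptions, using the Hadoop 0.20-era start/stop scripts; HADOOP_HOME and the wordcount example are placeholders:

    #!/bin/sh
    #$ -pe shared 8
    #$ -q hadoop.q
    # Job sketch: bring the on-demand cluster up, run the MapReduce
    # work, then tear the cluster down.
    export HADOOP_CONF_DIR=$TMPDIR/conf   # written by the prolog
    $HADOOP_HOME/bin/hadoop namenode -format
    $HADOOP_HOME/bin/start-dfs.sh
    $HADOOP_HOME/bin/start-mapred.sh
    # Stage data in, run the MapReduce job, stage results out.
    $HADOOP_HOME/bin/hadoop fs -put $SGE_O_WORKDIR/input input
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar \
        wordcount input output
    $HADOOP_HOME/bin/hadoop fs -get output $SGE_O_WORKDIR/output
    $HADOOP_HOME/bin/stop-mapred.sh
    $HADOOP_HOME/bin/stop-dfs.sh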
The advantage of this approach is that Grid Engine does not need to know the specific details of Hadoop (such as speculative execution); all it needs to do is allocate the best resources for each job. And since the Hadoop job scheduler itself is used, one gets the full functionality of Hadoop.
The integration tarball is available here.
Notes:
- The integration assumes a PE named shared, which should be a Grid Engine Parallel Environment that can utilize processors on more than one host (an example PE definition appears after these notes).
- A cluster queue named hadoop.q is also needed.
- Further instructions are in the README.
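For reference, a PE definition along these lines could look as follows; the slot count and allocation rule are illustrative assumptions, so consult the README for the exact settings the integration expects:

    $ qconf -sp shared
    pe_name            shared
    slots              9999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $round_robin
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE

A job would then be submitted against this PE and queue, e.g.:

    $ qsub -pe shared 16 -q hadoop.q run-hadoop-job.sh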
Data Locality-aware On-Demand Hadoop Cluster
Currently under development - stay tuned!