Grid Engine Hadoop Integration

The Original Grid Engine Hadoop Integration

The SGE 6.2u5 release uses load sensors to monitor HDFS activity so that Grid Engine is aware of data locality within MapReduce Hadoop clusters (an earlier integration also existed).

For details of this integration, see Configuring and Using the Hadoop Integration in the Oracle Grid Engine documentation.
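
As a rough illustration of the load sensor mechanism mentioned above, the sketch below follows the general Grid Engine load sensor protocol: the execution daemon writes a line to the sensor's standard input whenever it wants a report, the sensor answers with a block of host:name:value lines framed by "begin" and "end", and the sensor exits when it reads "quit". This is not the sensor shipped with SGE 6.2u5. The complex name hdfs_local_blocks and the collect_hdfs_values helper are hypothetical placeholders, and a real sensor would have to query HDFS for its values.

```python
import sys


def format_report(host, values):
    """Build one load report: 'begin', one host:name:value line per value, 'end'."""
    lines = ["begin"]
    for name, value in values.items():
        lines.append(f"{host}:{name}:{value}")
    lines.append("end")
    return "\n".join(lines) + "\n"


def run_sensor(host, collect, stdin=sys.stdin, stdout=sys.stdout):
    """Answer every line read from stdin with a report; stop on 'quit' or EOF."""
    for line in iter(stdin.readline, ""):
        if line.strip() == "quit":
            break
        stdout.write(format_report(host, collect()))
        stdout.flush()  # the daemon reads the report through a pipe


def collect_hdfs_values():
    """Hypothetical placeholder: a real sensor would ask HDFS which data is local."""
    return {"hdfs_local_blocks": 0}
```

A standalone sensor script would end with a call such as run_sensor(socket.gethostname(), collect_hdfs_values) and be registered through the load_sensor parameter of the host or global configuration; the reported complex also has to be defined in the complex configuration before the scheduler can use it.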

The On-Demand Hadoop Cluster Approach

At UC Cloud Summit 2011, Prakashan Korambath of UCLA presented a new Grid Engine Hadoop integration. With this method, Grid Engine handles resource allocation as it would for any other parallel environment (PE) job, and the prolog, the epilog, and the job itself work together to create a Hadoop cluster on demand. The advantage of this approach is that Grid Engine does not need to know Hadoop-specific details (such as speculative execution); it only needs to allocate the best resources for each job. Because the Hadoop job scheduler is still used, the full functionality of Hadoop remains available.

The integration tarball is available here.
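
The scripts in the tarball are not reproduced here. As a minimal sketch of the kind of work a prolog could do in the on-demand approach, the code below parses the contents of the file named by $PE_HOSTFILE (one line per allocated host, with the host name and slot count as the first two fields) and derives the contents of the masters and slaves files used by older Hadoop releases. The function names, the choice of the first host as master, and the decision to run workers on every host including the master are assumptions made for illustration only.

```python
def parse_pe_hostfile(text):
    """Return [(hostname, slots), ...] from the contents of $PE_HOSTFILE."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hosts.append((fields[0], int(fields[1])))
    return hosts


def plan_cluster(hosts):
    """Assumption: first allocated host is the master; every host runs workers."""
    if not hosts:
        raise ValueError("no hosts were allocated to this job")
    # A host can appear once per queue instance, so drop repeats but keep order.
    names = list(dict.fromkeys(name for name, _slots in hosts))
    return {"master": names[0], "workers": names}


def render_conf(plan):
    """Map Hadoop conf file names to their contents for this job's cluster."""
    return {
        "masters": plan["master"] + "\n",
        "slaves": "".join(name + "\n" for name in plan["workers"]),
    }
```

A prolog built along these lines would write the rendered files into a per-job Hadoop configuration directory and start the daemons, and the matching epilog would stop the daemons and remove that directory once the job ends.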


Data Locality-aware On-Demand Hadoop Cluster

Currently under development - stay tuned!