Grid Scheduler / Grid Engine Profiling HOWTO

                      Profiling the N1GE 6 System
                               Stephan Grell
                              05 September 2006
0. Introduction:
----------------

   This document gives a brief summary on the profiling capabilities built
   into the Grid Engine Software. The profiling facility is one component in
   the troubleshooting and profiling environment. In addition the profiling
   component Grid Engine provides: health monitoring through qping, monitoring
   for the qmaster, and monitoring for the scheduler. 
   The profiling facility can be used to analyse bottlenecks within the software
   or outside and provide guidence for tuning Grid Engine.

1. Format:
----------

   The format of the profiling output is not fixed and can change from one version
   to another or even inbetween version. It is used as a trouble shooting tool and
   will be adapted to new needs when ever they arise. This document is based on
   the profiling output of SGE 6.0u8.

2. Scheduler:
-------------

   switch:
      qconf -msconf
             params      profile=[0,1]

   output: //spool/qmaster/schedd/messages

   Sample Output:
[1]      PROF: sge_mirror processed 30 events in 0.000 s
[2]      PROF: static urgency took 0.000 s
[3]      PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, 
                                    pass 1: 0.000, pass2: 0.000, calc: 0.000 s
[3a]     PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, 
                                    pass 1: 0.000, pass2: 0.000, calc: 0.000 s
[4]      PROF: normalizing job tickets took 0.000 s
[5]      PROF: create active job orders: 0.000 s
[6]      PROF: job-order calculation took 0.010 s
[7]      PROF: job sorting took 0.000 s
[8]      PROF: job dispatching took 0.000 s (1 fast, 0 comp, 0 pe, 0 res)
[9]      PROF: create pending job orders: 0.000 s
[10]     PROF: scheduled in 0.010 (u 0.010 + s 0.000 = 0.010): 
                         1 sequential, 0 parallel, 4 orders, 2 H, 4 Q, 5 QA, 
                         0 J(qw), 1 J(r), 0 J(s), 0 J(h), 0 J(e), 0 J(x), 1 J(all), 
                         52 C, 4 ACL, 2 PE, 2 U, 2 D, 96 PRJ, 1 ST, 0 CKPT, 0 RU, 
                         1 gMes, 0 jMes, 2/1 pre-send, 0/0/0 pe-alg
[11]     PROF: send orders and cleanup took: 0.100 (u 0.000,s 0.000) s
[12]     PROF: schedd run took: 0.120 s (init: 0.000 s, copy: 0.010 s, run:0.110, 
                                      free: 0.000 s, jobs: 1, categories: 1/1)

   Lines Explained:
   ----------------
   [1]  - Is generated from the event processing.
           events - number of events processed (received from the qmaster)
          time s      - time needed to update the internal data
   
   [2]  - Time needed to compute the urgency policy for all jobs in the system
   [3]  - Time needed to compute the different steps from the ticket policy
          init - time needed to setup the internal data structures
          pass 0 - time for data preparation (tfirst pass)
          pass 1 - time for data preparation (tsecond pass)
          pass 2 - time for data preparation (third pass)
          calc   - time for the ticket calculation
   [3a] - Same as [3]. This step is used to correct the tickets for the running jobs.
   [4]  - Time needed to normalize the tickets to a range of 0 to 1
   [5]  - Time needed to create job update tickets for running jobs, and sending them.
   [6]  - Time needed to compute the priority for each job, includes steps: 2,3,3a,4
   
   [7]  - Time needed to sort the job list to reflect the job priority

   [8]  - Time needed to assign jobs to the compute resources and sending the job start
          orders to the qmaster. The following numbers show the jobs, which were assigned:
            fast - number of sequential jobs without soft requests 
            comp - number of sequential jobs with soft requests
            pe   - number of parallel jobs
            res  - number of reservations
           
   [9]  - Time needed to create the priority update orders for all pending jobs.

   [10] - Time needed to schedule the jobs. The output includes all steps between [2] and
          [9]. The time shows the wallclock time as well as the user and system time. 
           sequential - number of sequential jobs started 
           parallel   - number of parallel jobs started
           orders     - number of orders generated
           H          - number of hosts in the grid
           Q          - number of available queue instances in the grid
           QA         - number of all queue instances in the grid
           J(xxx)     - number of job in the different states in the grid
           C          - number of complex attributes in the grid
           ACL        - number of access list in the grid
           PE         - number of parallel environment objects in the grid
           U          - number of users in the grid
           D          - number of departments in the grid
           PRJ        - number of projects in the grid
           ST         - number of share trees in the grid
           CKPT       - number of check pointing objects in the grid
           RU         - number of jobs running per user
           gMes       - number of global messages created in this scheduling run
           jMes       - number of job related messages created in this scheduling run
          <1/2> pre-send  - shows the orders send during the dispatching run [8]
                           1 - number of orders send
                           2 - number of sends
          <1/2/3> pe-alg  - shows which pe-range algorithm was used during this scheduling run:
                           1 - lowest pe range first
                           2 - binary search
                           3 - highest pe range first
   [11] - Time needed to send the left over orders to the qmaster and wait until they are processed.
          In addition the scheduler cleans up.
   [12] - Time needed for the entire scheduling run ([1] to [11]).
          init         - time needed to setup the scheduler for the 1 run 
          copy         - time needed to replicate the entire dataset
          run          - time needed for the scheduling run
          free         - time needed to free up the duplicated data set
          jobs         - number of jobs in the system, before the duplication / filtering
          categories <1/2> - number of categories in the system
                            1 - number of job categories
                            2 - number of job priority categories
   
3. Qmaster:
-----------
   The qmaster allows to profile the different threads in the qmaster.

   switch:
      qconf -mconf
             qmaster_params  PROF_SIGNAL=[0,1],  - profile the signal handling thread
                             PROF_MESSAGE=[0,1], - profile the message processing thread
                             PROF_DELIVER=[0,1], - profile the event delivery thread
                             PROF_TEVENT=[0,1]   - profile the timed event thread
                             
   output: //spool/qmaster/messages