NAME
sched_conf - Sun Grid Engine default scheduler configuration
file
DESCRIPTION
sched_conf defines the configuration file format for Sun
Grid Engine's scheduler. To modify the configuration, use
the graphical user interface qmon(1) or the -msconf option
of the qconf(1) command. A default configuration is provided
together with the Sun Grid Engine distribution package.
Note that Sun Grid Engine allows backslashes (\) to be used
to escape newline characters. The backslash and the newline
are replaced with a space (" ") character before any
interpretation.
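The continuation rule above can be sketched in a few lines of
Python. This is only an illustration: the function name and
the sample entry are hypothetical, not part of Sun Grid
Engine.

```python
def join_continuation_lines(text: str) -> str:
    """Replace every backslash-newline pair with one space."""
    return text.replace("\\\n", " ")


# Hypothetical entry split over two lines with a backslash.
raw = "usage_weight_list cpu=0.5,\\\nmem=0.25,io=0.25\n"
joined = join_continuation_lines(raw)
print(joined)
```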
FORMAT
The following parameters are recognized by the Sun Grid
Engine scheduler if present in sched_conf:
algorithm
Note: Deprecated, may be removed in a future release.
Allows for the selection of alternative scheduling algo-
rithms.
Currently default is the only allowed setting.
load_formula
A simple algebraic expression used to derive a single
weighted load value from all or part of the load parameters
reported by sge_execd(8) for each host and from all or part
of the consumable resources (see complex(5)) being maintained
for each host. The load formula expression syntax is that of
a summation of weighted load values, that is:
{w1|load_val1[*w1]}[{+|-}{w2|load_val2[*w2]}[{+|-}...]]
Note, no blanks are allowed in the load formula.
The load values and consumable resources (load_val1, ...)
are specified by the name defined in the complex (see com-
plex(5)).
Note: Administrator defined load values (see the load_sensor
parameter in sge_conf(5) for details) and consumable
resources available for all hosts (see complex(5)) may be
used as well as Sun Grid Engine default load parameters.
The weighting factors (w1, ...) are positive integers. After
the expression is evaluated for each host the results are
assigned to the hosts and are used to sort the hosts
corresponding to the weighted load. The sorted host list is
used to sort queues subsequently.
The default load formula is "np_load_avg".
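As an illustration of the syntax above, the following Python
sketch evaluates a load formula for each host and sorts the
hosts by the result. It is not the scheduler's own parser:
the function name, the host names, the load values and the
ascending sort order are assumptions made for this example.

```python
import re

# One term: optional sign, a constant or load value name,
# and an optional "*weight" with an integer weight.
TERM = re.compile(r"([+-]?)([^+\-*]+)(?:\*(\d+))?")


def eval_load_formula(formula: str, load_values: dict) -> float:
    """Evaluate e.g. 'np_load_avg*2+mem_used-1' for one host."""
    if " " in formula:
        raise ValueError("no blanks are allowed in the load formula")
    total, pos = 0.0, 0
    while pos < len(formula):
        m = TERM.match(formula, pos)
        if not m or (pos > 0 and not m.group(1)):
            raise ValueError(f"cannot parse formula at offset {pos}")
        sign, name, weight = m.groups()
        value = float(name) if name.isdigit() else float(load_values[name])
        term = value * (int(weight) if weight else 1)
        total += -term if sign == "-" else term
        pos = m.end()
    return total


# Hypothetical load reports for two hosts.
hosts = {
    "hostA": {"np_load_avg": 0.8, "mem_used": 1},
    "hostB": {"np_load_avg": 0.2, "mem_used": 2},
}
formula = "np_load_avg*2+mem_used"
ranked = sorted(hosts, key=lambda h: eval_load_formula(formula, hosts[h]))
print(ranked)
```

Sorting in ascending order (least loaded host first) is an
assumption of this sketch.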
job_load_adjustments
The load imposed by the Sun Grid Engine jobs running on a
system varies over time and often, e.g. for the CPU load,
requires some amount of time to be reported in the
appropriate quantity by the operating system. Consequently,
if a job was started very recently, the reported load may
not provide a sufficient representation of the load which is
already imposed on that host by the job. The reported load
will adapt to the real load over time, but the period of
time, in which the reported load is too low, may already
lead to an oversubscription of that host. Sun Grid Engine
allows the administrator to specify job_load_adjustments
which are used in the Sun Grid Engine scheduler to compen-
sate for this problem.
The job_load_adjustments are specified as a comma separated
list of arbitrary load parameters or consumable resources
and (separated by an equal sign) an associated load correc-
tion value. Whenever a job is dispatched to a host by the
scheduler, the load parameter and consumable value set of
that host is increased by the values provided in the
job_load_adjustments list. These correction values are
decayed linearly over time, starting from the job start,
until they reach the value 0 after
load_adjustment_decay_time. If the job_load_adjustments list
is assigned the special value NONE, no load corrections are
performed.
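The list format described above can be read with a small
Python helper. This is a sketch only: the function name is
hypothetical, and treating the correction values as floating
point numbers is an assumption.

```python
def parse_job_load_adjustments(value: str) -> dict:
    """Parse 'name=correction[,name=correction...]' or NONE."""
    value = value.strip()
    if value.upper() == "NONE":
        return {}  # no load corrections are performed
    adjustments = {}
    for item in value.split(","):
        name, _, correction = item.partition("=")
        if not name.strip() or not correction.strip():
            raise ValueError(f"malformed entry: {item!r}")
        adjustments[name.strip()] = float(correction)
    return adjustments


print(parse_job_load_adjustments("np_load_avg=0.50"))
print(parse_job_load_adjustments("NONE"))
```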
The adjusted load and consumable values are used to compute
the combined and weighted load of the hosts with the
load_formula (see above) and to compare the load and consum-
able values against the load threshold lists defined in the
queue configurations (see queue_conf(5)). If the
load_formula consists simply of the default CPU load average
parameter np_load_avg, and if the jobs are very compute
intensive, one might want to set the job_load_adjustments
list to np_load_avg=1.00, which means that every new job
dispatched to a host is assumed to require 100% CPU time,
and thus the machine's load is instantly increased by 1.00.
load_adjustment_decay_time
The load corrections in the "job_load_adjustments" list
above are decayed linearly over time from the point of the
job start, where the corresponding load or consumable param-
eter is raised by the full correction value, until after a
time period of "load_adjustment_decay_time", where the
correction becomes 0. Proper values for
"load_adjustment_decay_time" greatly depend upon the load or
consumable parameters used and the specific operating
system(s). Therefore, they can only be determined on-site
and experimentally. For the default np_load_avg load param-
eter a "load_adjustment_decay_time" of 7 minutes has proven
to yield reasonable results.
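The linear decay described above can be written down as a
short Python sketch. The function names and the sample
numbers are hypothetical; times are given in seconds, and
420 seconds corresponds to the 7 minutes mentioned above.

```python
def load_correction(adjustment: float, elapsed_s: float,
                    decay_time_s: float) -> float:
    """Correction still applied elapsed_s seconds after job start."""
    elapsed_s = max(elapsed_s, 0.0)
    if decay_time_s <= 0 or elapsed_s >= decay_time_s:
        return 0.0
    return adjustment * (1.0 - elapsed_s / decay_time_s)


def adjusted_load(reported: float, adjustment: float,
                  job_ages_s: list, decay_time_s: float) -> float:
    """Reported load plus the corrections of recently started jobs."""
    return reported + sum(
        load_correction(adjustment, age, decay_time_s) for age in job_ages_s
    )


# np_load_avg=1.00 with a decay time of 7 minutes (420 s):
print(load_correction(1.0, 0, 420))    # full correction at job start
print(load_correction(1.0, 210, 420))  # half of it after 3.5 minutes
print(adjusted_load(0.3, 1.0, [0, 210], 420))
```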
maxujobs
The maximum number of jobs any user may have running in a
Sun Grid Engine cluster at the same time. If set to 0
(default) the users may run an arbitrary number of jobs.
schedule_interval
When the scheduler thread initially registers with the event
master thread in the sge_qmaster(8) process,
schedule_interval is used to set the time interval in which
the event master thread sends scheduling event updates to
the scheduler thread. A scheduling event is a status change
that has occurred within sge_qmaster(8) which may trigger or
affect scheduler decisions (e.g. a job has finished and thus
the allocated resources are available again).
In the Sun Grid Engine default scheduler the arrival of a
scheduling event report triggers a scheduler run. The
scheduler waits for event reports otherwise.
schedule_interval is a time value (see queue_conf(5) for a
definition of the syntax of time values).
queue_sort_method
This parameter determines the order in which several
criteria are taken into account to produce a sorted queue
list. Currently, two settings are valid: seqno and load. In
both cases, however, Sun Grid Engine attempts to maximize
the number of soft requests (see qsub(1) -soft option)
fulfilled by the queues for a particular job as the primary
criterion.
Then, if the queue_sort_method parameter is set to seqno,
Sun Grid Engine will use the seq_no parameter as configured
in the current queue configurations (see queue_conf(5)) as
the next criterion to sort the queue list. The load_formula
(see above) is relevant only if two queues have equal
sequence numbers. If queue_sort_method is set to load, the
load according to the load_formula is the criterion after
maximizing a job's soft requests, and the sequence number is
used only if two hosts have the same load. The sequence
number sorting is most useful if you want to define a fixed
order in which queues are to be filled (e.g. the cheapest
resource first).
The default for this parameter is load.
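The two sort orders can be illustrated with a Python sort
key. This is a sketch, not the scheduler's code: the Queue
class, its field names and the sample queues are
hypothetical, and ascending order for seq_no and load is an
assumption.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    soft_requests_met: int  # soft requests of the job this queue fulfills
    seq_no: int             # seq_no from the queue configuration
    load: float             # host load according to the load_formula


def sort_queues(queues, queue_sort_method="load"):
    """Sort by soft requests first, then by seq_no/load or load/seq_no."""
    if queue_sort_method == "seqno":
        key = lambda q: (-q.soft_requests_met, q.seq_no, q.load)
    elif queue_sort_method == "load":
        key = lambda q: (-q.soft_requests_met, q.load, q.seq_no)
    else:
        raise ValueError("valid settings are seqno and load")
    return sorted(queues, key=key)


queues = [
    Queue("a.q", 1, 20, 0.1),
    Queue("b.q", 1, 10, 0.9),
    Queue("c.q", 2, 30, 0.5),
]
by_load = [q.name for q in sort_queues(queues, "load")]
by_seqno = [q.name for q in sort_queues(queues, "seqno")]
print(by_load, by_seqno)
```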
halftime
When executing under a share based policy, the scheduler
"ages" (i.e. decreases) usage to implement a sliding window
for achieving the share entitlements as defined by the share
tree. The halftime defines the time interval in which accu-
mulated usage will have been decayed to half its original
value. Valid values are specified in hours or according to
the time format as specified in queue_conf(5).
If the value is set to 0, the usage is not decayed.
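A minimal Python sketch of the aging described above,
assuming a continuous exponential half-life decay (the exact
decay steps used by the scheduler are not specified here).
The function name and the sample values are hypothetical.

```python
def decayed_usage(usage: float, elapsed_hours: float,
                  halftime_hours: float) -> float:
    """Usage remaining after elapsed_hours under the given halftime."""
    if halftime_hours == 0:
        return usage  # a halftime of 0 means the usage is not decayed
    return usage * 0.5 ** (elapsed_hours / halftime_hours)


# With a halftime of 168 hours, usage halves every week.
print(decayed_usage(100.0, 168, 168))
print(decayed_usage(100.0, 336, 168))
print(decayed_usage(100.0, 336, 0))
```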
usage_weight_list
Sun Grid Engine accounts for the consumption of the
resources CPU-time, memory and IO to determine the usage
which is imposed on a system by a job. A single usage value
is computed from these three input parameters by multiplying
the individual values by weights and adding them up. The
weights are defined in the usage_weight_list. The format of
the list is
cpu=wcpu,mem=wmem,io=wio
where wcpu, wmem and wio are the configurable weights. The
weights are real numbers. The sum of all three weights
should be 1.
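The weighted sum described above can be sketched in Python
as follows. The function names and the sample usage figures
are hypothetical and only illustrate the arithmetic.

```python
def parse_usage_weight_list(value: str) -> dict:
    """Parse 'cpu=wcpu,mem=wmem,io=wio' into a dict of floats."""
    weights = {}
    for item in value.split(","):
        name, _, weight = item.partition("=")
        weights[name.strip()] = float(weight)
    return weights


def combined_usage(usage: dict, weights: dict) -> float:
    """Multiply each usage value by its weight and add them up."""
    return sum(weights[k] * usage[k] for k in ("cpu", "mem", "io"))


weights = parse_usage_weight_list("cpu=0.5,mem=0.25,io=0.25")
total = combined_usage({"cpu": 10.0, "mem": 4.0, "io": 8.0}, weights)
print(weights, total)
```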
compensation_factor
Determines how fast Sun Grid Engine should compensate for
past usage below or above the share entitlement defined in
the share tree. Recommended values are between 2 and 10,
where 10 means faster compensation.
weight_user
The relative importance of the user shares in the functional
policy. Values are of type real.
weight_project
The relative importance of the project shares in the func-
tional policy. Values are of type real.
weight_department
The relative importance of the department shares in the
functional policy. Values are of type real.
weight_job
The relative importance of the job shares in the functional
policy. Values are of type real.
weight_tickets_functional
The maximum number of functional tickets available for dis-
tribution by Sun Grid Engine. Determines the relative impor-
tance of the functional policy. See under sge_priority(5)
for an overview on job priorities.
weight_tickets_share
The maximum number of share based tickets available for dis-
tribution by Sun Grid Engine. Determines the relative impor-
tance of the share tree policy. See under sge_priority(5)
for an overview on job priorities.
weight_deadline
The weight applied to the remaining time until a job's
latest start time. Determines the relative importance of the
deadline. See under sge_priority(5) for an overview on job
priorities.
weight_waiting_time
The weight applied to a job's waiting time since submission.
Determines the relative importance of the waiting time. See
under sge_priority(5) for an overview on job priorities.
weight_urgency
The weight applied to a job's normalized urgency when
determining the final priority. Determines the relative
importance of urgency. See under sge_priority(5) for an
overview on job priorities.
weight_priority
The weight applied to a job's normalized POSIX priority when
determining the final priority. Determines the relative
importance of POSIX priority. See under sge_priority(5) for
an overview on job priorities.
weight_ticket
The weight applied to the normalized ticket amount when
determining the final priority. Determines the relative
importance of the ticket policies. See under sge_priority(5)
for an overview on job priorities.
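The three weights weight_urgency, weight_priority and
weight_ticket can be read as the coefficients of a weighted
sum over normalized values. The Python sketch below assumes
exactly that combination; sge_priority(5) remains the
authoritative description. The function name and all numbers
are hypothetical.

```python
def final_priority(norm_urgency: float, norm_posix_priority: float,
                   norm_tickets: float, weight_urgency: float,
                   weight_priority: float, weight_ticket: float) -> float:
    """Weighted sum of the three normalized priority components."""
    return (weight_urgency * norm_urgency
            + weight_priority * norm_posix_priority
            + weight_ticket * norm_tickets)


# Illustrative weights only, not the shipped defaults.
prio = final_priority(
    norm_urgency=0.5, norm_posix_priority=0.5, norm_tickets=1.0,
    weight_urgency=0.5, weight_priority=1.0, weight_ticket=0.25,
)
print(prio)
```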
flush_finish_sec
This parameter is provided for tuning the system's
scheduling behavior. By default, a scheduler run is
triggered once per schedule_interval. When this parameter is
set to 1 or larger, the scheduler is triggered the given
number of seconds after a job has finished. Setting this
parameter to 0 disables the flush after a job has finished.
flush_submit_sec
This parameter is provided for tuning the system's
scheduling behavior. By default, a scheduler run is
triggered once per schedule_interval. When this parameter is
set to 1 or larger, the scheduler is triggered the given
number of seconds after a job was submitted to the system.
Setting this parameter to 0 disables the flush after a job
was submitted.
schedd_job_info
The default scheduler can keep track of why jobs could not
be scheduled during the last scheduler run. This parameter
enables or disables the observation. The value true enables
the monitoring; false turns it off.
It is also possible to activate the observation only for
certain jobs. This will be done if the parameter is set to
job_list followed by a comma separated list of job ids.
The user can obtain the collected information with the com-
mand qstat -j.
params
This parameter is intended for passing additional
parameters to the Sun Grid Engine scheduler. The following
values are recognized:
DURATION_OFFSET
If set, overrides the default value of 60 seconds.
This parameter is used by the Sun Grid Engine scheduler
when planning resource utilization as the delta between
net job runtimes and total time until resources become
available again. Net job runtime as specified with -l
h_rt=... or -l s_rt=... or default_duration always
differs from total job runtime due to delays before and
after actual job start and finish. Among the delays
before job start are the time until the end of a
schedule_interval, the time it takes to deliver a job
to sge_execd(8), and the delays caused by prolog in
queue_conf(5), start_proc_args in sge_pe(5) and
starter_method in queue_conf(5). Delays after job
finish include those caused by job termination (notify,
terminate_method or checkpointing), procedures run
after actual job finish, such as stop_proc_args in
sge_pe(5) or epilog in queue_conf(5), and the delay
until a new schedule_interval.
If the offset is too low, resource reservations (see
max_reservation) can be delayed repeatedly due to an
overly optimistic job circulation time.
JC_FILTER
Note: Deprecated, may be removed in a future release.
If set to true, the scheduler limits the number of jobs
it looks at during a scheduling run. At the beginning
of the scheduling run it assigns each job a specific
category, which is based on the job's requests, prior-
ity settings, and the job owner. All scheduling poli-
cies will assign the same importance to each job in one
category. Therefore the jobs in each category have a
FIFO order, and their number can be limited to the
number of free slots in the system.
An exception is jobs that request a resource
reservation. They are included regardless of the number
of jobs in a category.
This setting is turned off by default, because in very
rare cases the scheduler can make a wrong decision. It
is also advised to turn report_pjob_tickets off;
otherwise qstat -ext can report outdated ticket amounts.
The information shown by qstat -j for a job that was
excluded from a scheduling run is very limited.
PROFILE
If set equal to 1, the scheduler logs profiling infor-
mation summarizing each scheduling run.
MONITOR
If set equal to 1, the scheduler records information
for each scheduling run in the file
<sge_root>/<cell>/common/schedule. This information
allows job resource utilization to be reproduced.
PE_RANGE_ALG
This parameter sets the algorithm for the pe range com-
putation. The default is automatic, which means that
the scheduler will select the best one, and it should
not be necessary to change it to a different setting in
normal operation. If a custom setting is needed, the
following values are available:
auto : the scheduler selects the best algorithm
least : starts the resource matching with the
lowest slot amount first
bin : starts the resource matching in the middle
of the pe slot range
highest : starts the resource matching with the
highest slot amount first
Changing params will take immediate effect. The default for
params is none.
reprioritize_interval
Interval (HH:MM:SS) to reprioritize jobs on the execution
hosts based on the current ticket amount for the running
jobs. If the interval is set to 00:00:00 the reprioritiza-
tion is turned off. The default value is 00:00:00. The
reprioritization tickets are calculated by the scheduler, and
update events for running jobs are sent only after the
scheduler has calculated new values. How often the scheduler
should calculate the tickets is defined by the
reprioritize_interval. Because the scheduler is triggered
only in a specific interval (schedule_interval), the
reprioritize_interval has an effect only if it is set
greater than the schedule_interval. For example, if the
schedule_interval is 2 minutes and reprioritize_interval is
set to 10 seconds, the jobs get reprioritized every 2
minutes.
report_pjob_tickets
This parameter allows tuning of the system's scheduling run
time. It is used to enable or disable the reporting of
pending job tickets to the qmaster. It does not influence
the tickets calculation. When the reporting is turned off,
the sort order of jobs in qstat and qmon is based only on
the submit time.
The reporting should be turned off in a system with a very
large number of jobs by setting this parameter to "false".
halflife_decay_list
The halflife_decay_list allows configuring different decay
rates for the "finished_jobs" usage types, which are used in
the pending job ticket calculation to account for jobs which
have just ended. This allows the pending jobs algorithm to
count finished jobs against a user or project for a
configurable decayed time period. This feature is turned off
by default, and the halftime is used instead.
The halflife_decay_list also allows one to configure dif-
ferent decay rates for each usage type being tracked (cpu,
io, and mem). The list is specified in the following format:
<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>]]
<USAGE_TYPE> can be one of the following: cpu, io, or mem.
<TIME> can be -1, 0 or a timespan specified in minutes. If
<TIME> is -1, only the usage of currently running jobs is
used. 0 means that the usage is not decayed.
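The list format above can be parsed with a few lines of
Python. This is a sketch under assumptions: the function
name is hypothetical, and <TIME> is treated as an integer
number of minutes.

```python
def parse_halflife_decay_list(value: str) -> dict:
    """Parse '<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>...]'."""
    allowed = {"cpu", "io", "mem"}
    result = {}
    for item in value.split(":"):
        usage_type, sep, minutes = item.partition("=")
        if usage_type not in allowed or not sep:
            raise ValueError(f"malformed entry: {item!r}")
        time_value = int(minutes)
        if time_value < -1:
            raise ValueError("<TIME> must be -1, 0 or a timespan in minutes")
        result[usage_type] = time_value
    return result


# -1: only usage of running jobs is used; 0: usage is not decayed.
decay = parse_halflife_decay_list("cpu=1440:mem=0:io=-1")
print(decay)
```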
policy_hierarchy
This parameter sets up a dependency chain of ticket based
policies. Each ticket based policy in the dependency chain
is influenced by the previous policies and influences the
following policies. A typical scenario is to assign pre-
cedence for the override policy over the share-based policy.
The override policy determines in such a case how share-
based tickets are assigned among jobs of the same user or
project. Note that all policies contribute to the ticket
amount assigned to a particular job regardless of the policy
hierarchy definition. Yet the tickets calculated in each of
the policies can be different depending on
"policy_hierarchy".
The "policy_hierarchy" parameter can be a combination of up
to 3 letters, the first letters of the 3 ticket based
policies S(hare-based), F(unctional) and O(verride). So a
value of "OFS" means that the override policy takes
precedence over the functional policy, which finally
influences the share-based policy. Fewer than 3 letters mean
that some of the policies do not influence other policies
and also are not influenced by other policies. So a value of
"FS" means that the functional policy influences the
share-based policy and that there is no interference with
the other policies.
The special value "NONE" switches off policy hierarchies.
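A policy hierarchy value can be checked and expanded with a
short Python sketch. The function name is hypothetical, and
rejecting repeated letters is an assumption of this example.

```python
POLICIES = {"S": "share-based", "F": "functional", "O": "override"}


def parse_policy_hierarchy(value: str) -> list:
    """Return the policies in order of precedence, or [] for NONE."""
    if value == "NONE":
        return []  # policy hierarchies are switched off
    if not 1 <= len(value) <= 3 or len(set(value)) != len(value):
        raise ValueError("expected up to 3 distinct letters from S, F, O")
    try:
        return [POLICIES[letter] for letter in value]
    except KeyError as exc:
        raise ValueError(f"unknown policy letter: {exc.args[0]}") from None


print(parse_policy_hierarchy("OFS"))
print(parse_policy_hierarchy("FS"))
print(parse_policy_hierarchy("NONE"))
```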
share_override_tickets
If set to "true" or "1", override tickets of any override
object instance are shared equally among all running jobs
associated with the object. Pending jobs get as many
override tickets as they would have if they were running.
If set to "false" or "0", each job gets the full value of
the override tickets associated with the object. The default
value is "true".
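The effect of share_override_tickets on a single job can be
sketched as follows. The function name and the ticket
numbers are hypothetical.

```python
def override_tickets_per_job(object_tickets: float, running_jobs: int,
                             share_override_tickets: bool = True) -> float:
    """Override tickets one job receives from its override object."""
    if running_jobs <= 0:
        return 0.0
    if share_override_tickets:
        # Tickets are shared equally among the object's running jobs.
        return object_tickets / running_jobs
    # Each job gets the full value of the object's override tickets.
    return float(object_tickets)


print(override_tickets_per_job(100, 4, True))
print(override_tickets_per_job(100, 4, False))
```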
share_functional_shares
If set to "true" or "1", functional shares of any functional
object instance are shared among all the jobs associated
with the object. If set to "false" or "0", each job
associated with a functional object gets the full functional
shares of that object. The default value is "true".
max_functional_jobs_to_schedule
The maximum number of pending jobs to schedule in the func-
tional policy. The default value is 200.
max_pending_tasks_per_job
The maximum number of subtasks per pending array job to
schedule. This parameter exists in order to reduce schedul-
ing overhead. The default value is 50.
max_reservation
The maximum number of reservations scheduled within a
schedule interval. When a runnable job cannot be started
due to a shortage of resources, a reservation can be
scheduled instead. A reservation can cover consumable
resources of the global host, any execution host and any
queue. For parallel jobs, reservations are also made for the
slots resource as specified in sge_pe(5). The job runtime is
assumed to be the maximum of the -l h_rt=... and -l s_rt=...
values. For jobs that have neither of them, the
default_duration is assumed. Reservations prevent jobs of
lower priority, as specified in sge_priority(5), from
utilizing the reserved resource quota during the time of
reservation. Jobs of lower priority are allowed to utilize
those reserved resources only if their prospective job end
is before the start of the reservation (backfilling).
Reservation is done only for non-immediate jobs (-now no)
that request reservation (-R y). If max_reservation is set
to "0", no job reservation is done.
Note that reservation scheduling can be costly in terms of
performance and is therefore switched off by default. Since
the performance cost of reservation scheduling is known to
grow with the number of pending jobs, the use of the -R y
option is recommended only for those jobs actually queuing
for bottleneck resources. Together with the max_reservation
parameter this technique can be used to narrow down
performance impacts.
default_duration
When job reservation is enabled through the max_reservation
parameter, the default_duration is assumed as runtime for
jobs that have neither -l h_rt=... nor -l s_rt=...
specified. In contrast to an h_rt/s_rt time limit, the
default_duration is not enforced.
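The runtime assumption and the backfilling condition
described under max_reservation and default_duration can be
sketched in Python. The function names are hypothetical,
times are in seconds, and accepting a job that ends exactly
at the reservation start is an assumption of this example.

```python
from typing import Optional


def assumed_runtime(h_rt: Optional[int], s_rt: Optional[int],
                    default_duration: int) -> int:
    """Maximum of h_rt and s_rt if given, else default_duration."""
    limits = [t for t in (h_rt, s_rt) if t is not None]
    return max(limits) if limits else default_duration


def may_backfill(now: int, job_runtime: int, reservation_start: int) -> bool:
    """True if the job's prospective end is not after the reservation start."""
    return now + job_runtime <= reservation_start


print(assumed_runtime(3600, 1800, 600))
print(assumed_runtime(None, None, 600))
print(may_backfill(0, 600, 900))
print(may_backfill(0, 1200, 900))
```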
FILES
<sge_root>/<cell>/common/sched_configuration
scheduler thread configuration
SEE ALSO
sge_intro(1), qalter(1), qconf(1), qstat(1), qsub(1), com-
plex(5), queue_conf(5), sge_execd(8), sge_qmaster(8), Sun
Grid Engine Installation and Administration
COPYRIGHT
See sge_intro(1) for a full statement of rights and permis-
sions.