Hey
have you taken a look at the new code to be released in 2.7.0?
Please do; you can get a head start on writing the backend plugin. The
ERT calc is now done by a generic (honest) framework that is batch
system independent. I hope it's a good one; three years of work have
gone into it.
The doc on what the backend plugin needs to do is included as an attachment.
JT
David McBride wrote:
> On Fri, 2005-12-16 at 12:56 +0100, Jeff Templon wrote:
>
>
>>On the other hand, if there are no waiting jobs, then as long as there
>>is even one free CPU (the second kind, not the first kind ;-) then ERT
>>should be zero, regardless of how many jobs there are running.
>
>
> Yes. This is something that I'd implemented in my SGE adaptor, too.
>
> (I'd stick a reference to the code here, but the SGE adaptor does a
> number of interesting backflips that I suspect would just be confusing.
> For example, none of the queues that it is reporting on actually
> physically exist..)
>
> Cheers,
> David
The output of the LRMS-specific part needs to contain a snapshot
of the state of the LRMS. This state should be as faithful as
possible; 'massaging' of the state should be left to higher-level
programs such as the ERT system (which handles mapping of unix
group names to VO names). Placing the massaging at a higher
level and keeping the LRMS-specific part pristine has two main
advantages:
1) the massaging is uniform across LRMS types, so one can at least
hope that there won't be some LRMS bias in the estimates;
2) if the LRMS tool reports the real information, it might well be
useful for some purpose besides predicting ERTs.
==========================================================
The required format of this file is described below.
EXAMPLE FILE
nactive 240
nfree 191
now 1119073982
schedCycle 120
{'queue': 'atlas', 'start': 1119073982.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 345600.0, 'qtime': 1119073781.0, 'jobid': '612049.tbn20.nikhef.nl'}
{'queue': 'qlong', 'start': 1119060911.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 259200.0, 'qtime': 1119060774.0, 'jobid': '612043.tbn20.nikhef.nl'}
{'queue': 'atlas', 'start': 1119060910.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 345600.0, 'qtime': 1119060759.0, 'jobid': '612039.tbn20.nikhef.nl'}
{'queue': 'qlong', 'start': 1119136200.0, 'state': 'running', 'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 259200.0, 'qtime': 1119135972.0, 'jobid': '612176.tbn20.nikhef.nl'}
{'queue': 'dzero', 'start': 1119268211.0, 'state': 'running', 'group': 'dzero', 'user': 'dzero004', 'maxwalltime': 345600.0, 'qtime': 1119268047.0, 'jobid': '612241.tbn20.nikhef.nl'}
===========================
The structure between "{}" characters is repeated, one line
for each job currently either executing or waiting in the queue. Here
are some explanations of the semantics of the values:
nactive is the number of job slots that are actually capable of
running jobs at the snapshot time (let's call the snapshot time t0 for
brevity). By 'actually capable of running jobs' I mean the maximum
number of jobs that could be running on the system at t0. So nactive
counts all job slots, empty or occupied, but does not count the job
slots on CPUs that are 'down' or 'offline'. It's not the theoretical
maximum number of job slots in your farm (unless ALL your WNs are
working); it's the number that are 'up'.
nfree is the number of these active job slots that do not have an
assigned job at t0. They can potentially accept a new job at t0 (or
at least at the start of the next scheduling cycle).
Note these numbers don't have anything to do with VOs (unless each
node happens to be exclusively assigned to a single VO). They are
aggregates of all job slots that are being controlled by a single
LRMS.
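The distinction between nactive and nfree can be sketched like this.
The per-node data structure here is purely illustrative (a real
backend would query its LRMS, e.g. pbsnodes for PBS, for actual node
states and slot counts):

```python
# Hypothetical per-node summary; field names are illustrative only.
nodes = [
    {"state": "free",    "slots": 2, "jobs": 0},
    {"state": "busy",    "slots": 2, "jobs": 2},
    {"state": "down",    "slots": 2, "jobs": 0},  # counts for neither number
    {"state": "offline", "slots": 2, "jobs": 0},  # counts for neither number
]

UP = lambda n: n["state"] not in ("down", "offline")

# nactive: every slot on a node that is 'up', occupied or not.
nactive = sum(n["slots"] for n in nodes if UP(n))
# nfree: active slots with no job assigned at snapshot time t0.
nfree = sum(n["slots"] - n["jobs"] for n in nodes if UP(n))

print(nactive, nfree)  # 4 2
```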
'now' is a timestamp, in seconds, of when the queue was inspected. The
only constraint here is that 'now' has to be in the same units, and
have the same zero reference, as all the times in the per-job lines
(like 'qtime' or 'start'). In the PBS version provided, 'now'
is in local time seconds, meaning seconds since midnight
Jan 1st 1970 local time. Again, as long as the units are seconds
and all times have the same reference point, the actual reference
point does not matter.
'schedCycle' is the cycle time of your batch scheduler; how often does
it start a new scheduling pass? As of this writing at NIKHEF it is
120 seconds, meaning a new scheduling attempt is started every 120
seconds.
Each line thereafter reports the info for a single job.
{'queue': 'qlong', 'start': 1119060911.0, 'state': 'running', \
'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 259200.0, \
'qtime': 1119060774.0, 'jobid': '612043.tbn20.nikhef.nl'}
This has the structure { 'key1' : 'attr1', 'key2' : 'attr2' } and
is written in this particular format because it is the string
representation of a Python 'dictionary' (the same as a Perl 'hash'),
making the input parsing for the higher-level part very easy. The
order of the various keys is irrelevant; you could write
{'key2' : 'attr2', 'key1' : 'attr1' } if you wanted.
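Because each job line is the string representation of a Python
dictionary, the higher-level parser can be a one-liner; a sketch
using the standard library (ast.literal_eval parses the literal
without executing arbitrary code, unlike eval):

```python
import ast

# One job line taken verbatim from the example file above.
line = ("{'queue': 'qlong', 'start': 1119060911.0, 'state': 'running', "
        "'group': 'atlsgm', 'user': 'atlsm003', 'maxwalltime': 259200.0, "
        "'qtime': 1119060774.0, 'jobid': '612043.tbn20.nikhef.nl'}")

job = ast.literal_eval(line)  # safely turn the dict literal into a dict
print(job["queue"], job["state"])  # qlong running
```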
Not all the fields are required, but they should be consistent.
All jobs should have a 'qtime', since they must have entered the
queue at some point. If a job is in state 'running' it must
have a 'start' time; if it is 'queued' then 'start' should be
absent.
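The consistency rules above could be checked with something like the
following (a hypothetical helper, not part of the framework):

```python
def check_job(job):
    """Validate the field-consistency rules for one job dictionary."""
    if "qtime" not in job:
        return False  # every job must have entered the queue at some point
    if job["state"] == "running" and "start" not in job:
        return False  # a running job must have a start time
    if job["state"] == "queued" and "start" in job:
        return False  # a queued job must not have a start time
    return True

print(check_job({"state": "running", "qtime": 1.0, "start": 2.0}))  # True
print(check_job({"state": "queued", "qtime": 1.0, "start": 2.0}))   # False
```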
Here is a bit of explanation of the various fields:
In the example above, the local PBS jobid is 612043.tbn20.nikhef.nl;
this just has to be a unique string (no two jobs should have the same
string).
qtime is the timestamp when the job entered the queue, with the same
reference point as 'now'; now - qtime tells you how long it has been
since the job was submitted. maxwalltime is the maximum amount of
real (wall-clock) time, in seconds, that the execution of a job in
this queue may take. 'user' and 'group' are the pool account ids
under which the job runs. For the current implementation we assume
that group name == VO name.
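Taking 'now' from the example header and 'qtime' from the first qlong
job above, the queue wait works out as:

```python
now = 1119073982        # snapshot time from the example header
qtime = 1119060774.0    # qtime of job 612043.tbn20.nikhef.nl
wait = now - qtime      # seconds since the job was submitted
print(wait)  # 13208.0
```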
'state' can be either 'queued', 'running', 'pending', or 'done'.
'pending' means it is in the queue but has been placed on 'hold'.
'start' is the time stamp for when the job actually started to
execute. Again needs to be measured in the same coords as 'now'.
Finally 'queue' gives the name of the queue in which this job is
running (like 'qlong').
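Putting the pieces together, a higher-level consumer of the snapshot
file might read it like this. This is only a sketch; the split
between 'key value' header lines and dict-literal job lines is
inferred from the example file above:

```python
import ast

def read_snapshot(path):
    """Parse an LRMS snapshot file into (header, jobs).

    Header lines are 'key value' pairs (nactive, nfree, now, schedCycle);
    job lines are Python dict literals, one per job.
    """
    header, jobs = {}, []
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line:
                continue
            if line.startswith("{"):
                jobs.append(ast.literal_eval(line))
            else:
                key, value = line.split(None, 1)
                header[key] = float(value)
    return header, jobs
```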