I fully agree that Item A (submission) and Items B/C (site selection)
are different.
Regarding what should be passed to the LRMS, the following list was
produced at Spring HEPiX 2005.
The problem identified there was that job submission to the local batch
system does not forward critical submission parameters:
* Job Name
* CPU Time required
* Wall Clock Time required
* Total RAM required
* Swap space required
* Temporary disk space required
* Specific operating system required
* Speed of processor required
Less critical parameters are:
* External network access
* Project
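To make the list concrete, here is a minimal sketch of how these parameters could be forwarded to a PBS-style batch system. The dictionary keys and the mapping onto qsub resource names are my own illustrative assumptions, not an agreed standard.

```python
def to_qsub_args(req):
    """Build a qsub argument list from a dict of submission parameters.

    The parameter names (CPUTime, RAM, ...) and their mapping onto
    PBS "-l" resources are assumptions for illustration only.
    """
    mapping = {
        "CPUTime": "cput",       # CPU time required
        "WallTime": "walltime",  # wall clock time required
        "RAM": "mem",            # total RAM required
        "Swap": "vmem",          # swap space required
        "TmpDisk": "file",       # temporary disk space required
    }
    args = []
    if "JobName" in req:
        args += ["-N", req["JobName"]]
    for key, resource in mapping.items():
        if key in req:
            args += ["-l", "%s=%s" % (resource, req[key])]
    return args

# Example: a job needing 30 minutes of CPU and 512 MB of RAM
args = to_qsub_args({"JobName": "myjob", "CPUTime": "30:00", "RAM": "512mb"})
# -> ['-N', 'myjob', '-l', 'cput=30:00', '-l', 'mem=512mb']
```

An equivalent plugin for LSF would emit bsub options (e.g. -W for wall time) instead.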
We had further discussions with Francesco Prelz at HEPiX Fall 2005 at
SLAC. The approach favoured there was to translate the ClassAds into a
standard format (e.g. variable/operator/value triples such as
MaxCPUTime > 30), with a batch-specific plugin then translating these
into the appropriate bsub/qsub options for the local batch system.
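A rough sketch of that two-stage scheme, assuming a PBS back end; the function names and the variable-to-resource table are hypothetical:

```python
import re

def parse_requirement(expr):
    """Split a clause like 'MaxCPUTime > 30' into ('MaxCPUTime', '>', '30')."""
    m = re.match(r"\s*(\w+)\s*(<=|>=|==|<|>|=)\s*(\S+)\s*$", expr)
    if not m:
        raise ValueError("cannot parse requirement: %r" % expr)
    return m.groups()

def translate_pbs(triples):
    """Hypothetical PBS plugin.

    A JDL clause like "MaxCPUTime > 30" asks for a queue whose limit
    exceeds 30 minutes, i.e. the job needs up to 30 minutes of CPU,
    so it becomes "-l cput=30:00" on the qsub command line.
    """
    resource = {"MaxCPUTime": "cput", "MaxWallTime": "walltime"}
    opts = []
    for var, op, val in triples:
        if var in resource and op in (">", ">="):
            opts.append("-l %s=%s:00" % (resource[var], val))
    return "qsub " + " ".join(opts) + " job_script"

cmd = translate_pbs([parse_requirement("MaxCPUTime > 30")])
# -> "qsub -l cput=30:00 job_script"
```

A bsub plugin would implement the same interface but emit LSF options, keeping the standard-format layer batch-system neutral.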
Item B is partly covered by the Glue 1.2 schema in that subclusters are
defined. However, there is no per-subcluster reporting of free slots,
estimated response time (ERT), etc., so the data cannot easily be used
for scheduling. I think this should be included in the Glue 2.0
requirements.
If we get Item B covered with an improved Glue schema, this will reduce
the chance of submitting to sites whose high-class workers are
overloaded.
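To illustrate why per-subcluster reporting would help a broker, here is a toy ranking function; the free_slots and max_cpu attributes are hypothetical additions, not part of Glue 1.2:

```python
def rank_sites(sites, min_cpu_minutes):
    """Return site names that have free slots on a subcluster meeting the
    CPU-time requirement, ordered by total matching free slots (descending).

    The per-subcluster "free_slots"/"max_cpu" attributes are assumed
    for illustration; Glue 1.2 does not publish them.
    """
    matches = []
    for name, subclusters in sites.items():
        free = sum(sc["free_slots"] for sc in subclusters
                   if sc["max_cpu"] >= min_cpu_minutes)
        if free > 0:
            matches.append((free, name))
    return [name for free, name in sorted(matches, reverse=True)]

# Two sites: siteA's long-queue subcluster is full, its short-queue one is not
sites = {
    "siteA": [{"max_cpu": 2880, "free_slots": 0},
              {"max_cpu": 720, "free_slots": 5}],
    "siteB": [{"max_cpu": 2880, "free_slots": 3}],
}
# A 24-hour job can only go to siteB; a 10-hour job prefers siteA
```

With only the current "worst WN" publishing, a broker could not make this distinction and might send the 24-hour job to siteA's overloaded high-class subcluster.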
Tim
Jeff Templon wrote:
> Yo
>
> There are at least three problems trying to be solved here, which is
> responsible for a lot of the confusion. It would be very good to try
> to continue this discussion in a way that makes it clear which
> problem(s) is being addressed by the proposed solution or comment.
>
> A. if the USER specifies some REQUIREMENTS at SUBMIT TIME, how can we
> have the GRID LAYER pass these down to the LRMS layer??
>
> A concrete example: user specifies something like
>
> other.GlueCEPolicyMaxCPUTime > 30
>
> in the JDL. When the job lands on a site and is submitted to the LRMS
> by the grid layer, the LRMS would be told that the job requires no
> more than 30 minutes of CPU time, for example by doing
>
> qsub -l cput=30:00 job_script
>
> or perhaps doing a bare submit and using 'qalter' to tell it about the
> cputime requirement.
>
> B. if the SITE has a mix of WORKER NODES, how can we specify the
> different classes of WNs to the outside world, instead of advertising
> that all our machines have the characteristics of our 'worst' WN?
>
> C. if we are able to deal with B, but the WNs of different classes are
> behind the same gatekeeper, what do we do when an incoming job wants a
> high-class WN, but all our free slots are on low-class WNs?
>
> It seems to me that C is what David is trying to address. I would
> prefer to solve the important problems first, C seems to be at least a
> second-order problem if not third order.
>
> I interpreted the original call for comments to be about A above, not B
> or C. The question was for A, what else can you think of besides
> 'cput' that you'd want passed from the grid layer into the LRMS layer.
>
> J "but I could be wrong" T
>
> David Rebatto wrote:
>
>> Burke, S (Stephen) wrote:
>>
>>> LHC Computer Grid - Rollout
>>> [mailto:[log in to unmask]] On Behalf Of Charles Loomis said:
>>>
>>>>> 1) Keep the current syntax, allow matching against multiple
>>>>> subclusters, and pass the subcluster name to the batch system.
>>>>
>>>> This is not a solution to the problem.
>>>
>>> It's a solution to part of the problem, i.e. that currently jobs may
>>> avoid sites even if only 1 WN out of 500 doesn't match the requirement,
>>> because you have to publish the most restrictive limit.
>>>
>>>> I would instead opt for a hybrid approach of 2) and 3). Allow
>>>> people to define parameters like in 3) and have whatever processes
>>>> the final JDL combine those with any explicit requirements to
>>>> arrive at the full expression. Only those limits given separately
>>>> would be passed to the local batch system.
>>>
>>> Yes, that doesn't sound too bad - but in itself it wouldn't solve the
>>> above problem, so you might still want to think about doing subclusters
>>> properly, or else changing the glue schema to go back to max/min
>>> values.
>>>
>>> Stephen
>>
>> Hi,
>> my idea was more or less the option 3) proposed by Stephen. But, as
>> he said, this doesn't solve the underusage problem created by having
>> the min values published in the GRIS. Anyway, if we go with a max/min
>> schema, or if we publish only max values, we have to face another
>> problem: how do we handle a job dispatched to a CE when there are no
>> free nodes matching its requirements (e.g. because they are busy)?
>> The CE could reject it, and the WMS retry mechanism would submit it
>> somewhere else, but this sounds very inefficient.
>> Another solution would be that the CE keeps the job queued until a
>> suitable node is free, but this would kill any hope for the WMS to
>> make any intelligent decision, as this additional queueing time would
>> not be visible in the glue schema.
>> A third option could be the implementation of a direct WMS <-> CE
>> negotiation before the actual job submission. This would require some
>> completely new code on both WMS and CE, and would be even heavier
>> than the simple check with the CE GRIS which LCG wanted removed from
>> the brokers...
>>
>> Sorry if you already discussed this problem, I've tried to read all
>> the messages in the thread but I could still have missed something...
>>
>> Cheers,
>> David
>>