Mark Slater [mailto:[log in to unmask]] said:
> It seems our publishing is off for LHCb but I'm not sure how
> to fix it.
> It seems the MaxCPU time is not publishing correctly and indeed:
>
> [root@epgr02 ~]# ldapsearch -LLL -x -h localhost -p 2170 -b o=grid |
> grep 999999
> ...
> GlueCEPolicyMaxCPUTime: 999999999
> ...
> GlueCEPolicyMaxCPUTime: 999999999
> ...
What limit do you have set in your batch system? Normally the info providers read it dynamically from there, and you get the "all nines" limit either if there is no limit or if the provider fails to determine it. (There is currently a ticket to have a different value for those two cases so you can distinguish them.)
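If you want to check the published values mechanically rather than by eye, a minimal sketch in Python is below. Only the attribute name `GlueCEPolicyMaxCPUTime` and the 999999999 sentinel come from the output above; the helper function itself is hypothetical, not part of any info provider.

```python
# Sketch: flag the "all nines" MaxCPUTime sentinel in ldapsearch (LDIF) output.
# Hypothetical helper; attribute name and sentinel value are from the thread.

SENTINEL = 999999999  # published when there is no limit, or the provider failed

def check_max_cpu(ldif_text):
    """Return (value, is_sentinel) for each GlueCEPolicyMaxCPUTime line."""
    results = []
    for line in ldif_text.splitlines():
        if line.startswith("GlueCEPolicyMaxCPUTime:"):
            value = int(line.split(":", 1)[1])
            results.append((value, value == SENTINEL))
    return results

sample = """GlueCEPolicyMaxCPUTime: 999999999
GlueCEPolicyMaxCPUTime: 4320"""
print(check_max_cpu(sample))  # [(999999999, True), (4320, False)]
```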
To expand on this a bit: there has been quite a long-running discussion/dispute about these limits. From my point of view, as far as the information system goes, having no CPU and/or wallclock limit is not an error; it just means that you will never kill jobs, and the "all nines" value effectively and correctly means "infinity".
On the other hand, from an operational standpoint it may well not be a very sensible thing to do - usually you will want some limit, both to stop jobs clogging up slots and to terminate jobs which get stuck. You could potentially rely only on a wallclock limit, but usually you want a CPU limit which is less than the wallclock limit, to allow jobs some headroom to e.g. stage files in and out. Relying only on a CPU limit is probably a bad move, because a stuck job may not be consuming any CPU. Anyway, to my knowledge there is no formal operational policy on how the limits should be set.
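As a purely illustrative example of the headroom idea, on a Torque/PBS site the queue limits might be set along these lines. The queue name and the numbers are invented for illustration, not a recommendation, and other batch systems have their own equivalents:

```shell
# Hypothetical Torque/PBS illustration of the headroom idea: the queue name
# "lhcb" and the numbers here are invented, not a site recommendation.
# Wallclock limit: jobs are killed after 48 hours of real time.
qmgr -c "set queue lhcb resources_max.walltime = 48:00:00"
# CPU limit set below the wallclock limit, leaving headroom for file staging.
qmgr -c "set queue lhcb resources_max.cput = 42:00:00"
```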
LHCb is now taking a harder line since they use the limits in their scheduling, so they are requiring all sites which support them to set a finite (and presumably sensible) limit for LHCb queues - I assume that's why you got a ticket.
Stephen