On Fri, 14 May 2004, Ian Stokes-Rees wrote:
> TOTAL CPU: 2262
> FREE CPU: 1364
>
> RUNNING JOBS: 527
> WAITING JOBS: 848
I think the cpus are being multiply-counted. Each CE (i.e. each batch
queue) publishes its own cpu count, and it looks to me as if gridice is
adding them all together, e.g. it says that RAL has 432 cpus but in fact
it seems to have 72 dual-processor WNs, and 3*2*72 is 432 (3 queues, so 3
CEs). If that's typical there would be more like 700 cpus in reality, so
527 jobs is pretty full. (It also says RAL has 411 free slots but only 7
jobs running: 7*3 = 21, and 21 + 411 = 432 ...)
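
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. The function names are made up for illustration, and dividing the
grid-wide total by 3 assumes every site publishes three queues the way RAL
appears to, which is only a guess; treat the result as an estimate.

```python
# Back-of-envelope check of the suspected multiple counting.
# All names here are hypothetical; the numbers are the ones quoted above.

def physical_cpus(worker_nodes, cpus_per_node):
    """CPUs that actually exist at a site."""
    return worker_nodes * cpus_per_node


def published_cpus(worker_nodes, cpus_per_node, queues):
    """What a monitor shows if it sums the count published by every CE."""
    return physical_cpus(worker_nodes, cpus_per_node) * queues


def estimated_real_total(published_total, queues_per_site):
    """Undo the overcount, ASSUMING every site has the same queue count."""
    return published_total // queues_per_site


# RAL: 72 dual-processor WNs behind 3 queues (3 CEs).
ral_real = physical_cpus(72, 2)
ral_published = published_cpus(72, 2, 3)

# RAL cross-check: 7 running jobs seen once per queue, plus 411 free slots.
ral_slots_accounted = 7 * 3 + 411

# Grid-wide figure from the monitoring page, corrected under the assumption.
grid_estimate = estimated_real_total(2262, 3)

print("RAL physical cpus:  ", ral_real)
print("RAL published cpus: ", ral_published)
print("RAL running*3+free: ", ral_slots_accounted)
print("Grid cpus (estimate):", grid_estimate)
```

If sites differ in how many queues they publish, the divisor would have to
be applied per site rather than to the total.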
> Also, for the ~850 WAITING jobs, given the ~1350 FREE CPUs, that seems
> quite surprising. Is there any way to find out why the waiting jobs
> won't match to the free CPUs?
Look at the site breakdown. The waiting jobs are nearly all at NIKHEF,
which has the VO limit, so probably whoever submitted them has hit
their quota even though some nodes are free. The only other site with many
jobs waiting is Torino, which really does seem to be full; presumably those
jobs are either local or have been sent there deliberately.
...
In fact, looking a bit more, the NIKHEF jobs are mostly owned by "willem"
and "tdykstra", i.e. local users. At Torino the queued jobs mostly belong
to dteam, so test jobs of some kind, and some to alicesgm, i.e. the alice
software manager account, which is presumably trying to install or test
the software at that site.
Stephen