Rob Fay wrote:
> Out of interest, what's the limiting factor(s) on the number of these
> jobs running simultaneously at a particular site?
>
> The Liverpool cluster has only been around half full over the last day,
> and nothing is bottlenecked at this end (bandwidth, etc.), so I'm
> guessing there's another limit set somewhere?
I was slightly worried about the fact that our cluster wasn't (and
indeed still isn't) full.
Dan (ccd) increased the number of jobs running, and at 17:39 we peaked
at 510 jobs running, though that has now dropped back to 197.
The load on ce01, our old CE, went up to around 18 - could that be
causing a problem getting jobs into the cluster?
If that is the case, it may be worth trying to send some jobs via our
new CE, ce03. We did have problems with it though, and while it is now
passing SAM tests, it is perhaps better not to mix up two tests at the
same time.
Chris
PS We seem to have overtaken Glasgow :-).
>
> Rob
>
> Graeme Stewart wrote:
>> So, a quick look here:
>>
>> http://panda.cern.ch:25980/server/pandamon/query?dash=analysis
>>
>> suggests
>>
>> Lancaster: Some internal storage problems:
>>
>> http://voatlas19.cern.ch:25980/server/pandamon/query?job=1012759079
>> "Get error: rfcp failed: 512,
>> /dpm/lancs.ac.uk/home/atlas/atlasmcdisk/mc08/AOD/mc08.106453.AMSB4_jimmy_susy.merge.AOD.e357_s462_r635_t53_tid068283/AOD.068283._00001.pool.root.1
>> : No route to host"
>>
>>
>> Manc-2: Some WNs short of scratch space:
>>
>> http://voatlas19.cern.ch:25980/server/pandamon/query?job=1012742162
>> Too little space left on local disk to run job: 2051072 kB (need >
>> 2097152 kB)
>>
>>
>> Oxford: Some problems I don't understand
>>
>> http://panda.cern.ch:25980/server/pandamon/query?mode=archive&type=analysis&computingSite=ANALY_OX&jobStatus=failed&hours=24
>>
>>
>> "task buffer expired"
>>
>> Are their analysis pilots running?
>>
>>
>> Sheffield: LFC lookup problems and stage-in/out problems (network
>> issues?):
>>
>> http://panda.cern.ch:25980/server/pandamon/query?job=1012762913
>> http://panda.cern.ch:25980/server/pandamon/query?job=1012750235
>> http://panda.cern.ch:25980/server/pandamon/query?job=1012742714
>>
>>
>> Glasgow is ahead for now, but QMUL is coming up fast with their
>> supercharged lustre system....
>>
>> Graeme
>>
>> On Wed, Jun 24, 2009 at 11:38, Daniel van der Ster
>> <[log in to unmask]> wrote:
>>> Test 479 set to start at 12:00 today.
>>> Cheers,
>>> Dan
>>>
>>>
>>> 2009/6/24 Graeme Stewart <[log in to unmask]>:
>>>> Hi Dan/Johannes
>>>>
>>>> We want to test file:/// access at QMUL, which is now set up in panda.
>>>> Could you start a panda hammercloud cloud for the UK, to last until
>>>> midnight tonight? This should allow the site(s) to do a good sweep
>>>> through the number of running jobs in their systems.
>>>>
>>>> I'm anticipating general interest so please send to all the UK ANALY
>>>> queues. (Any site which is in bad shape for testing can shut off
>>>> pilots or apply severe batch system limits.)
>>>>
>>>> Thanks
>>>>
>>>> Graeme
>>>>
>>>> PS. Sorry for the short notice, but RAL LFC is down tomorrow, so we'd
>>>> like to get one test done today.
>>>>
>>>> --
>>>> Dr Graeme Stewart http://www.physics.gla.ac.uk/~graeme/
>>>> Department of Physics and Astronomy, University of Glasgow, Scotland
>>>> DEATH TO MEETINGS!
>>>>
>>
>>
>>
>