Gustav Wikström wrote:
> Hi Chris,
>
> that explains the cancelled jobs I see (it would be nice to have
> some sort of notice in these cases),
I marked QMUL as down in the GOCDB. I also told Ben. The downtime is in
the GOCDB, but I didn't get a broadcast of it (nor did Duncan), so
something seems to have gone wrong in there somewhere.
> but the failures must then
> have come before you turned it off, so it should be unrelated to that.
>
> Do you have anything to add on what Sam wrote:
>
> "It is likely that the issue is not with se03, but with QMUL WNs <->
> the RAL and Imperial Top-level BDIIs." ?
>
Another t2k.org user reported similar problems.
Can you give the times that the jobs failed, and also start using the
-v (or --verbose) option to help with debugging?
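(For anyone scripting the transfers: a minimal sketch of wrapping each command so its UTC start time and verbose output are recorded, which makes it easy to report exact failure times as asked above. The wrapper itself is illustrative, not part of lcg-utils; the commented call shows where the real lcg-cr invocation with -v would go.)

```python
import subprocess
from datetime import datetime, timezone

def run_logged(cmd):
    """Run a command, capture combined stdout/stderr, and stamp it with
    the UTC start time so failures can be reported with exact times."""
    started = datetime.now(timezone.utc).isoformat()
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return started, proc.returncode, proc.stdout

# In the real job this would be the lcg-cr command with -v added, e.g.:
# ts, rc, out = run_logged(["lcg-cr", "-v", "-d", "srm://se03.esc.qmul.ac.uk/...",
#                           "-l", "lfn:/grid/t2k.org/...", "file.root"])
```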
Lancaster are reporting similar problems for ATLAS - so my suspicion is
a problem with the RAL BDII. We should fail over to Imperial, so I
suspect the RAL BDII is half working (or our publishing is half working).
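(Background on the failover: lcg-utils reads its list of top-level BDIIs from the LCG_GFAL_INFOSYS environment variable, a comma-separated list of host:port entries tried in order. A minimal sketch of how such a list splits into candidates, assuming the usual BDII port 2170 as the default; the hostnames in the commented example are illustrative, not necessarily the exact RAL/Imperial aliases.)

```python
def bdii_candidates(infosys):
    """Split an LCG_GFAL_INFOSYS-style value ("host:port,host:port,...")
    into (host, port) pairs, defaulting to the usual BDII port 2170."""
    pairs = []
    for entry in infosys.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        pairs.append((host, int(port) if port else 2170))
    return pairs

# e.g. a RAL-first, Imperial-fallback setting (hostnames illustrative):
# bdii_candidates("lcgbdii.gridpp.rl.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170")
```

If the first BDII is only half working, queries that reach it but time out or return partial data can still fail before the client ever tries the second entry, which would match the symptoms above.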
Chris
> Cheers,
> Gustav
>
> 2011/1/17 Christopher J.Walker <[log in to unmask]>:
>> Gustav Wikström wrote:
>>> Hi Chris,
>>>
>>> OK, but shouldn't the wms take care of that, and not send jobs to QMUL then?
>>>
>> It should and in fact I killed the few remaining jobs before shutting
>> things down.
>>
>> If you are experiencing problems at the moment, then something is wrong
>> with the jobs. I've stopped the queues, so no jobs are running at the
>> moment.
>>
>> Assuming the jobs specify their data requirements, then they should not
>> be trying to pull data from QMUL.
>>
>> Chris
>>
>>> Cheers,
>>> Gustav
>>>
>>> 2011/1/17 Christopher J.Walker <[log in to unmask]>:
>>>> Sam Skipsey wrote:
>>>>> Ah, this issue.
>>>>>
>>>>> 2011/1/17 Gustav Wikström <[log in to unmask]>:
>>>>>> Hi experts,
>>>>>>
>>>>>> I'm having big trouble with my grid jobs running on qmul. The jobs
>>>>>> seem to run ok but in the end lcg-cr fails:
>>>> QMUL is (or at least should be) in downtime for a power outage tomorrow
>>>> morning. I've turned the SE off. That would explain any problems now,
>>>> but not any before this morning.
>>>>
>>>>
>>>> Scheduled to be back Wednesday evening - but will probably be back before.
>>>>
>>>> Chris
>>>>
>>>>>> lcg-cr -d srm://se03.esc.qmul.ac.uk//t2k.org/nd280/v8r5p11/unpk/ND280/ND280/00005000_00005999//oa_nd_spl_00005007-0003_ot3a2qrmcuec_unpk_000_v8r5p11.root
>>>>>> -l lfn:/grid/t2k.org/nd280/v8r5p11/unpk/ND280/ND280/00005000_00005999/oa_nd_spl_00005007-0003_ot3a2qrmcuec_unpk_000_v8r5p11.root
>>>>>> oa_nd_spl_00005007-0003_ot3a2qrmcuec_unpk_000_v8r5p11.root
>>>>>>
>>>>>> ['srm://se03.esc.qmul.ac.uk//t2k.org/nd280/v8r5p11/unpk/ND280/ND280/00005000_00005999//oa_nd_spl_00005007-0003_ot3a2qrmcuec_unpk_000_v8r5p11.root:
>>>>>> Invalid argument\n', 'lcg_cr: Invalid argument\n']
>>>>>>
>>>>> This error (which is horribly non-specific) is an issue with
>>>>> communication with the BDII used to get information about the source
>>>>> and destination systems.
>>>>> It is likely that the issue is not with se03, but with QMUL WNs <->
>>>>> the RAL and Imperial Top-level BDIIs.
>>>>>
>>>>> I'll let Chris comment on what end the problem is at...
>>>>>
>>>>> Sam
>>>>>
>>>>>> A few files end up on se03, so not all lcg-cr calls fail, but the vast majority do.
>>>>>> The jobs that end up on RAL are copied without problems to
>>>>>> srm-t2k.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod.
>>>>>>
>>>>>> Is it just se03.esc.qmul.ac.uk being flaky or is lcg-cr not to be run on se03?
>>>>>>
>>>>>> Any help appreciated!
>>>>>> Cheers,
>>>>>> Gustav
>>>>>>