> Lancaster are reporting similar problems for ATLAS - so my suspicion is
> a problem with the RAL BDII. We should fail over to Imperial, so I
> suspect the RAL BDII is half working (or our publishing is half working).
To add to this: when we added failover to Imperial on our WNs, we saw a
sharp decrease in these cryptic lcg-cr failures. Correlation doesn't
imply causation, however, and I haven't had a chance to look into things.
We may wish to discuss this at tomorrow's UKI meeting.
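
For anyone wanting to check this from a WN, here is a minimal sketch of the
failover idea. It assumes the WNs take their top-level BDII list from the
LCG_GFAL_INFOSYS environment variable as a comma-separated host:port list,
with 2170 as the default BDII port. The function names are made up for
illustration, and a plain TCP connect only shows that the port answers, not
that the BDII is publishing complete information.

```python
import socket

DEFAULT_BDII_PORT = 2170  # usual BDII LDAP port; used when none is given


def parse_infosys(value):
    """Split a comma-separated 'host:port' string into (host, port) tuples."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        endpoints.append((host, int(port) if port else DEFAULT_BDII_PORT))
    return endpoints


def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(endpoints, probe=tcp_probe):
    """Return the first endpoint the probe accepts, or None if all fail."""
    for host, port in endpoints:
        if probe(host, port):
            return host, port
    return None
```

Calling first_reachable(parse_infosys(os.environ.get("LCG_GFAL_INFOSYS", "")))
on a WN (after importing os) would show which BDII in the list answers first.
A "half working" BDII would still pass this check, so it only rules out the
simplest failure.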
Cheers,
Matt
>
>
> Chris
>
>> Cheers,
>> Gustav
>>
>> 2011/1/17 Christopher J.Walker <[log in to unmask]>:
>>> Gustav Wikström wrote:
>>>> Hi Chris,
>>>>
>>>> OK, but shouldn't the WMS take care of that, and not send jobs to QMUL then?
>>>>
>>> It should, and in fact I killed the few remaining jobs before shutting
>>> things down.
>>>
>>> If you are experiencing problems at the moment, then something is wrong
>>> with the jobs. I've stopped the queues, so no jobs are running at the
>>> moment.
>>>
>>> Assuming the jobs specify their data requirements, then they should not
>>> be trying to pull data from QMUL.
>>>
>>> Chris
>>>
>>>> Cheers,
>>>> Gustav
>>>>
>>>> 2011/1/17 Christopher J.Walker <[log in to unmask]>:
>>>>> Sam Skipsey wrote:
>>>>>> Ah, this issue.
>>>>>>
>>>>>> 2011/1/17 Gustav Wikström <[log in to unmask]>:
>>>>>>> Hi experts,
>>>>>>>
>>>>>>> I'm having big trouble with my grid jobs running on QMUL. The jobs
>>>>>>> seem to run OK, but at the end lcg-cr fails:
>>>>> QMUL is (or at least should be) in downtime for a power outage tomorrow
>>>>> morning. I've turned the SE off. That would explain any problems now,
>>>>> but not any before this morning.
>>>>>
>>>>>
>>>>> Scheduled to be back Wednesday evening - but will probably be back before.
>>>>>
>>>>> Chris
>>>>>
>>>>>>> lcg-cr -d srm://se03.esc.qmul.ac.uk//t2k.org/nd280/v8r5p11/unpk/ND280/ND280/00005000_00005999//oa_nd_spl_00005007-0003_ot3a2qrmcuec_unpk_000_v8r5p11.root
>>>>>>> -l lfn:/grid/t2k.org/nd280/v8r5p11/unpk/ND280/ND280/00005000_00005999/oa_nd_spl_00005007-0003_ot3a2qrmcuec_unpk_000_v8r5p11.root
>>>>>>> oa_nd_spl_00005007-0003_ot3a2qrmcuec_unpk_000_v8r5p11.root
>>>>>>>
>>>>>>> ['srm://se03.esc.qmul.ac.uk//t2k.org/nd280/v8r5p11/unpk/ND280/ND280/00005000_00005999//oa_nd_spl_00005007-0003_ot3a2qrmcuec_unpk_000_v8r5p11.root:
>>>>>>> Invalid argument\n', 'lcg_cr: Invalid argument\n']
>>>>>>>
>>>>>> This error (which is horribly non-specific) is an issue with
>>>>>> communication with the BDII used to get information about the source
>>>>>> and destination systems.
>>>>>> It is likely that the issue is not with se03, but with QMUL WNs <->
>>>>>> the RAL and Imperial Top-level BDIIs.
>>>>>>
>>>>>> I'll let Chris comment on which end the problem is at...
>>>>>>
>>>>>> Sam
>>>>>>
>>>>>>> A few files end up on se03, so not all lcg-cr calls fail, but the vast majority do.
>>>>>>> The jobs that end up on RAL are copied without problems to
>>>>>>> srm-t2k.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod.
>>>>>>>
>>>>>>> Is it just se03.esc.qmul.ac.uk being flaky or is lcg-cr not to be run on se03?
>>>>>>>
>>>>>>> Any help appreciated!
>>>>>>> Cheers,
>>>>>>> Gustav
>>>>>>>