Hi Wahid,
you are right, ATLAS jobs do rfcp /dpm/... for the input. This is how
they failed this morning in Manchester:
Get error: rfcp failed: 512,
/dpm/tier2.hep.manchester.ac.uk/home/atlas/atlashotdisk/ddo/DBRelease/v070501/ddo.000001.Atlas.Ideal.DBRelease.v070501/DBRelease-7.5.1.tar.gz
: Input/output error (error 5 on se02.tier2.hep.manchester.ac.uk)
se02 is the pool with I/O problems, and there are three other copies of
that file on DPM.
So the failover problem still stands, and it seems there is no retry on
the experiment side. Otherwise the other copies would have been accessed
and we wouldn't have had these failures.
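To make clear what kind of retry I mean, here is a minimal sketch in Python. It is only an illustration, not what the ATLAS job wrapper actually does: the function names copy_with_failover and rfcp_copy are made up, and whether "sources" holds the same /dpm/... path several times or a list of explicit replica paths is left open, since that is exactly what we are discussing. The only real command assumed is the plain "rfcp <source> <destination>" form used above.

```python
import subprocess


def rfcp_copy(src, dest):
    # Plain "rfcp <source> <destination>", as used in this thread.
    # No extra flags are assumed; a non-zero exit raises CalledProcessError.
    subprocess.run(["rfcp", src, dest], check=True)


def copy_with_failover(sources, dest, copy=None, attempts_per_source=1):
    """Try each candidate source in turn until one copy succeeds.

    Returns the source that worked. Raises RuntimeError listing every
    failure if none of the candidates could be copied.
    """
    if copy is None:
        copy = rfcp_copy
    errors = []
    for src in sources:
        for _ in range(attempts_per_source):
            try:
                copy(src, dest)
                return src
            except (OSError, subprocess.CalledProcessError) as exc:
                # Remember the failure and move on to the next attempt.
                errors.append((src, str(exc)))
    raise RuntimeError("all copies failed: %r" % (errors,))
```

With something like this, one bad pool costs an extra attempt rather than a failed job, as long as at least one of the candidates is readable.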
==========
A second failure mode is put failures with lcg-cr, but I assume those
happened when the pool started to have problems, because the last
failures were at 4 am this morning. I hope DPM checks whether a
filesystem is writable before choosing it.
cheers
alessandra
Wahid Bhimji wrote:
> Hi
>> With rfcp you access the replica directly; you don't leave any choice
>> to DPM.
> Not if you do rfcp /dpm/ecdf.ed.ac.uk/home/atlas/atlasscratchdisk/bob
> ./bill7 locally as I did. Then it picks one or other.
>>
>> > So I think that the static choice is either an "urban myth" or the
>> > jobs in question are doing something "other"
>>
>> The first time I noticed, or heard that anybody else had noticed,
>> similar behaviour was today. Not a very long-lived myth.
>>
> good - I like to bust myths as soon as I hear them getting started ;-)
>
>> > though most VO software would I think give it another shot or 2?
>>
>> no they don't. The job fails if the file is not returned.
> "FileStager" definitely gives it another go (at least it did when it
> couldn't rfcp files here for a different reason)
>
> Maybe the conditions copy doesn't though (which is a shame since that
> is the stuff most likely to be replicated)
>
> Wahid
>> cheers
>> alessandra
>>
>
>
--
The most effective way to do it, is to do it. (Amelia Earhart)
Northgrid Tier2 Technical Coordinator
http://www.hep.manchester.ac.uk/computing/tier2