LHC Computer Grid - Rollout
Dimitris Zilaskos said:
> But what about the part that runs on the grid? Or is there an
> "acceptable" failure rate because of the bulk submission that I
> think LHC is currently doing?
Well, it's not so much what failure rate is acceptable, as what rate you
have to accept! For example, have a look at the ATLAS dashboard for the
last 24 hours:
http://dashb-atlas-data-test.cern.ch/dashboard/request.py/site?name=&statsInterval=24
That's actually relatively good, but there are still something like
60,000 transfer errors, i.e. 60k cases where FTS tried to copy a file
from one SE to another and got an error. (As it happens, about half of
those were at RAL, which was hit by the infamous CASTOR "big ID" bug
last night.) In practice you just have to accept a high failure rate and
deal with it. Of course, for user jobs things may be different: users are
probably less tolerant, and we have to see what happens ...
Stephen