> This is a really hard (brittle) system: it lets the RB take the file
> catalogs' word for it about what files are available without having to do
> extra work, but means when they get out of sync with reality, the jobs
> start dying.
>
> So to turn it into a soft (tough) system, the RB would need to issue,
> what, an srmLs against the SURL it's found from the catalog and check
> the SRM is still happy about that file?
I agree with what you are saying about the current systems being brittle.
The PhEDEx system from CMS effectively does a directory listing (not srmLs
yet) on the SRM once it has transferred a batch of files, but I do not
think this is done to check that files are OK before a job is sent.
The problem with dCache (and DPM) is that, since they keep a namespace
separate from the disk pools that actually hold the data, an srmLs would
report that all files were available even if one of the pools had gone
offline. They will not know that the files are unavailable until they
actually try to access them. This is a limitation (or perhaps a feature)
of these implementations; as I understand it, something like your
slashgrid SRM (or xrootd-SRM) would not have this problem.
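To make the namespace point concrete, here is a toy model. It is a hypothetical sketch, not the dCache or DPM API: `NamespaceSE`, `ls` and `can_read` are invented names. The namespace records which pool holds each file independently of whether that pool is up, so a listing succeeds while a real read would fail.

```python
# Toy model (hypothetical names, not a real SRM/dCache/DPM API): the namespace
# maps each path to a disk pool, and pool availability is tracked separately.
from dataclasses import dataclass, field


@dataclass
class NamespaceSE:
    namespace: dict[str, str] = field(default_factory=dict)  # path -> pool name
    online_pools: set[str] = field(default_factory=set)

    def ls(self, path: str) -> bool:
        """Namespace lookup only: succeeds whenever the entry exists,
        regardless of whether its pool is currently online."""
        return path in self.namespace

    def can_read(self, path: str) -> bool:
        """Only an actual access attempt reveals whether the pool is up."""
        pool = self.namespace.get(path)
        return pool is not None and pool in self.online_pools


se = NamespaceSE(
    namespace={"/data/run1.root": "pool-a", "/data/run2.root": "pool-b"},
    online_pools={"pool-a"},  # pool-b has gone offline
)
print(se.ls("/data/run2.root"), se.can_read("/data/run2.root"))  # True False
```

A check that only consults the namespace therefore cannot catch an offline pool. A system without a separate namespace would effectively have `ls` and `can_read` agree.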
Maybe once srmLs is fully supported by all SRMs then the job matching
that is done by the RB will start to use the dynamic information from the
SRMs themselves rather than the file catalogs. I would imagine that this
will be a long way off though.
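Roughly, matching on dynamic SRM information could mean keeping the catalog as a first filter and then confirming each file with the storage itself. This is a minimal sketch under stated assumptions. `match_sites`, the site names and the `is_available` callback are all hypothetical, and as noted above, the callback would have to reflect real access rather than a namespace-only srmLs.

```python
# Hypothetical sketch of job matching: the catalog narrows the candidate
# sites, then a live per-site status check confirms each input file.
from typing import Callable, Iterable


def match_sites(
    files: Iterable[str],
    catalog: dict[str, set[str]],              # file -> sites per the catalog
    is_available: Callable[[str, str], bool],  # (site, file) -> live status
) -> set[str]:
    """Return sites that hold every input file according to the catalog
    AND that currently confirm each of those files is available."""
    files = list(files)
    if not files:
        return set()
    # Static step: sites the catalog claims hold all of the job's inputs.
    candidates = set.intersection(*(catalog.get(f, set()) for f in files))
    # Dynamic step: drop any site that cannot confirm every file right now.
    return {s for s in candidates if all(is_available(s, f) for f in files)}


catalog = {"lfn:run1": {"site-a", "site-b"},
           "lfn:run2": {"site-a", "site-b", "site-c"}}
# Live status: site-b's copy of run1 sits on a pool that has gone offline.
live = {("site-a", "lfn:run1"), ("site-a", "lfn:run2"), ("site-b", "lfn:run2")}
print(match_sites(["lfn:run1", "lfn:run2"], catalog,
                  lambda s, f: (s, f) in live))  # {'site-a'}
```

The catalog alone would have sent the job to site-b as well, where it would die on the missing file.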
Cheers,
Greig
>
> > To compensate against this I
> > would say that you need some sort of inbuilt storage resiliency. This may
> > be through using a RAID 5 with hot spares on your set of disk servers, or
> > having some system in place which spreads file replicas across the disks
> > on your WNs.
>
> I think this is needed for performance reasons, but I hope we can get
> away from systems needing it because they're brittle (i.e. don't have a
> way of checking current status before doing something) over time.
>
> Cheers,
>
> Andrew
>
> -------------------------------------------------------------------
> Dr Andrew McNab [log in to unmask] +44-(0)161-275-4227
> Co-ordinator of Security Middleware Groups, GridPP & Manchester HEP
> GridSite: www.gridsite.org Personal stuff: www.gridlock.org.uk
>
--
=======================================================================
Dr Greig A Cowan http://www.ph.ed.ac.uk/~gcowan1
School of Physics, University of Edinburgh, James Clerk Maxwell Building
TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
=======================================================================