Greig A Cowan wrote:
> Hi Andrew,
>
>> Clearly, having a few percent less storage online than you have
>> in the racks isn't the end of the world (even if there are files
>> on them), but _reporting_ that you have files on those inaccessible
>> disks leads to job failures.
>
> The current computing models are such that jobs are sent to the sites
> where the data for that job resides. The location of this data is held
> within the file catalogs.
Sorry, yes, of course.
> If files become unavailable on your SRM due to
> some component failure then the file catalogs are not updated, so jobs
> that refer to that file will still be sent to your site and will most
> likely fail when they cannot access the data.
This is a really hard (brittle) system: it lets the RB take the file
catalogue's word for which files are available without having to do any
extra work, but it means that when the catalogue gets out of sync with
reality, the jobs start dying.
So to turn it into a soft (tough) system, the RB would need to issue,
what, an srmLs against the SURL it's found from the catalog and check
the SRM is still happy about that file?
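That pre-dispatch check could be sketched roughly as below. This is only an illustration, not any real RB code: `srm_ls` here is a stand-in for whatever SRM client binding would actually issue the srmLs call, and the SURLs are made up.

```python
# Hypothetical sketch: before dispatching a job, filter the catalogue's
# replica list down to SURLs the SRM still reports as online.

def srm_ls(surl):
    """Placeholder for a real srmLs call; returns True if the SRM
    still reports the file as online. Here we just fake a dead node."""
    return not surl.startswith("srm://dead-se")

def usable_replicas(catalogue_surls):
    """Keep only the SURLs whose SRM confirms the file is available."""
    return [s for s in catalogue_surls if srm_ls(s)]

replicas = [
    "srm://se.example.org/data/file1",
    "srm://dead-se.example.org/data/file1",
]
# Only sites whose SRM confirms the file would then receive the job.
print(usable_replicas(replicas))
```

The cost, of course, is one extra SRM round trip per candidate replica at scheduling time, which is the "extra work" the current brittle design avoids.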
> To compensate against this I
> would say that you need some sort of inbuilt storage resiliency. This may
> be through using a RAID 5 with hot spares on your set of disk servers, or
> having some system in place which spreads file replicas across the disks
> on your WNs.
I think this is needed for performance reasons, but over time I hope we
can get away from systems that need it because they are brittle (ie they
have no way of checking current status before doing something).
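The replica-spreading idea quoted above could be sketched as a simple deterministic placement scheme: each file lands on two distinct disks chosen by hash, so losing any single disk still leaves one copy reachable. The disk names and replica count below are purely illustrative assumptions, not any existing system's policy.

```python
# Hypothetical sketch: spread file replicas across worker-node disks so
# that a single disk failure never takes the only copy offline.

import hashlib

def place_replicas(filename, disks, copies=2):
    """Pick `copies` distinct disks for a file, deterministically,
    by hashing the filename and taking consecutive slots."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    n = len(disks)
    return [disks[(h + i) % n] for i in range(min(copies, n))]

disks = ["wn01:/data", "wn02:/data", "wn03:/data", "wn04:/data"]
print(place_replicas("/grid/atlas/file1", disks))
```

Deterministic placement like this means any node can recompute where a file's copies live without consulting a central service, which is one reason such schemes are popular despite the status-checking concern above.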
Cheers,
Andrew
-------------------------------------------------------------------
Dr Andrew McNab [log in to unmask] +44-(0)161-275-4227
Co-ordinator of Security Middleware Groups, GridPP & Manchester HEP
GridSite: www.gridsite.org Personal stuff: www.gridlock.org.uk