Probably to wiki.
https://www.gridpp.ac.uk/wiki/Storage_Issues
I will add a brief description of the currently least bad solution.
I know it's probably bad style to number explicitly, but how
do you otherwise add text in a numbered list, like the
itemize environment in LaTeX?
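
For what it's worth, a rough sketch of the MediaWiki markup I have in
mind (the "#:" continuation is from memory, so check that it renders
and keeps the numbering):

# First numbered point
#: Extra text attached to point one, without breaking the list.
# Second numbered point; the numbering carries on.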
-j
-----Original Message-----
From: GRIDPP2: Deployment and support of SRM and local storage
management [mailto:[log in to unmask]] On Behalf Of Coles, J (Jeremy)
Sent: 06 July 2006 18:55
To: [log in to unmask]
Subject: Re: Resilient dCache (was Re: Minutes of today's phone conf)
Hi Greig
We should have an area to record observations like this, which we might
want to put forward as future middleware requirements:
> As far as I know there is no middleware in place that would automatically
> copy a missing file back into the local SRM from a remote SRM when the
> absence of that file is detected. Does anyone else know?
To wiki or not to wiki?
Cheers,
Jeremy
> -----Original Message-----
> From: GRIDPP2: Deployment and support of SRM and local storage management
> [mailto:[log in to unmask]] On Behalf Of Greig A Cowan
> Sent: 06 July 2006 09:33
> To: [log in to unmask]
> Subject: Re: Resilient dCache (was Re: Minutes of today's phone conf)
>
> Hi Andrew,
>
> > Does dCache's SRM check that the box hosting the pool is online
> > when the SRM answers a query about one of its files? ie is the
> > issue about not using resilient dCache just that a box/pool
> > could go offline, or that plus the danger that the SRM will be
> > falsely claiming to have files that are now offline?
>
> Like Derek has said already, if dCache can't get a file from an online
> pool, it expects to be able to get it from tape or for an offline pool
> to become available again.
>
> > Clearly, having a few percent less storage online than you have
> > in the racks isn't the end of the world (even if there are files
> > on them), but _reporting_ that you have files on those inaccessible
> > disks leads to job failures.
>
> The current computing models are such that jobs are sent to the sites
> where the data for that job resides. The location of this data is held
> within the file catalogs. If files become unavailable on your SRM due
> to some component failure, the file catalogs are not updated, so jobs
> that refer to those files will still be sent to your site and will most
> likely fail when they cannot access the data. To compensate for this I
> would say that you need some sort of inbuilt storage resiliency. This
> may be through using RAID 5 with hot spares on your set of disk servers,
> or having some system in place which spreads file replicas across the
> disks on your WNs.
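>
> Purely as a sketch of what I mean (the two helper functions below are
> hypothetical placeholders, not any real catalog or SRM API):
>
> def catalogued_surls(site):
>     # Placeholder: ask the file catalog which SURLs it believes are
>     # held at this site (dummy data stands in for the real query).
>     return {"srm://se.example.ac.uk/data/file-a",
>             "srm://se.example.ac.uk/data/file-b"}
>
> def srm_can_serve(surl):
>     # Placeholder: a real check might stat the SURL through the SRM
>     # and treat a pool-offline error as "cannot serve".
>     return not surl.endswith("file-b")
>
> def missing_at_site(site):
>     # Files the catalog advertises here but the SRM cannot deliver.
>     return {s for s in catalogued_surls(site) if not srm_can_serve(s)}
>
> for surl in sorted(missing_at_site("some-site")):
>     print("catalog/SRM mismatch:", surl)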
>
> As far as I know there is no middleware in place that would automatically
> copy a missing file back into the local SRM from a remote SRM when the
> absence of that file is detected. Does anyone else know?
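>
> If such a tool existed, I imagine the repair step would look roughly
> like this. srmcp is the real SRM copy client, but the exact invocation
> here is illustrative and untested:
>
> import subprocess
>
> def restore_from_remote(remote_surl, local_surl):
>     # Hypothetical repair: copy a surviving remote replica back into
>     # the local SRM, then (not shown) re-verify the local copy.
>     subprocess.run(["srmcp", remote_surl, local_surl], check=True)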
>
> I've never run a production batch farm so can't comment on the frequency
> of component failures within a <insert large number here> node cluster
> (maybe someone else can?), but I can imagine that components would be
> failing fairly regularly. This might not be viewed as too much of a
> problem, since it is part of the normal operation of the compute farm,
> but if jobs are continuously failing due to unavailable storage then it
> will likely hit that site's recorded availability (which is currently
> set at 95%, but that is another discussion).
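>
> A back-of-the-envelope illustration (numbers made up): if one pool node
> in twenty is down, roughly 5% of the files the site advertises are
> unreachable, and jobs reading several files each fail far more often
> than that:
>
> pools, pools_down = 20, 1
> files_per_job = 5
>
> frac_offline = pools_down / float(pools)           # 0.05
> p_job_ok = (1 - frac_offline) ** files_per_job     # ~0.77
> print("expected fraction of jobs succeeding:", round(p_job_ok, 2))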
>
> Cheers,
> Greig
>
> >
> > Cheers,
> >
> > Andrew
> >
> > -------------------------------------------------------------------
> > Dr Andrew McNab [log in to unmask] +44-(0)161-275-4227
> > Co-ordinator of Security Middleware Groups, GridPP & Manchester HEP
> > GridSite: www.gridsite.org Personal stuff: www.gridlock.org.uk
> >
>
> --
>
> ========================================================================
> Dr Greig A Cowan                        http://www.ph.ed.ac.uk/~gcowan1
> School of Physics, University of Edinburgh, James Clerk Maxwell Building
>
> TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
> ========================================================================