On Thu, 6 Jul 2006 09:32:51 +0100
Greig A Cowan <[log in to unmask]> wrote:
> Hi Andrew,
>
> > Does dCache's SRM check that the box hosting the pool is online
> > when the SRM answers a query about one of its files? ie is the
> > issue about not using resilient dCache just that a box/pool
> > could go offline, or that plus the danger that the SRM will be
> > falsely claiming to have files that are now offline?
>
> Like Derek has said already, if dCache can't get a file from an online
> pool, it expects to be able to get it from tape or for an offline pool
> to become available again.
This is my understanding also. I personally don't like the NFS model
of blocking and locking the file system until it comes back online; I
prefer the idea of failing quickly with an error. The problem is that
the streaming POSIX model has no way of presenting a currently
unavailable file system that may come back, and POSIX is a widely used
standard. Because of this, code developed against POSIX I/O does not
check for media going offline. SRMs, though, carry no such history and
share almost no state between client and server, so an SRM could fail
fast rather than block.
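As a sketch of the distinction (hypothetical client code, not the real
SRM interface; the function and status names below are placeholders
loosely modelled on SRM request states):

    # Hypothetical sketch: an SRM-style client can return a status
    # immediately instead of blocking, because client and server share
    # almost no state. Names are placeholders, not the real SRM API.
    from enum import Enum

    class FileStatus(Enum):
        ONLINE = 1       # pool is up, file can be served
        UNAVAILABLE = 2  # pool is down; may come back, may not
        LOST = 3         # no copy on disk or tape

    def request_file(srm, path):
        status = srm.query_status(path)  # single stateless round trip
        if status is FileStatus.ONLINE:
            return srm.prepare_to_get(path)
        # Fail fast: report the problem to the caller (or to a
        # higher-level service) rather than blocking for a retry.
        raise IOError("%s: %s, failing rather than blocking"
                      % (path, status.name))
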
I do have to support blocking in the (gsi)pnfs (and, I assume, rfio)
POSIX layers, as they need to meet the current expectations of a POSIX
layer and behave in the same way as NFS.
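For contrast, a minimal sketch of the POSIX side of the problem: a
conventional read loop has nowhere to express "the medium is offline
but may return", so a hard NFS mount simply blocks inside read() and
portable code has no offline check it could make.

    import os

    def read_posix(path):
        # A conventional POSIX read loop. POSIX defines no errno
        # meaning "medium offline, may return": on a hard NFS mount
        # os.read() simply blocks until the server comes back, and
        # the closest a caller otherwise sees is a terminal EIO.
        fd = os.open(path, os.O_RDONLY)
        try:
            chunks = []
            while True:
                chunk = os.read(fd, 1 << 16)  # may block indefinitely
                if not chunk:
                    return b"".join(chunks)
                chunks.append(chunk)
        finally:
            os.close(fd)
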
> > Clearly, having a few percent less storage online than you have
> > in the racks isn't the end of the world (even if there are files
> > on them), but _reporting_ that you have files on those inaccessible
> > disks leads to job failures.
>
> The current computing models are such that jobs are sent to the sites
> where the data for that job resides. The location of this data is held
> within the file catalogs. If files become unavailable on your SRM due
> to some component failure then the file catalogs are not updated, so
> jobs that refer to that file will still be sent to your site and will
> most likely fail when they cannot access the data. To compensate
> for this, I would say that you need some sort of in-built storage
> resiliency. This may be through using RAID 5 with hot spares on your
> set of disk servers, or having some system in place which spreads
> file replicas across the disks on your WNs.
>
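(As an aside, a minimal sketch of the replica-spreading idea; the node
names and placement policy below are my own placeholders, not any
particular middleware.)

    # Illustrative sketch of spreading file replicas across WN disks:
    # every file gets k copies on k distinct nodes, so losing one disk
    # still leaves k-1 live replicas elsewhere.
    import hashlib

    NODES = ["wn%03d" % i for i in range(40)]  # hypothetical WNs
    REPLICAS = 2

    def place_replicas(filename, nodes=NODES, k=REPLICAS):
        # Deterministic placement: hash the name to a starting node,
        # then take the next k nodes around the ring.
        h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
        start = h % len(nodes)
        return [nodes[(start + i) % len(nodes)] for i in range(k)]

    print(place_replicas("/data/example/file001"))
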
> As far as I know there is no middleware in place that would
> automatically copy a missing file back into the local SRM from a
> remote SRM when the absence of that file is detected. Does anyone
> else know?
I do not want circular dependencies between storage middleware
components. I believe you are correct. I personally feel that the SRM
should only report that it can't find a file, and that a higher-level
service such as the replica catalogue or FTS should then be contacted
to transfer the files, all as part of that higher-level service.
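A rough sketch of the division of labour I have in mind (the catalogue
and FTS calls below are placeholders, not real interfaces):

    # The SRM only reports the miss; a separate, higher-level service
    # consults the replica catalogue and asks FTS to re-transfer.
    # All function names here are placeholders.
    def recover_missing_file(sfn, local_srm, catalogue, fts):
        if local_srm.has_file(sfn):
            return  # nothing to do
        # The SRM's only job was to say "not found"; recovery is here.
        replicas = catalogue.list_replicas(sfn)
        sources = [r for r in replicas if r.site != local_srm.site]
        if not sources:
            raise RuntimeError("%s: no remote replica to recover from"
                               % sfn)
        # Hand the copy to FTS; update the catalogue when it lands.
        job = fts.submit(source=sources[0].url,
                         dest=local_srm.url_for(sfn))
        job.wait()
        catalogue.register_replica(sfn, local_srm.site)
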
> I've never run a production batch farm so can't comment on the
> frequency of component failures within a <insert large number here>
> node cluster (maybe someone else can?) but I can imagine the case that
> components would be failing fairly regularly. This might not be
> viewed as too much of a problem, since it is part of the normal
> operation of the compute farm, but if jobs are continuously failing
> due to unavailable storage then it will likely hurt that site's
> recorded availability (the target for which is currently 95%, but
> that is another discussion).
>
> Cheers,
> Greig
Thanks for your pragmatic views, Greig.
Regards
Owen
>
> >
> > Cheers,
> >
> > Andrew
> >
> > -------------------------------------------------------------------
> > Dr Andrew McNab [log in to unmask] +44-(0)161-275-4227
> > Co-ordinator of Security Middleware Groups, GridPP & Manchester HEP
> > GridSite: www.gridsite.org Personal stuff: www.gridlock.org.uk
> >
>
> --
> =====================================================================
> Dr Greig A Cowan                      http://www.ph.ed.ac.uk/~gcowan1
> School of Physics, University of Edinburgh, James Clerk Maxwell Building
>
> TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
> =====================================================================
--
###########################################################
Please note that my email address is now [log in to unmask]
###########################################################