Hi all,
sorry for replying to my own email, but I thought I'd preserve the
``thread''.
After giving the dual-homed boxes some time to rest and discussing
this with dCache developers, I've given the 3rd party copies (_to_
dual-homed boxes) another chance. The weird thing is that they started
to work! I admit to having rebooted the boxes for a kernel upgrade and
therefore restarted dCache, so that might have helped. I spent some
time trying to replicate the problem, with no luck unfortunately.
Dual-homed dCache (as described below) just works for me now.
Thanks and regards.
--
Jiri
Words written by `Jiri Mencak' on 19 Jul 2005 at 13:53:52 +0100 prompted:
> Dear all,
>
> I've played a little bit with dual-homed machines and dCache with mixed
> success. Nevertheless, I think it is worth reporting and I'm looking
> forward to your feedback.
>
> Architecture
> ~~~~~~~~~~~~
> Pentium III 600
>
> OS
> ~~
> Scientific Linux SL Release 3.0.4 (SL)
>
> dCache
> ~~~~~~
> d-cache-client-1.0-100
> d-cache-core-1.5.2-83
> d-cache-lcg-5.0.0-1
> d-cache-opt-1.5.3-84
> (d-cache-gpp-v1.2.1-1)
>
> I have done a simplified dCache installation using the GridPP storage
> dependency RPMs (no BDII etc.) to speed things up; an LCG yaim 2.5.0
> installation should work equally well.
>
> Scenario
> ~~~~~~~~
> Admin node: dual-homed box with a /pool on the same box
> (I know, 3 dual-homed boxes would be better with no pool on the admin node,
> but this should do as a proof of concept)
> Pool node: dual-homed box with a /pool
>
> Public Interfaces: E0a (admin.public.ac.uk), E0p (pool.public.ac.uk)
> Private Interfaces: E1a (192.168.0.32), E1p (192.168.0.33)
>
>
>                 E0a ---------------- E1a
>                     |              |
>                 +---|    admin     |---+
>                 |   |    /pool     |   |
>                 |   ----------------   |
>                 |                      |
>                 |                      |     Private Net
> Public Net  ----+-----             ----+-----
> ------------| switch |             | switch |
>             ----+-----             ----+-----
>              |  |                      |  |
>              |  |   ----------------   |  |
>              |  |   |              |   |  |
>              |  +---|     pool     |---+  |
>              |      |    /pool     |      |
>              |  E0p ---------------- E1p  |
>              |                            |
>              |  ........................  |
>              |                            |
>              |    O T H E R  P O O Ls     |
>
>
> Installation
> ~~~~~~~~~~~~
> 1) Installed SL 3.0.4 and grid certificates
> 2) Made sure `hostname` returns the FQDN associated
> with E0a and E0p respectively, in other words, the public FQDN.
> 3) To make internal dCache communication pass through private
> interfaces I've set up an internal DNS server to fool admin and
> pool nodes into thinking admin.public.ac.uk is 192.168.0.32 and
> pool.public.ac.uk is 192.168.0.33.
> 4) Made sure
> `hostname -d` = `grep ^search /etc/resolv.conf | awk '{print $2}'`
> 5) Set up site-info.def:
> MY_DOMAIN=`hostname -d`
> DCACHE_ADMIN=<E1a private FQDN>
> DCACHE_POOLS="`hostname -f`:2:/pool"
> 6) Installed dCache using GridPP storage dependency RPMs.
>
>
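A minimal sketch of how the checks in steps 2 to 4 could be scripted on
each node. The function names are hypothetical helpers, not part of
dCache or yaim, and the script assumes a Linux box where `hostname -f`,
`hostname -d` and `getent` are available. The expected private address
is passed in by hand.

```shell
#!/bin/sh
# Sketch only: helper names below are made up for illustration.

# Print the first domain on the "search" line of a resolv.conf-style file.
search_domain() {
    awk '/^search/ { print $2; exit }' "${1:-/etc/resolv.conf}"
}

# Succeed if both domain strings are non-empty and identical (step 4).
domains_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Print the first address the local resolver returns for a host name.
resolved_ip() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

# Run the checks for this node; $1 is the private address we expect
# the public FQDN to resolve to (step 3).
check_node() {
    expected_ip=$1
    fqdn=$(hostname -f)
    if domains_match "$(hostname -d)" "$(search_domain)"; then
        echo "ok: hostname -d matches the resolv.conf search domain"
    else
        echo "MISMATCH: hostname -d differs from the resolv.conf search domain"
    fi
    if [ "$(resolved_ip "$fqdn")" = "$expected_ip" ]; then
        echo "ok: $fqdn resolves to $expected_ip"
    else
        echo "MISMATCH: $fqdn does not resolve to $expected_ip"
    fi
}
```

On the admin node in the scenario above this would be called as
`check_node 192.168.0.32`, and on the pool node as
`check_node 192.168.0.33`.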
> Testing
> ~~~~~~~
> globus-url-copy and dCache SRM copy worked fine including third party
> copying (get) _from_ dual-homed boxes. Unfortunately, third party
> (put) _to_ dual-homed boxes fails. Relevant dCache log snippets
> attached.
>
>
> Tier 2 dual-homing requirements
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> It would be nice to hear what the architectural requirements of
> Tier 2 sites are with regard to dual-homing. I was working
> under the assumption that the purpose of dual-homed machines was
> to increase network throughput on the public interface by passing
> internal dCache communication through the private interface and to
> shield dCache from the outside world and expose only SRM and GridFTP
> on the public interface.
>
> I suspect there will be other/different requirements with regard
> to the dual-homed architecture so it would be nice to hear them.
> Owen tells me that if you need dual-homing, your setup will almost
> certainly be Lightpath on the public interface, and university network
> on the private interface.
>
> I'm now partly leaving dCache support and moving on to another project, so I
> cannot guarantee I'll be working on dual-homing in the future.
>
> Regards.
>
> --
> Jiri
> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : Failed : CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2.getStorageInfo(PnfsManager2.java:950)
> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2.processPnfsMessage(PnfsManager2.java:1597)
> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2$ProcessThread.run(PnfsManager2.java:1518)
> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at java.lang.Thread.run(Thread.java:534)
> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : Error obtaining 'l' flag for getSimulatedFilesize : java.io.FileNotFoundException: /pnfs/fs/.(puse)(000100000000000000001120)(2) (Is a directory)
> 07/19 10:19:12 Cell(PnfsManager@pnfsDomain) : Error obtaining 'l' flag for getSimulatedFilesize : java.io.FileNotFoundException: /pnfs/fs/.(puse)(000100000000000000001120)(2) (Is a directory)
> 07/19 10:16:18 Cell(SRM@srmDomain) : Request id=-2147483523: copy request state changed to Done
> 07/19 10:16:18 Cell(SRM@srmDomain) : Request id=-2147483523: changing fr#-2147483522 to Done
> 07/19 10:18:35 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521Request.createCopyRequest : created new request succesfully
> 07/19 10:20:08 Cell(SRM@srmDomain) : remoing TransferInfo for callerId=20000
> 07/19 10:20:08 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: CacheException(rc=666;msg=tranfer failed :org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host])
> 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.runRemoteToLocalCopy(CopyFileRequest.java:666)
> 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:770)
> 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1121)
> 07/19 10:20:08 Cell(SRM@srmDomain) : at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java)
> 07/19 10:20:08 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534)
> 07/19 10:20:08 Cell(SRM@srmDomain) : CopyFileRequest #-2147483520: copy failed
> 07/19 10:20:08 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.scheduler.NonFatalJobFailure: CacheException(rc=666;msg=tranfer failed :org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host])
> 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:798)
> 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1121)
> 07/19 10:20:08 Cell(SRM@srmDomain) : at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java)
> 07/19 10:20:08 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534)
> 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521copyRequest getter_putter is non null, stopping
> 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521changing fr#-2147483520 to Failed
> 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521error :
> 07/19 10:20:36 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.IllegalStateTransition: g illegal state transition from Canceled to Failed
> 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:532)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:417)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyRequest.stateChanged(CopyRequest.java:952)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:566)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:417)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.request.Request.getRequestStatus(Request.java:521)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.SRM.getRequestStatus(SRM.java:868)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at diskCacheV111.srm.server.SRMServerV1.getRequestStatus(SRMServerV1.java:360)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at java.lang.reflect.Method.invoke(Method.java:324)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.reflect.Invocation.execute(Unknown Source)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.reflect.Invocation.invoke(Unknown Source)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.service.object.ObjectService.invoke(Unknown Source)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.SOAPMessage.invoke(SOAPMessage.java:534)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.SOAPMessage.invoke(SOAPMessage.java:508)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.http.SOAPHTTPHandler.service(SOAPHTTPHandler.java:88)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.server.http.ServletServer.service(Unknown Source)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.servlet.Config.service(Unknown Source)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.http.HTTPContext.service(HTTPContext.java:84)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.servlet.ServletContainer.service(Unknown Source)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.http.WebServer.service(WebServer.java:87)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.socket.SocketServer.run(Unknown Source)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.socket.SocketRequest.run(Unknown Source)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.thread.ThreadPool.run(Unknown Source)
> 07/19 10:20:36 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534)
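
The stack traces above are long, so here is a small sketch for pulling
just the "No route to host" failures out of a dCache log file. It
assumes the unquoted "MM/DD HH:MM:SS Cell(name@domain) : message" line
layout shown in the snippets. The function name is hypothetical and
the log file path is whatever your installation uses.

```shell
#!/bin/sh
# Sketch only: no_route_events is a made-up helper name.
# Print date, time and cell for every log line that mentions a
# NoRouteToHostException, collapsing duplicates from nested exceptions.
no_route_events() {
    grep 'NoRouteToHostException' "$1" |
        awk '{ print $1, $2, $3 }' |
        sort -u
}
```

Run as `no_route_events <logfile>`. For the failed put above it should
show only the SRM cell at 10:20:08, which points to the transfer itself
rather than the later state-transition error.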