Dear all,
I've played a little bit with dual-homed machines and dCache, with mixed
success. Nevertheless, I think it is worth reporting, and I'm looking
forward to your feedback.
Architecture
~~~~~~~~~~~~
Pentium III 600
OS
~~
Scientific Linux SL Release 3.0.4 (SL)
dCache
~~~~~~
d-cache-client-1.0-100
d-cache-core-1.5.2-83
d-cache-lcg-5.0.0-1
d-cache-opt-1.5.3-84
(d-cache-gpp-v1.2.1-1)
I have done a simplified dCache installation using the GridPP storage
dependency RPMs (no BDII etc.) to speed things up; an LCG yaim 2.5.0
installation should work equally well.
Scenario
~~~~~~~~
Admin node: dual-homed box with a /pool on the same box
(I know, 3 dual-homed boxes would be better with no pool on the admin node,
but this should do as a proof of concept)
Pool node: dual-homed box with a /pool
Public Interfaces: E0a (admin.public.ac.uk), E0p (pool.public.ac.uk)
Private Interfaces: E1a (192.168.0.32), E1p (192.168.0.33)
                E0a --------------- E1a
                    |             |
                +---|    admin    |---+
                |   |    /pool    |   |
                |   ---------------   |
                |                     |
                |                     |   Private Net
Public Net  ----+-----            ----+-----
------------| switch |            | switch |
            ----+-----            ----+-----
             |  |                     |  |
             |  |  ----------------   |  |
             |  |  |              |   |  |
             |  +--|     pool     |---+  |
             |     |    /pool     |      |
             | E0p ---------------- E1p  |
             |                           |
             | ........................  |
             |                           |
             |   O T H E R   P O O Ls    |
Installation
~~~~~~~~~~~~
1) Installed SL 3.0.4 and grid certificates
2) Made sure `hostname` returns the public FQDN, i.e. the name
associated with E0a on the admin node and E0p on the pool node.
3) To make internal dCache communication pass through private
interfaces I've set up an internal DNS server to fool admin and
pool nodes into thinking admin.public.ac.uk is 192.168.0.32 and
pool.public.ac.uk is 192.168.0.33.
4) Made sure
`hostname -d` = `grep ^search /etc/resolv.conf | awk '{print $2}'`
5) Set up site-info.def:
MY_DOMAIN=`hostname -d`
DCACHE_ADMIN=<E1a private FQDN>
DCACHE_POOLS="`hostname -f`:2:/pool"
6) Installed dCache using GridPP storage dependency RPMs.
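The checks in steps 3 and 4 can be sketched as a few shell helpers. This is only an illustration: the function names are hypothetical (they are not part of dCache or yaim), and the private network is assumed to be 192.168.0.0/24 as in the scenario above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of dCache or yaim.

# Print the first domain on the "search" line of a resolv.conf-style file.
search_domain() {
    awk '/^search/ { print $2; exit }' "$1"
}

# Succeed if domain $1 equals the search domain found in file $2 (step 4).
domain_matches() {
    [ -n "$1" ] && [ "$1" = "$(search_domain "$2")" ]
}

# Succeed if address $1 starts with 192.168.0., the private net assumed
# here (step 3). This is a simple prefix match, not full IP validation.
is_private_addr() {
    case "$1" in
        192.168.0.*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

On a node, one might then run `domain_matches "$(hostname -d)" /etc/resolv.conf`, and feed the address that the node resolves for its own public FQDN (for example the first field of `getent hosts "$(hostname -f)"`) to `is_private_addr`.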
Testing
~~~~~~~
globus-url-copy and dCache SRM copy worked fine, including third-party
copying (get) _from_ dual-homed boxes. Unfortunately, third-party
copying (put) _to_ dual-homed boxes fails. Relevant dCache log snippets
are attached below.
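When going through logs like the attached ones, a quick tally of the Java exceptions they mention can help to spot the failure cause. A minimal sketch, assuming GNU grep (for `-o`); the function name is hypothetical:

```shell
#!/bin/sh
# Sketch only: count occurrences of each java.* exception class in a log
# file, most frequent first. Requires a grep that supports -o.
exception_summary() {
    grep -o 'java\.[A-Za-z.]*Exception' "$1" | sort | uniq -c | sort -rn
}
```

Run against the SRM log of the failed put, this would surface the `java.net.NoRouteToHostException` entries seen in the attached snippets.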
Tier 2 dual-homing requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It would be nice to hear what the architectural requirements of
Tier 2 sites are with regard to dual-homing. I was working
under the assumption that the purpose of dual-homed machines was
to increase network throughput on the public interface by passing
internal dCache communication through the private interface and to
shield dCache from the outside world and expose only SRM and GridFTP
on the public interface.
I suspect there will be other/different requirements with regard
to the dual-homed architecture so it would be nice to hear them.
Owen tells me that if you need dual-homing, your setup will almost
certainly be Lightpath on the public interface, and university network
on the private interface.
I'm now partly leaving dCache support and moving on to another project,
so I cannot guarantee I'll be working on dual-homing in the future.
Regards.
--
Jiri
07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : Failed : CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2.getStorageInfo(PnfsManager2.java:950)
07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2.processPnfsMessage(PnfsManager2.java:1597)
07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2$ProcessThread.run(PnfsManager2.java:1518)
07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at java.lang.Thread.run(Thread.java:534)
07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : Error obtaining 'l' flag for getSimulatedFilesize : java.io.FileNotFoundException: /pnfs/fs/.(puse)(000100000000000000001120)(2) (Is a directory)
07/19 10:19:12 Cell(PnfsManager@pnfsDomain) : Error obtaining 'l' flag for getSimulatedFilesize : java.io.FileNotFoundException: /pnfs/fs/.(puse)(000100000000000000001120)(2) (Is a directory)
07/19 10:16:18 Cell(SRM@srmDomain) : Request id=-2147483523: copy request state changed to Done
07/19 10:16:18 Cell(SRM@srmDomain) : Request id=-2147483523: changing fr#-2147483522 to Done
07/19 10:18:35 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521Request.createCopyRequest : created new request succesfully
07/19 10:20:08 Cell(SRM@srmDomain) : remoing TransferInfo for callerId=20000
07/19 10:20:08 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: CacheException(rc=666;msg=tranfer failed :org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host])
07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.runRemoteToLocalCopy(CopyFileRequest.java:666)
07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:770)
07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1121)
07/19 10:20:08 Cell(SRM@srmDomain) : at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java)
07/19 10:20:08 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534)
07/19 10:20:08 Cell(SRM@srmDomain) : CopyFileRequest #-2147483520: copy failed
07/19 10:20:08 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.scheduler.NonFatalJobFailure: CacheException(rc=666;msg=tranfer failed :org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host])
07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:798)
07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1121)
07/19 10:20:08 Cell(SRM@srmDomain) : at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java)
07/19 10:20:08 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534)
07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521copyRequest getter_putter is non null, stopping
07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521changing fr#-2147483520 to Failed
07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521error :
07/19 10:20:36 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.IllegalStateTransition: g illegal state transition from Canceled to Failed
07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:532)
07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:417)
07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyRequest.stateChanged(CopyRequest.java:952)
07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:566)
07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:417)
07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.request.Request.getRequestStatus(Request.java:521)
07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.SRM.getRequestStatus(SRM.java:868)
07/19 10:20:36 Cell(SRM@srmDomain) : at diskCacheV111.srm.server.SRMServerV1.getRequestStatus(SRMServerV1.java:360)
07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
07/19 10:20:36 Cell(SRM@srmDomain) : at java.lang.reflect.Method.invoke(Method.java:324)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.reflect.Invocation.execute(Unknown Source)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.reflect.Invocation.invoke(Unknown Source)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.service.object.ObjectService.invoke(Unknown Source)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.SOAPMessage.invoke(SOAPMessage.java:534)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.SOAPMessage.invoke(SOAPMessage.java:508)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.http.SOAPHTTPHandler.service(SOAPHTTPHandler.java:88)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.server.http.ServletServer.service(Unknown Source)
07/19 10:20:36 Cell(SRM@srmDomain) : at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.servlet.Config.service(Unknown Source)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.http.HTTPContext.service(HTTPContext.java:84)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.servlet.ServletContainer.service(Unknown Source)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.http.WebServer.service(WebServer.java:87)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.socket.SocketServer.run(Unknown Source)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.socket.SocketRequest.run(Unknown Source)
07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.thread.ThreadPool.run(Unknown Source)
07/19 10:20:36 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534)