Hi Jiri,
I'm not running 2.6 on production boxes because my dcache nodes are also
worker nodes and I can't upgrade. Perhaps Mona and Kostas have more
information as I believe IC pools are kernel 2.6.
cheers
alessandra
On Wed, 27 Jul 2005, Jiri Mencak wrote:
> Hi Alessandra,
>
> upgrade from 2.4.21-27.0.4.EL to 2.4.21-32.0.1.EL, should have done
> that ages ago, I know, but they are not production boxes.
>
> Running 2.6 for more than a year on my desktop. Tempted to ask about
> your experience running pristine 2.6 kernels on production boxes
> (if any).
>
> Regards.
>
> --
> Jiri
>
> Words written by [log in to unmask] on 27 Jul 2005 at 15:07:08 +0100 prompted:
>> Hi Jiri,
>>
>> just out of curiosity what kernel did you put on the machines?
>>
>> thanks
>>
>> cheers
>> alessandra
>>
>> On Tue, 26 Jul 2005, Jiri Mencak wrote:
>>
>>> Hi all,
>>>
>>> sorry for replying to my own email, but I thought I'd preserve the
>>> ``thread''.
>>>
>>> After giving the dual-homed boxes some time to rest and discussing
>>> this with dCache developers, I've given the 3rd party copies (_to_
>>> dual-homed boxes) another chance. The weird thing is that they started
>>> to work! I admit to having rebooted the boxes for kernel upgrade and
>>> therefore restarted dCache, so that might have helped. I spent some
>>> time trying to replicate the problem, with no luck unfortunately.
>>> Dual-homed dCache (as described below) just works for me now.
>>>
>>> Thanks and regards.
>>>
>>> --
>>> Jiri
>>>
>>> Words written by `Jiri Mencak' on 19 Jul 2005 at 13:53:52 +0100 prompted:
>>>> Dear all,
>>>>
>>>> I've played a little bit with dual-homed machines and dCache with mixed
>>>> success. Nevertheless, I think it is worth reporting and I'm looking
>>>> forward to your feedback.
>>>>
>>>> Architecture
>>>> ~~~~~~~~~~~~
>>>> Pentium III 600
>>>>
>>>> OS
>>>> ~~
>>>> Scientific Linux SL Release 3.0.4 (SL)
>>>>
>>>> dCache
>>>> ~~~~~~
>>>> d-cache-client-1.0-100
>>>> d-cache-core-1.5.2-83
>>>> d-cache-lcg-5.0.0-1
>>>> d-cache-opt-1.5.3-84
>>>> (d-cache-gpp-v1.2.1-1)
>>>>
>>>> I have done a simplified dCache installation using the GridPP storage
>>>> dependency RPMs (no BDII etc.) to speed things up; an LCG yaim 2.5.0
>>>> installation should work equally well.
>>>>
>>>> Scenario
>>>> ~~~~~~~~
>>>> Admin node: dual-homed box with a /pool on the same box
>>>>   (I know, 3 dual-homed boxes would be better with no pool on the
>>>>   admin node, but this should do as a proof of concept)
>>>> Pool node: dual-homed box with a /pool
>>>>
>>>> Public Interfaces: E0a (admin.public.ac.uk), E0p (pool.public.ac.uk)
>>>> Private Interfaces: E1a (192.168.0.32), E1p (192.168.0.33)
>>>>
>>>>
>>>>                 E0a --------------- E1a
>>>>                     |             |
>>>>                 +---|    admin    |---+
>>>>                 |   |    /pool    |   |
>>>>                 |   ---------------   |
>>>>                 |                     |
>>>>                 |                     |   Private Net
>>>> Public Net  ----+-----            ----+-----
>>>> ------------| switch |            | switch |
>>>>             ----+-----            ----+-----
>>>>              |  |                     |  |
>>>>              |  |  ----------------   |  |
>>>>              |  |  |              |   |  |
>>>>              |  +--|     pool     |---+  |
>>>>              |     |     /pool    |      |
>>>>              | E0p ---------------- E1p  |
>>>>              |                           |
>>>>              | ........................  |
>>>>              |                           |
>>>>              |    O T H E R  P O O Ls    |
>>>>
>>>>
>>>> Installation
>>>> ~~~~~~~~~~~~
>>>> 1) Installed SL 3.0.4 and grid certificates
>>>> 2) Made sure `hostname` returns the FQDN associated
>>>>    with E0a and E0p, in other words, the public FQDN.
>>>> 3) To make internal dCache communication pass through private
>>>>    interfaces I've set up an internal DNS server to fool admin and
>>>>    pool nodes into thinking admin.public.ac.uk is 192.168.0.32 and
>>>>    pool.public.ac.uk is 192.168.0.33.
>>>> 4) Made sure
>>>>    `hostname -d` = `grep ^search /etc/resolv.conf | awk '{print $2}'`
>>>> 5) Set up site-info.def:
>>>>    MY_DOMAIN=`hostname -d`
>>>>    DCACHE_ADMIN=<E1a private FQDN>
>>>>    DCACHE_POOLS="`hostname -f`:2:/pool"
>>>> 6) Installed dCache using GridPP storage dependency RPMs.
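
The checks in steps 3 and 4 could be scripted roughly as below. This is only a sketch: the function names are made up for illustration, and the 192.168.0.x test simply mirrors the private addresses used in this scenario, so it would need adjusting for any other private range.

```shell
#!/bin/sh
# Hypothetical helpers, not part of dCache or yaim.

# Print the first domain on the "search" line of a resolv.conf-style file.
search_domain() {
    awk '$1 == "search" { print $2; exit }' "$1"
}

# Succeed only if domain $1 equals the first search domain in file $2
# (the comparison made by hand in step 4).
domain_matches_search() {
    [ -n "$1" ] && [ "$1" = "$(search_domain "$2")" ]
}

# Succeed only if address $1 is in the 192.168.0.x range assumed here
# (what the internal DNS of step 3 should hand out).
is_private_addr() {
    case "$1" in
        192.168.0.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use on an admin or pool node:
#   domain_matches_search "$(hostname -d)" /etc/resolv.conf && echo "step 4 OK"
#   is_private_addr "$(getent hosts admin.public.ac.uk | awk '{print $1}')" \
#       && echo "step 3 OK"
```

Keeping the file name and the address as arguments means the same helpers can be tried against a copy of resolv.conf before touching a live node.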
>>>>
>>>>
>>>> Testing
>>>> ~~~~~~~
>>>> globus-url-copy and dCache SRM copy worked fine, including third party
>>>> copying (get) _from_ dual-homed boxes. Unfortunately, third party
>>>> copying (put) _to_ dual-homed boxes fails. Relevant dCache log
>>>> snippets attached.
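
A quick way to see whether a given log contains the failure reported here is to count the lines that mention the exception from the attached snippets. A minimal sketch, assuming the log file path is passed in by hand (no log file location is given in this thread) and using a made-up function name:

```shell
#!/bin/sh
# Hypothetical helper: count the lines of log file $1 that mention the
# exception seen in the failed third party put.
count_no_route() {
    # grep -c still prints 0 when nothing matches but exits non-zero;
    # "|| true" treats an empty result as success.
    grep -c 'java.net.NoRouteToHostException' "$1" || true
}

# Possible use (the path is a placeholder):
#   count_no_route /path/to/srm-domain.log
```

A count of 0 after a retried transfer would match the later observation that the copies started to work.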
>>>>
>>>>
>>>> Tier 2 dual-homing requirements
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> It would be nice to hear what the architectural requirements of
>>>> Tier 2 sites are with regard to dual-homing. I was working
>>>> under the assumption that the purpose of dual-homed machines was
>>>> to increase network throughput on the public interface by passing
>>>> internal dCache communication through the private interface and to
>>>> shield dCache from the outside world and expose only SRM and GridFTP
>>>> on the public interface.
>>>>
>>>> I suspect there will be other/different requirements with regard
>>>> to the dual-homed architecture so it would be nice to hear them.
>>>> Owen tells me that if you need dual-homing, your setup will almost
>>>> certainly be Lightpath on the public interface, and university network
>>>> on the private interface.
>>>>
>>>> I'm now partly leaving dCache support and moving on to another
>>>> project, so I cannot guarantee I'll be working on dual-homing in the
>>>> future.
>>>>
>>>> Regards.
>>>>
>>>> --
>>>> Jiri
>>>
>>>> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : Failed :
>>>> CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
>>>> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) :
>>>> CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile))
>>>> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at
>>>> diskCacheV111.cells.PnfsManager2.getStorageInfo(PnfsManager2.java:950)
>>>> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at
>>>> diskCacheV111.cells.PnfsManager2.processPnfsMessage(PnfsManager2.java:1597)
>>>> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at
>>>> diskCacheV111.cells.PnfsManager2$ProcessThread.run(PnfsManager2.java:1518)
>>>> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at
>>>> java.lang.Thread.run(Thread.java:534)
>>>> 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : Error obtaining 'l' flag
>>>> for getSimulatedFilesize : java.io.FileNotFoundException:
>>>> /pnfs/fs/.(puse)(000100000000000000001120)(2) (Is a directory)
>>>> 07/19 10:19:12 Cell(PnfsManager@pnfsDomain) : Error obtaining 'l' flag
>>>> for getSimulatedFilesize : java.io.FileNotFoundException:
>>>> /pnfs/fs/.(puse)(000100000000000000001120)(2) (Is a directory)
>>>
>>>> 07/19 10:16:18 Cell(SRM@srmDomain) : Request id=-2147483523: copy request
>>>> state changed to Done
>>>> 07/19 10:16:18 Cell(SRM@srmDomain) : Request id=-2147483523: changing
>>>> fr#-2147483522 to Done
>>>> 07/19 10:18:35 Cell(SRM@srmDomain) : CopyRequest reqId #
>>>> -2147483521Request.createCopyRequest : created new request succesfully
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : remoing TransferInfo for
>>>> callerId=20000
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) :
>>>> org.dcache.srm.scheduler.NonFatalJobFailure:
>>>> CacheException(rc=666;msg=tranfer failed
>>>> :org.globus.ftp.exception.ServerException: Server refused performing the
>>>> request. Custom message: Server reported transfer failure (error code 1)
>>>> [Nested exception message: Custom message: Unexpected reply: 426
>>>> Transfer aborted, closing connection :java.net.NoRouteToHostException: No
>>>> route to host] [Nested exception is
>>>> org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message:
>>>> Unexpected reply: 426 Transfer aborted, closing connection
>>>> :java.net.NoRouteToHostException: No route to host])
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.request.CopyFileRequest.runRemoteToLocalCopy(CopyFileRequest.java:666)
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:770)
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1121)
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : at
>>>> EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java)
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : at
>>>> java.lang.Thread.run(Thread.java:534)
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : CopyFileRequest #-2147483520: copy
>>>> failed
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) :
>>>> org.dcache.srm.scheduler.NonFatalJobFailure:
>>>> org.dcache.srm.scheduler.NonFatalJobFailure:
>>>> CacheException(rc=666;msg=tranfer failed
>>>> :org.globus.ftp.exception.ServerException: Server refused performing the
>>>> request. Custom message: Server reported transfer failure (error code 1)
>>>> [Nested exception message: Custom message: Unexpected reply: 426
>>>> Transfer aborted, closing connection :java.net.NoRouteToHostException: No
>>>> route to host] [Nested exception is
>>>> org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message:
>>>> Unexpected reply: 426 Transfer aborted, closing connection
>>>> :java.net.NoRouteToHostException: No route to host])
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:798)
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1121)
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : at
>>>> EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java)
>>>> 07/19 10:20:08 Cell(SRM@srmDomain) : at
>>>> java.lang.Thread.run(Thread.java:534)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId #
>>>> -2147483521copyRequest getter_putter is non null, stopping
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId #
>>>> -2147483521changing fr#-2147483520 to Failed
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521error
>>>> :
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) :
>>>> org.dcache.srm.scheduler.IllegalStateTransition: g illegal state
>>>> transition from Canceled to Failed
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.scheduler.Job.setState(Job.java:532)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.scheduler.Job.setState(Job.java:417)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.request.CopyRequest.stateChanged(CopyRequest.java:952)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.scheduler.Job.setState(Job.java:566)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.scheduler.Job.setState(Job.java:417)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.request.Request.getRequestStatus(Request.java:521)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> org.dcache.srm.SRM.getRequestStatus(SRM.java:868)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> diskCacheV111.srm.server.SRMServerV1.getRequestStatus(SRMServerV1.java:360)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> java.lang.reflect.Method.invoke(Method.java:324)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.util.reflect.Invocation.execute(Unknown Source)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.util.reflect.Invocation.invoke(Unknown Source)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.service.object.ObjectService.invoke(Unknown Source)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.net.soap.SOAPMessage.invoke(SOAPMessage.java:534)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.net.soap.SOAPMessage.invoke(SOAPMessage.java:508)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.net.soap.http.SOAPHTTPHandler.service(SOAPHTTPHandler.java:88)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.server.http.ServletServer.service(Unknown Source)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.net.servlet.Config.service(Unknown Source)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.net.http.HTTPContext.service(HTTPContext.java:84)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.net.servlet.ServletContainer.service(Unknown Source)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.net.http.WebServer.service(WebServer.java:87)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.net.socket.SocketServer.run(Unknown Source)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.net.socket.SocketRequest.run(Unknown Source)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> electric.util.thread.ThreadPool.run(Unknown Source)
>>>> 07/19 10:20:36 Cell(SRM@srmDomain) : at
>>>> java.lang.Thread.run(Thread.java:534)
>>>
>>
>> --
>> ********************************************
>> * Dr Alessandra Forti *
>> * Technical Coordinator - NorthGrid Tier2 *
>> * http://www.hep.man.ac.uk/u/aforti *
>> ********************************************
>
> --
> Jiri
>
--
********************************************
* Dr Alessandra Forti *
* Technical Coordinator - NorthGrid Tier2 *
* http://www.hep.man.ac.uk/u/aforti *
********************************************