Hi
On 9 March 2011 12:14, Santanu Das <[log in to unmask]> wrote:
> Thanks Matt, very useful information!!
>
> I think I see the problem now. From the SE:
>
> [root@serv02 dpm]# traceroute disk09
> traceroute to disk09.hep.phy.cam.ac.uk (131.111.66.177), 30 hops max, 46
> byte packets
> 1 disk09 (131.111.66.177) 1.760 ms !<10> 0.064 ms !<10> 0.066 ms !<10>
>
> The !<10> means ICMP destination unreachable, which is probably because
> traceroute gets confused by the fact that both Ethernet interfaces
> got the same MAC address.
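>
> For the record, the bond state can be inspected directly on disk09
> (this assumes the standard Linux bonding driver with a bond0 device -
> adjust the name if yours differs):
>
> [root@disk09 ~]# cat /proc/net/bonding/bond0   # bonding mode, slave NICs and their MACs
> [root@disk09 ~]# ip -o link show               # one line per interface, with its current MAC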
>
> Just to test, I took the channel bonding out and everything was fine
> again. So is channel bonding not the thing to do with DPM?
>
Channel-bonding is fine with DPM - we have a whole bunch of disk
servers with bonded 1Gig links in one of our server rooms.
On the other hand, it looks like your channel bonding isn't working
correctly, which is probably the problem.
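
To see what the bond is actually doing, something like this usually
tells the story (I'm assuming a RHEL/SL-style setup with a bond0
device - the names are guesses on my part):

  cat /sys/class/net/bond0/bonding/mode   # mode the driver is really running in
  grep -i bond /etc/modprobe.conf         # mode/miimon options passed at module load

Bear in mind that some modes (802.3ad for instance) also need matching
configuration on the switch, so it's worth checking that end too.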
Sam
> Cheers,
> Santanu
>
>
> On 09/03/11 11:58, Matt Doidge wrote:
>
> Heya,
>
> Thanks Matt, that rings a bell: A couple of days ago I implemented channel
> bonding on disk09 - would that be a problem? Does DPM check the Ethernet
> device name, etc.?
>
> I don't think DPM checks stuff like interface name, but it certainly
> could be related to the bonding. Did you update any interface specific
> firewall rules on your pool node (if you have any)? Again, the routing
> might be worth checking between pool and headnode (this could have
> been mucked up by the bonding).
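>
> For example (the headnode address below is just a placeholder):
>
> [root@disk09 ~]# iptables -L -n --line-numbers   # any leftover interface-specific (-i ethX) rules?
> [root@disk09 ~]# ip route get <headnode-ip>      # which route/interface is used towards the headnode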
>
> When we had problems there were clues in the rfio logs. Another place
> to look is shift.conf on headnode and poolnode (had a problem today
> where the shift.conf on a new pool node had been configured
> hostname.internalnetwork, causing some weirdness), and see if the
> hostname on the pool is still correct.
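>
> Something along these lines shows both (the paths are the usual DPM
> defaults - double-check them on your install):
>
> [root@disk09 ~]# grep RFIOD /etc/shift.conf   # TRUST entries should use consistent hostnames
> [root@disk09 ~]# hostname -f                  # should match the name the headnode has registered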
>
> Another thing to look at is the pool certificates (I have another
> anecdote about accidentally installing the wrong host certificate on
> a node, which caused weird behaviour - is there anything I haven't done
> wrong on a pool install?)
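>
> A quick sanity check on the certificate (assuming the standard
> /etc/grid-security location) is:
>
> [root@disk09 ~]# openssl x509 -noout -subject -dates -in /etc/grid-security/hostcert.pem
>
> The subject CN should match the node's hostname and the dates should
> still be in range.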
>
> I apologise if I'm just throwing straws at you to grasp!
> Matt
>
> rfio is running okay on disk09
>
> [root@disk09 dpm_data]# ps -ef | grep rfio | grep -v grep
> root 4501 1 0 10:31 ? 00:00:00 /opt/lcg/bin/rfiod -sl -f
> /var/log/rfio/log
>
> - Santanu
>
> On 09/03/11 11:11, Matt Doidge wrote:
>
> Heya,
> I've had a similar problem before - the problem was I had accidentally
> NATted my new pool nodes - check the routing on the pools (maybe
> traceroute between the pools and headnode).
>
> The other thing I'd check is if rfio is running on the pools, and if
> the rfio port is open to the headnode (5001 I think).
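>
> Something like this checks both ends (5001 being my guess at the rfio
> port, as above):
>
> [root@serv02 ~]# telnet disk09.hep.phy.cam.ac.uk 5001   # from the headnode - should connect if open
> [root@disk09 ~]# netstat -lnp | grep rfiod              # on the pool - is rfiod actually listening?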
>
> Hope that helps!
> Matt
>
> On 9 March 2011 10:55, Santanu Das <[log in to unmask]> wrote:
>
> Hi there,
>
> I'm having a problem with one of our disk-servers - when I try to enable it I
> get this:
>
> [root@serv02 ~]# dpm-modifyfs --server disk09.hep.phy.cam.ac.uk --fs
> /dpm_data --st 0
> dpm-modifyfs disk09.hep.phy.cam.ac.uk /dpm_data: No route to host
>
> The disk/file-system is directly available from the disk09 itself and from
> the SE as well. Any idea what that problem could be?
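>
> (By "available" I mean the filesystem is mounted and readable - checked
> with something like "df -h /dpm_data" on disk09.)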
>
> Thanks,
> Santanu
>
>
>
>
>