Sam's right, bonding works for us (although on our dpm we have
internal, bonded interfaces facing the worker nodes and single,
external NICs used for WAN traffic *and* dpm communication - the
headnode only knows of the external NICs).
Here's how our bonding is implemented:
For the bond itself:
# cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Nic bonding configuration. Updated by PCM/Kusu
DEVICE=bond0
IPADDR=10.41.12.13
NETMASK=255.255.240.0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BONDING_OPTS='mode=balance-xor miimon=100'
For each NIC in the bond:
# cat /etc/sysconfig/network-scripts/ifcfg-eth2
# Nic bonding configuration. Updated by PCM/Kusu
DEVICE=eth2
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes
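A quick way to double-check that every NIC file really points at the
bond is a small shell helper like the one below. This is only a sketch:
bond_slaves is a made-up name, and it assumes the ifcfg values are
unquoted, as in the listings above.

```shell
#!/bin/sh
# Print the DEVICE name of every ifcfg-* file that declares itself a slave
# of the given bond. Function name and defaults are illustrative only.
bond_slaves() {
    bond="${1:-bond0}"
    dir="${2:-/etc/sysconfig/network-scripts}"
    for f in "$dir"/ifcfg-*; do
        # Skip the unexpanded glob when no files match.
        [ -f "$f" ] || continue
        # A slave file needs both MASTER=<bond> and SLAVE=yes.
        if grep -q "^MASTER=$bond\$" "$f" && grep -q '^SLAVE=yes$' "$f"; then
            sed -n 's/^DEVICE=//p' "$f"
        fi
    done
}
```

Running "bond_slaves bond0" should list each NIC you expect in the bond;
a missing name means that file lacks MASTER= or SLAVE=yes.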
We haven't had any luck getting DHCP to work with our bonded
interfaces, but we haven't really tried.
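If a bond doesn't seem to behave, the bonding driver's status file under
/proc/net/bonding/ shows what the kernel thinks of each NIC. A rough
sketch of pulling out the per-slave link state follows;
bond_slave_status is a made-up name, and the field labels ("Slave
Interface", "MII Status") are the ones the Linux bonding driver normally
prints, so check them against your own output.

```shell
#!/bin/sh
# Print one line per slave as "<interface> <MII status>", read from the
# bonding driver's status file. The path can be overridden as argument 1.
bond_slave_status() {
    status_file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        # Remember which slave section we are in.
        /^Slave Interface:/ { slave = $2; next }
        # The first "MII Status" belongs to the bond itself (no slave yet),
        # so only report the ones that follow a "Slave Interface" line.
        /^MII Status:/ && slave != "" { print slave, $2; slave = "" }
    ' "$status_file"
}
```

Any slave that is missing from the output, or reported as down, is worth
looking at before blaming anything higher up the stack.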
Cheers,
Matt
On 9 March 2011 12:18, Sam Skipsey wrote:
> Hi
>
> On 9 March 2011 12:14, Santanu Das wrote:
>> Thanks Matt, very useful information!!
>>
>> I think I see the problem now. From the SE:
>>
>> [root@serv02 dpm]# traceroute disk09
>> traceroute to disk09.hep.phy.cam.ac.uk (131.111.66.177), 30 hops max, 46 byte packets
>> 1 disk09 (131.111.66.177) 1.760 ms !<10> 0.064 ms !<10> 0.066 ms !<10>
>>
>> The !<10> means ICMP destination unreachable, which is probably because
>> traceroute gets confused by the fact that both of the Ethernet interfaces
>> have the same MAC address.
>>
>> Just to test, I came out of the channel bonding and everything was fine
>> again. So, channel-bonding is not the thing to do with DPM?
>>
>
> Channel-bonding is fine with DPM - we have a whole bunch of disk
> servers with bonded 1Gig links in one of our server rooms.
> On the other hand, it looks like your channel bonding isn't working
> correctly, which is probably the problem.
>
> Sam
>
>> Cheers,
>> Santanu
>>
>>
>> On 09/03/11 11:58, Matt Doidge wrote:
>>
>> Heya,
>>
>> Thanks Matt, that rings a bell: A couple of days ago I implemented channel
>> bonding on disk09 - would that be a problem? Does DPM check the Ethernet
>> device name, etc.?
>>
>> I don't think DPM checks stuff like interface name, but it certainly
>> could be related to the bonding. Did you update any interface specific
>> firewall rules on your pool node (if you have any)? Again, the routing
>> might be worth checking between pool and headnode (this could have
>> been mucked up by the bonding).
>>
>> When we had problems there were clues in the rfio logs. Another place
>> to look is shift.conf on the headnode and pool node (I had a problem today
>> where the shift.conf on a new pool node had been configured with
>> hostname.internalnetwork, causing some weirdness). Also check that the
>> hostname on the pool node is still correct.
>>
>> Another thing to look at is the pool certificates (I have another
>> anecdote about accidentally installing the wrong host certificate on
>> a node, which caused weird behaviour - is there anything I haven't done
>> wrong on a pool install?)
>>
>> I apologise if I'm just throwing straws at you to grasp!
>> Matt
>>
>> rfio is running okay on disk09
>>
>> [root@disk09 dpm_data]# ps -ef | grep rfio | grep -v grep
>> root 4501 1 0 10:31 ? 00:00:00 /opt/lcg/bin/rfiod -sl -f /var/log/rfio/log
>>
>> - Santanu
>>
>> On 09/03/11 11:11, Matt Doidge wrote:
>>
>> Heya,
>> I've had a similar problem before - the problem was that I had accidentally
>> NATted my new pool nodes - check the routing on the pools (maybe
>> traceroute between the pools and headnode).
>>
>> The other thing I'd check is if rfio is running on the pools, and if
>> the rfio port is open to the headnode (5001 I think).
>>
>> Hope that helps!
>> Matt
>>
>> On 9 March 2011 10:55, Santanu Das wrote:
>>
>> Hi there,
>>
>> I'm having a problem with one of our disk servers - when I try to enable it I
>> get this:
>>
>> [root@serv02 ~]# dpm-modifyfs --server disk09.hep.phy.cam.ac.uk --fs /dpm_data --st 0
>> dpm-modifyfs disk09.hep.phy.cam.ac.uk /dpm_data: No route to host
>>
>> The disk/file-system is directly available from disk09 itself and from
>> the SE as well. Any idea what the problem could be?
>>
>> Thanks,
>> Santanu
>>
>>
>>
>>
>>
>