Hi Govind,
But do you have a fast DNS lookup? That's probably where we were getting
hit by it. If you're using a reliable university DNS server then it's
probably fine, but if you can't, or the DNS is a little slow, then that's
where I think the problems start.
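
One rough way to check this is to time a handful of lookups from the head node itself. Below is a minimal sketch in Python, not anything taken from DPM. The function names, the 0.1 second threshold and the "localhost" example are all assumptions for illustration, so substitute your own head node, pool nodes and a few remote hosts. Note that if nscd (or any other cache) is running, repeated lookups will mostly be measuring the cache.

```python
import socket
import time


def time_lookups(hostnames, resolve=socket.getaddrinfo, clock=time.perf_counter):
    """Do one lookup per name and return {hostname: seconds, or None on failure}."""
    results = {}
    for name in hostnames:
        start = clock()
        try:
            resolve(name, None)
        except OSError:
            # socket.gaierror is a subclass of OSError; record a failure as None.
            results[name] = None
            continue
        results[name] = clock() - start
    return results


def slow_names(results, threshold=0.1):
    """Return the sorted names whose lookup failed or took longer than threshold seconds.

    The 0.1 s default is an arbitrary illustrative value, not a DPM limit.
    """
    return sorted(n for n, t in results.items() if t is None or t > threshold)


if __name__ == "__main__":
    # "localhost" is only a placeholder; list your own hosts here.
    timings = time_lookups(["localhost"])
    for name, secs in timings.items():
        print(name, "failed" if secs is None else f"{secs * 1000:.1f} ms")
```

Running it a few times with nscd off and then on should give a feel for how much the resolver is costing per lookup on a given node.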
Thanks,
Mark
On 04/01/13 13:49, Govind Songara wrote:
> My head node has been running fine for the last 45 days without any
> problems with the *nscd* service off, so I don't think it is needed.
>
> I do have all the pool nodes listed in the /etc/hosts file.
>
>
> On Fri, Jan 4, 2013 at 1:42 PM, Mark Slater wrote:
>> Hi All,
>>
>> I believe we finally have a working EMI2 head node!! After trying many
>> different things, it turns out all I needed to do was make sure the nscd
>> service was running (it isn't by default for my SL5 install). This has
>> allowed 4500 successful Atlas transfers in the last 4 hours and it even
>> seems more performant than the Glite one. I've also had no transfer errors
>> at all :) We are failing nagios tests for some reason but everything else is
>> working so I'm not going to touch it for a day or so before trying to tackle
>> that!
>>
>> Now, onto what I've learnt from the various poking, reinstalling, etc:
>>
>> * The source of the problem definitely seems to be either too many or too
>> slow DNS lookups in the DPM stack
>>
>> * It was magnified massively for us because we are forced to use the Google
>> DNS and didn't have anything in /etc/hosts
>>
>> * Adding the pool nodes and head node to /etc/hosts helped but didn't fix
>> the problem
>>
>> * I therefore guess that whatever DNS hammering is going on, it is not just
>> for the local machines but for remote ones as well
>>
>> * I think having the MySQL DB on the same machine helped a little, but only
>> because this probably took out some more DNS lookup action (this is
>> basically a guess though)
>>
>> * nscd was *not* required for the Glite head node. This old work horse (same
>> hardware as the new one but a different physical box) did not have this
>> service running and ran fine
>>
>> * Having said that it may be that the problem was there, just not as visible
>> as with EMI. In any case, there is a definite change between Glite and
>> EMI1/2.
>>
>>
>>
>> Overall, I would suggest adding as a requirement in the various twikis that
>> the nscd daemon *must* be running on the head node at least (maybe it's best
>> to have it on all service nodes??) and maybe putting a new bug report in to
>> DPM....
>>
>> For those interested in recreating the problem (and the fun I've been having
>> :)), simply turn off nscd and switch your DNS servers to 8.8.8.8 and 8.8.4.4.
>> Within 30-60mins, I'm fairly sure you should start seeing failures in
>> transfers (with Atlas at least) and a look in the logs will show it as well:
>>
>> grep Internal /var/log/dpm/log
>>
>>
>> The plan now is to schedule some DT and reinstall the node cleanly (it's a
>> bit of a mess at the moment what with all the playing around) and also shift
>> the DB back to a remote machine as that made it easier to reinstall the
>> node.
>>
>> But first, I think I might go to the pub....
>>
>> Thanks for all the help!!
>>
>> Mark
>>