JISCMail - TB-SUPPORT Archives

My head node has been running fine for the last 45 days, without any problems,
with the *nscd* service off. So I think it is not needed.

I do have all the pool nodes listed in the /etc/hosts file.
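
For anyone wanting to check the same thing on their own head node, here is a
rough sketch that reports which names from a list have no entry in a hosts
file. The function name and the one-hostname-per-line list file are made up
for illustration; nothing here comes from the DPM tools.

```shell
# Hypothetical helper: print every name in the list that has no entry
# in the hosts file.
# usage: missing_from_hosts <file-with-one-hostname-per-line> [hosts-file]
missing_from_hosts() {
    list="$1"
    hosts="${2:-/etc/hosts}"
    while read -r name; do
        [ -z "$name" ] && continue
        # Strip comments, then look for the name as a whole field after
        # the address column (field 1).
        if ! sed 's/#.*//' "$hosts" |
            awk -v n="$name" '{ for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                              END { exit !found }'; then
            echo "$name"
        fi
    done < "$list"
}
```

Running `missing_from_hosts pool-nodes.txt` (where pool-nodes.txt is your own
list) should print nothing if every pool node is covered.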


On Fri, Jan 4, 2013 at 1:42 PM, Mark Slater wrote:
> Hi All,
>
> I believe we finally have a working EMI2 head node!! After trying many
> different things, it turns out all I needed to do was make sure the nscd
> service was running (it isn't by default for my SL5 install). This has
> allowed 4500 successful Atlas transfers in the last 4 hours and it even
> seems more performant than the Glite one. I've also had no transfer errors
> at all :) We are failing nagios tests for some reason but everything else is
> working so I'm not going to touch it for a day or so before trying to tackle
> that!
>
> Now, onto what I've learnt from the various poking, reinstalling, etc:
>
> * The source of the problem definitely seems to be either a large number of
> DNS lookups or slow ones in the DPM stack
>
> * It was magnified massively for us because we are forced to use the Google
> DNS and didn't have anything in /etc/hosts
>
> * Adding the pool nodes and head node to /etc/hosts helped but didn't fix
> the problem
>
> * I therefore guess that whatever DNS hammering is going on, it is not just
> for the local machines but for remote ones as well
>
> * I think having the MySQL DB on the same machine helped a little, but only
> because this probably took out some more DNS lookup action (this is
> basically a guess though)
>
> * nscd was *not* required for the Glite head node. This old work horse (same
> hardware as the new one but a different physical box) did not have this
> service running and ran fine
>
> * Having said that it may be that the problem was there, just not as visible
> as with EMI. In any case, there is a definite change between Glite and
> EMI1/2.
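
One rough way to see how much slow lookups cost, with and without nscd, is to
time a batch of repeated lookups through the system resolver. This is only a
sketch: `time_lookups` is a made-up helper name, and it assumes `getent` is
available. `getent hosts` goes through NSS, so it should be served from the
nscd cache when the daemon is running and its hosts cache is enabled.

```shell
# Hypothetical helper: time COUNT repeated lookups of one name through
# the system resolver (NSS).
# usage: time_lookups <hostname> [count]
time_lookups() {
    name="$1"
    count="${2:-20}"
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        getent hosts "$name" > /dev/null
        i=$((i + 1))
    done
    end=$(date +%s)
    # date +%s only has one-second resolution, so use a large count
    # if individual lookups are fast.
    echo "$count lookups of $name took $((end - start))s"
}
```

Comparing the output for a remote host name with nscd stopped and then started
should give an idea of whether caching makes a difference at your site.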
>
>
>
> Overall, I would suggest adding as a requirement in the various twikis that
> the nscd daemon *must* be running on the head node at least (maybe it's best
> to have it on all service nodes??) and maybe putting a new bug report in to
> DPM....
>
> For those interested in recreating the problem (and the fun I've been having
> :)), simply turn off nscd and switch your DNS servers to 8.8.8.8 and 8.8.4.4.
> Within 30-60mins, I'm fairly sure you should start seeing failures in
> transfers (with Atlas at least) and a look in the logs will show it as well:
>
> grep Internal /var/log/dpm/log
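
To keep an eye on this over time rather than eyeballing the grep output,
something like the following could count the matching lines. A sketch only:
`count_internal` is a hypothetical helper, and the default path is simply the
one from the grep command above.

```shell
# Hypothetical helper: count the lines containing "Internal" in a log file.
# usage: count_internal [logfile]
count_internal() {
    logfile="${1:-/var/log/dpm/log}"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c Internal "$logfile"
}
```

Running it before and after toggling nscd shows whether the count is still
growing.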
>
>
> The plan now is to schedule some DT and reinstall the node cleanly (it's a
> bit of a mess at the moment what with all the playing around) and also shift
> the DB back to a remote machine as that made it easier to reinstall the
> node.
>
> But first, I think I might go to the pub....
>
> Thanks for all the help!!
>
> Mark
>