My head node has been running fine for the last 45 days with the *nscd*
service off, so I don't think it is needed. I do have all the pool nodes
listed in the /etc/hosts file.

On Fri, Jan 4, 2013 at 1:42 PM, Mark Slater <[log in to unmask]> wrote:
> Hi All,
>
> I believe we finally have a working EMI2 head node!! After trying many
> different things, it turns out all I needed to do was make sure the nscd
> service was running (it isn't by default for my SL5 install). This has
> allowed 4500 successful Atlas transfers in the last 4 hours and it even
> seems more performant than the Glite one. I've also had no transfer errors
> at all :) We are failing nagios tests for some reason but everything else is
> working so I'm not going to touch it for a day or so before trying to tackle
> that!
>
> Now, onto what I've learnt from the various poking, reinstalling, etc:
>
> * The source of the problem definitely seems to be either lots of, or slow,
> DNS lookups in the DPM stack
>
> * It was magnified massively for us because we are forced to use the Google
> DNS and didn't have anything in /etc/hosts
>
> * Adding the pool nodes and head node to /etc/hosts helped but didn't fix
> the problem
>
> * I therefore guess that whatever DNS hammering is going on, it is not just
> for the local machines but remote ones as well
>
> * I think having the MySQL DB on the same machine helped a little, but only
> because this probably took out some more DNS lookup action (this is
> basically a guess though)
>
> * nscd was *not* required for the Glite head node. This old work horse (same
> hardware as the new one but a different physical box) did not have this
> service running and ran fine
>
> * Having said that it may be that the problem was there, just not as visible
> as with EMI. In any case, there is a definite change between Glite and
> EMI1/2.
>
> Overall, I would suggest adding as a requirement in the various twikis that
> the nscd daemon *must* be running on the head node at least (maybe it's best
> to have it on all service nodes??) and maybe putting a new bug report in to
> DPM....
>
> For those interested in recreating the problem (and the fun I've been having
> :)), simply turn off nscd and switch your DNS's to 8.8.8.8 and 4.4.4.4.
> Within 30-60mins, I'm fairly sure you should start seeing failures in
> transfers (with Atlas at least) and a look in the logs will show it as well:
>
> grep Internal /var/log/dpm/log
>
> The plan now is to schedule some DT and reinstall the node cleanly (it's a
> bit of a mess at the moment what with all the playing around) and also shift
> the DB back to a remote machine as that made it easier to reinstall the
> node.
>
> But first, I think I might go to the pub....
>
> Thanks for all the help!!
>
> Mark
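
For anyone wanting to compare a node with nscd on against one with it off, the slow-lookup guess in the quoted message could be checked by timing repeated name lookups. The following is only a minimal sketch, assuming Python 3 is available on the node. The names `time_lookups` and `summarise` are made up for illustration and are not part of DPM, and it measures the system resolver only, not what the DPM daemons themselves do.

```python
import socket
import time


def time_lookups(hostnames, resolve=socket.gethostbyname, repeats=3):
    """Time each lookup `repeats` times.

    Returns {hostname: [seconds, ...]}, with None recorded for a failed lookup.
    `resolve` defaults to the system resolver; it is a parameter so that a
    different lookup function can be swapped in.
    """
    timings = {}
    for name in hostnames:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            try:
                resolve(name)
                samples.append(time.perf_counter() - start)
            except OSError:
                # socket.gaierror and socket.herror are subclasses of OSError
                samples.append(None)
        timings[name] = samples
    return timings


def summarise(timings):
    """Return {hostname: (slowest successful lookup in seconds or None, failures)}."""
    summary = {}
    for name, samples in timings.items():
        ok = [s for s in samples if s is not None]
        summary[name] = (max(ok) if ok else None, len(samples) - len(ok))
    return summary
```

Calling `summarise(time_lookups([...]))` with the head node, pool node and a few remote storage hostnames, once with nscd running and once with it stopped, should show whether repeated lookups get noticeably faster with the cache. If they don't, the difference between our two setups is probably somewhere else.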