Hi Mark,

Yes, we are using the Uni's one.

Cheers
Govind

On Fri, Jan 4, 2013 at 1:53 PM, Mark Slater <[log in to unmask]> wrote:
> Hi Govind,
>
> But do you have a fast DNS lookup? That's probably where we were getting hit
> by it. If you're using a reliable uni one then it's probably fine but if you
> can't or the DNS is a little slow, then that's where I think the problems
> start.
>
> Thanks,
>
> Mark
>
>
> On 04/01/13 13:49, Govind Songara wrote:
>>
>> My headnode has been running fine for the last 45 days without any problem
>> with the *nscd* service off. So I think it is not needed.
>>
>> I do have all pool nodes listed in the /etc/hosts file.
>>
>>
>> On Fri, Jan 4, 2013 at 1:42 PM, Mark Slater <[log in to unmask]> wrote:
>>>
>>> Hi All,
>>>
>>> I believe we finally have a working EMI2 head node!! After trying many
>>> different things, it turns out all I needed to do was make sure the nscd
>>> service was running (it isn't by default for my SL5 install). This has
>>> allowed 4500 successful Atlas transfers in the last 4 hours and it even
>>> seems more performant than the Glite one. I've also had no transfer errors
>>> at all :) We are failing nagios tests for some reason but everything else
>>> is working so I'm not going to touch it for a day or so before trying to
>>> tackle that!
>>>
>>> Now, onto what I've learnt from the various poking, reinstalling, etc:
>>>
>>> * The source of the problem definitely seems to be either lots of DNS
>>>   lookups or slow DNS lookups in the DPM stack
>>>
>>> * It was magnified massively for us because we are forced to use the
>>>   Google DNS and didn't have anything in /etc/hosts
>>>
>>> * Adding the pool nodes and head node to /etc/hosts helped but didn't fix
>>>   the problem
>>>
>>> * I therefore guess that whatever DNS hammering is going on, it is not
>>>   just for the local machines but remote ones as well
>>>
>>> * I think having the MySQL DB on the same machine helped a little, but
>>>   only because this probably took out some more DNS lookup action (this
>>>   is basically a guess though)
>>>
>>> * nscd was *not* required for the Glite head node. This old work horse
>>>   (same hardware as the new one but a different physical box) did not
>>>   have this service running and ran fine
>>>
>>> * Having said that, it may be that the problem was there, just not as
>>>   visible as with EMI. In any case, there is a definite change between
>>>   Glite and EMI1/2.
>>>
>>>
>>>
>>> Overall, I would suggest adding as a requirement in the various twikis
>>> that the nscd daemon *must* be running on the head node at least (maybe
>>> it's best to have it on all service nodes??) and maybe putting a new bug
>>> report in to DPM....
>>>
>>> For those interested in recreating the problem (and the fun I've been
>>> having :)), simply turn off nscd and switch your DNS servers to 8.8.8.8
>>> and 8.8.4.4.
>>> Within 30-60mins, I'm fairly sure you should start seeing failures in
>>> transfers (with Atlas at least) and a look in the logs will show it as
>>> well:
>>>
>>> grep Internal /var/log/dpm/log
>>>
>>>
>>> The plan now is to schedule some DT and reinstall the node cleanly (it's
>>> a bit of a mess at the moment what with all the playing around) and also
>>> shift the DB back to a remote machine as that made it easier to reinstall
>>> the node.
>>>
>>> But first, I think I might go to the pub....
>>>
>>> Thanks for all the help!!
>>>
>>> Mark
>>>
>
>