JISCMail - TB-SUPPORT Archives

Hi Mark,

Yes, we are using the uni's one.

Cheers
Govind

On Fri, Jan 4, 2013 at 1:53 PM, Mark Slater wrote:
> Hi Govind,
>
> But do you have a fast DNS lookup? That's probably where we were getting hit
> by it. If you're using a reliable uni one then it's probably fine but if you
> can't or the DNS is a little slow, then that's where I think the problems
> start.
>
> Thanks,
>
> Mark
>
>
> On 04/01/13 13:49, Govind Songara wrote:
>>
>> My head node has been running fine for the last 45 days without any
>> problems with the *nscd* service off, so I don't think it is needed.
>>
>> I do have all the pool nodes listed in the /etc/hosts file.
>>
>>
>> On Fri, Jan 4, 2013 at 1:42 PM, Mark Slater wrote:
>>>
>>> Hi All,
>>>
>>> I believe we finally have a working EMI2 head node!! After trying many
>>> different things, it turns out all I needed to do was make sure the nscd
>>> service was running (it isn't by default for my SL5 install). This has
>>> allowed 4500 successful Atlas transfers in the last 4 hours and it even
>>> seems more performant than the Glite one. I've also had no transfer
>>> errors at all :) We are failing nagios tests for some reason but
>>> everything else is working so I'm not going to touch it for a day or so
>>> before trying to tackle that!
>>>
>>> Now, onto what I've learnt from the various poking, reinstalling, etc:
>>>
>>> * The source of the problem definitely seems to be either lots of DNS
>>> lookups or slow ones in the DPM stack
>>>
>>> * It was magnified massively for us because we are forced to use the
>>> Google DNS and didn't have anything in /etc/hosts
>>>
>>> * Adding the pool nodes and head node to /etc/hosts helped but didn't fix
>>> the problem
>>>
>>> * I therefore guess that whatever DNS hammering is going on, it is not
>>> just for the local machines but for remote ones as well
>>>
>>> * I think having the MySQL DB on the same machine helped a little, but
>>> only because this probably took out some more DNS lookup action (this
>>> is basically a guess though)
>>>
>>> * nscd was *not* required for the Glite head node. This old workhorse
>>> (same hardware as the new one but a different physical box) did not have
>>> this service running and ran fine
>>>
>>> * Having said that, it may be that the problem was there, just not as
>>> visible as with EMI. In any case, there is a definite change between
>>> Glite and EMI1/2.
>>>
>>>
>>>
>>> Overall, I would suggest adding as a requirement in the various twikis
>>> that the nscd daemon *must* be running on the head node at least (maybe
>>> it's best to have it on all service nodes??) and maybe putting a new bug
>>> report in to DPM....
>>>
>>> For those interested in recreating the problem (and the fun I've been
>>> having :)), simply turn off nscd and switch your DNS servers to 8.8.8.8
>>> and 8.8.4.4. Within 30-60 mins, I'm fairly sure you should start seeing
>>> failures in transfers (with Atlas at least) and a look in the logs will
>>> show it as well:
>>>
>>> grep Internal /var/log/dpm/log
>>>
>>>
>>> The plan now is to schedule some DT and reinstall the node cleanly (it's
>>> a bit of a mess at the moment what with all the playing around) and also
>>> shift the DB back to a remote machine as that made it easier to reinstall
>>> the node.
>>>
>>> But first, I think I might go to the pub....
>>>
>>> Thanks for all the help!!
>>>
>>> Mark
>>>
>
>
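
For anyone wanting to check whether slow lookups are a factor on their own
head node, here is a rough sketch (not part of DPM, and not something anyone
in this thread ran) that times name resolution through the system resolver.
It assumes Python 3 is available; the function names and the 0.1 s threshold
are made up for illustration. Because it goes through the normal
getaddrinfo path, entries in /etc/hosts should show up as fast lookups, and
a running nscd would be expected to speed up repeated queries too.

```python
import socket
import time


def time_lookups(hostnames, resolve=socket.getaddrinfo):
    """Return a list of (hostname, seconds, ok) tuples, one per hostname.

    `resolve` is injectable so the timing logic can be exercised offline.
    """
    results = []
    for name in hostnames:
        start = time.perf_counter()
        try:
            resolve(name, None)
            ok = True
        except OSError:  # socket.gaierror is a subclass of OSError
            ok = False
        elapsed = time.perf_counter() - start
        results.append((name, elapsed, ok))
    return results


def slow_lookups(results, threshold=0.1):
    """Pick out lookups that failed or took longer than `threshold` seconds.

    The default threshold is an arbitrary illustrative value.
    """
    return [r for r in results if not r[2] or r[1] > threshold]


if __name__ == "__main__":
    # Only "localhost" is listed so the sketch runs anywhere; add your own
    # head node, pool nodes and a few remote endpoints to this list.
    hosts = ["localhost"]
    for name, secs, ok in time_lookups(hosts):
        status = "ok" if ok else "FAILED"
        print(f"{name:30s} {secs * 1000:8.1f} ms  {status}")
```

Running it a few times with nscd on and then off, and comparing the
timings for remote names, should give a feel for how much the resolver is
contributing.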