JISCMail - LCG-ROLLOUT Archives

Hi Maxim,

I agree that it makes sense to check potential physical causes for
problems and not just rule them out. Nevertheless, in this case I think
it is highly unlikely that only the 3 RBs would start giving trouble on
the same day while none of the other ~300 hosts in the same room do.

Below are the answers to your questions.

>>> 1. Can you please tell where are these servers physically located:
>>> Do they share the same room ?
Yes.

>>> 2. Do they share the same power line ?
No.

>>> 3. What power protection are you using ?
I have no idea.

>>> 4. Do they share storage ?
No.

>>> 5. What is the average temperature in the place they're located at ?
~22 Celsius.

>>> 6. what's the amount of memory and swap on each of them, what's their
>>> loadavg, what does vmstat show ?
                RAM     swap avail/used    load avg.
bosheks:        3 GB    1 GB / 0 GB        0.48 0.60 0.62
boswachter:     5 GB    5 GB / 0 GB        0.27 0.30 0.28
boszwijn:       8 GB    4 GB / 0 GB        0.33 0.26 0.20
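
For anyone who wants to collect the same kind of figures on their own
hosts, here is a minimal Python sketch. It assumes a Linux /proc
filesystem with "Key: value kB" lines in /proc/meminfo and a reasonably
recent Python 3. The function names (parse_meminfo, parse_loadavg,
summarize) are made up for illustration, and this is not necessarily how
the numbers above were obtained.

```python
import os


def parse_meminfo(text):
    """Return the numeric 'Key: value' fields of /proc/meminfo as a dict."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def parse_loadavg(text):
    """Return the first three numbers of /proc/loadavg as floats."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def summarize(meminfo_text, loadavg_text):
    """Combine RAM, swap and load figures into one dict (sizes in kB)."""
    mem = parse_meminfo(meminfo_text)
    swap_total = mem.get("SwapTotal", 0)
    return {
        "ram_kb": mem.get("MemTotal", 0),
        "swap_total_kb": swap_total,
        "swap_used_kb": swap_total - mem.get("SwapFree", 0),
        "loadavg": parse_loadavg(loadavg_text),
    }


# Only touch /proc when it is actually there (i.e. on Linux).
if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    with open("/proc/loadavg") as f:
        loadavg_text = f.read()
    print(summarize(meminfo_text, loadavg_text))
```

The parsing is kept separate from the file reading so that the same
functions can be run on text copied from a remote host.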

>>> 7. which linux kernel/distribution are you using ? ( sorry for not
>>> investigating this via gstat app )
All 3: CentOS 3.7, with kernel 2.4.21-47.ELsmp


Cheers,
Ronald




Maxim Kovgan wrote:
> No, I am not confused.
> Some memory failures are caused by power problems, and those memory
> failures cause miscellaneous errors (including SIGSEGV),
> so the first thing is to check that the hardware side is OK,
> then the low-level software, and only then would I look further.
> There are many places to look.
> 
> And since we still have no clear understanding of what is causing the
> daemons to segfault, we had better try any new ideas (assuming you have
> already searched enough to know it is not a known issue, and I think
> you are a serious person who does this).
> 
> And, these ideas don't cost much.
> 
> Do we agree on this ?
> 
> I mean, when something _weird_ and sporadic in time is happening, I
> always start diagnosing from the lowest level possible: power, etc.
> From the data presented I cannot rule out this possibility.
> 
> Best regards,
> 
> Max.
> 
> 
> Jeff Templon wrote:
>> Hi Maxim,
>>
>> I think you're following a false lead.  The "network server" is a
>> piece of software, one of the daemon pieces of the LCG-RB.  It's not a
>> "network server machine" like you are probably thinking.   The LCG-RB
>> machine as a piece of hardware works fine; it is the daemon called
>> "network server" on this machine that is crashing.
>>
>>                     JT
>>
>> Maxim Kovgan wrote:
>>> 1. Can you please tell where are these servers physically located:
>>> Do they share the same room ?
>>> 2. Do they share the same power line ?
>>> 3. What power protection are you using ?
>>> 4. Do they share storage ?
>>> 5. What is the average temperature in the place they're located at ?
>>> 6. what's the amount of memory and swap on each of them, what's their
>>> loadavg, what does vmstat show ?
>>> 7. which linux kernel/distribution are you using ? ( sorry for not
>>> investigating this via gstat app )
>>>
>>> Regards.
>>>
>>>
>>>
>>> Ronald Starink wrote:
>>>> Hi all,
>>>>
>>>> For a few days now, our resource brokers have been suffering from
>>>> frequently crashing network servers. The crashes occur when a new
>>>> connection from the UI is
>>>> set up, either via an edg-job-submit or an edg-job-get-output. They do
>>>> not occur on every connection and are not related to connections by a
>>>> specific user or from a specific user interface.
>>>>
>>>> The following fragment from /var/edgwl/networkserver/log/events.log
>>>> shows a crash at 17:39:04 when a new connection came in.
>>>>
>>>> 19 Apr, 17:38:16 [7] -I- "CFSI::doPurge": Preparing to Purge.
>>>> 19 Apr, 17:38:16 [7] -W- "CFSI::doPurge": JobId object for purging
>>>> created: https://bosheks.nikhef.nl:9000/1tRxIPpaaNiR4Z6vxFIZEw
>>>> 19 Apr, 17:38:16 [7] -F- "CFSI::LogPurgeJobN": Logging Purge Request.
>>>> 19 Apr, 17:38:18 [7] -I- "Manager::run": Command done
>>>> 19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
>>>> 19 Apr, 17:40:11 [0] -F- "   NS::main":
>>>> --------------------------------------
>>>> 19 Apr, 17:40:11 [0] -F- "   NS::main": Starting Network Server...
>>>> 19 Apr, 17:40:11 [0] -F- "  NSR::drop": Already running in an
>>>> unprivileged account...
>>>> [.....]
>>>> 19 Apr, 17:40:11 [0] -F- "   NS::main":
>>>> --------------------------------------
>>>> 19 Apr, 17:40:21 [2] -I- "Manager::run": Connection from: erf.nikhef.nl
>>>> 19 Apr, 17:40:22 [2] -I- "Manager::run": Authentication with host
>>>> erf.nikhef.nl succeeded for *************************
>>>>
>>>> Tracing the edg-wl-ns daemon indicates that there is a segmentation
>>>> violation:
>>>>
>>>> [root@boszwijn root]# strace -Ff -p 3999
>>>> Process 3999 attached - interrupt to quit
>>>> futex(0x9c6fe9c, FUTEX_WAIT, 1, NULL)   = -1 EINTR (Interrupted
>>>> system call)
>>>> +++ killed by SIGSEGV +++
>>>>
>>>>
>>>> This problem has been occurring since Tuesday on our three resource
>>>> brokers that are available for production: bosheks.nikhef.nl,
>>>> boswachter.nikhef.nl and boszwijn.nikhef.nl. The latter is a new
>>>> machine that was first installed last Tuesday. The three hosts
>>>> combine the resource broker with a BDII.
>>>>
>>>> We did not perform middleware upgrades on Tuesday. The package versions
>>>> installed on the RB correspond to those specified in the meta package
>>>> lcg-RB-3.0.10-1.
>>>>
>>>> The fact that only RBs have this problem probably points to "some"
>>>> problem in the configuration on our site, but we have run out of ideas
>>>> about the possible cause.
>>>>
>>>> Any help is much appreciated.
>>>>
>>>> Thanks,
>>>> Ronald
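
The events.log fragment quoted above shows a recognisable pattern: a
"Connection from:" line, then nothing until "Starting Network Server..."
about a minute later. Below is a minimal Python sketch that scans such a
log for restarts and reports the last connection seen before each one.
It assumes the line layout shown in the fragment (date, time, [thread],
-level-, "source": message) and an unquoted log file, i.e. without the
mail quoting prefix. The name find_restarts is hypothetical. It is only
a heuristic: a restart after a connection is not proof that this
connection caused the crash.

```python
import re

# Layout assumed from the quoted fragment, e.g.
# 19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
LINE_RE = re.compile(
    r'^(?P<stamp>\d{1,2} \w{3}, \d{2}:\d{2}:\d{2}) \[(?P<thread>\d+)\] '
    r'-(?P<level>\w)- "\s*(?P<source>[^"]+)":\s*(?P<message>.*)$'
)


def find_restarts(lines):
    """Return a list of (restart_stamp, last_connection) tuples.

    last_connection is (stamp, host) for the most recent
    "Connection from:" line seen since the previous restart,
    or None if there was none.
    """
    restarts = []
    last_connection = None
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # wrapped continuation or separator line
        message = match.group("message")
        if message.startswith("Connection from:"):
            host = message.split(":", 1)[1].strip()
            last_connection = (match.group("stamp"), host)
        elif message.startswith("Starting Network Server"):
            restarts.append((match.group("stamp"), last_connection))
            last_connection = None
    return restarts
```

Since the function accepts any iterable of lines, an open file object
for the events.log can be passed to it directly.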