Hi Maxim,
I think you're following a false lead. The "network server" is a piece
of software, one of the daemons of the LCG-RB. It's not a "network
server machine" as you are probably thinking. The LCG-RB machine as a
piece of hardware works fine; it is the daemon called "network server"
running on that machine that is crashing.
JT
Maxim Kovgan wrote:
> 1. Can you please tell me where these servers are physically located?
> Do they share the same room?
> 2. Do they share the same power line?
> 3. What power protection are you using?
> 4. Do they share storage?
> 5. What is the average temperature where they are located?
> 6. How much memory and swap does each of them have, what is their
> loadavg, and what does vmstat show?
> 7. Which Linux kernel/distribution are you using? (Sorry for not
> investigating this via the gstat app.)
>
> Regards.
>
>
>
> Ronald Starink wrote:
>> Hi all,
>>
>> For the past few days, our resource brokers have been suffering from
>> frequently crashing network servers. The crashes occur when a new
>> connection from the UI is set up, either via an edg-job-submit or an
>> edg-job-get-output. They do not occur on every connection and are not
>> related to connections by a specific user or from a specific user
>> interface.
>>
>> The following fragment from /var/edgwl/networkserver/log/events.log
>> shows a crash at 17:39:04, when a new connection came in.
>>
>> 19 Apr, 17:38:16 [7] -I- "CFSI::doPurge": Preparing to Purge.
>> 19 Apr, 17:38:16 [7] -W- "CFSI::doPurge": JobId object for purging
>> created: https://bosheks.nikhef.nl:9000/1tRxIPpaaNiR4Z6vxFIZEw
>> 19 Apr, 17:38:16 [7] -F- "CFSI::LogPurgeJobN": Logging Purge Request.
>> 19 Apr, 17:38:18 [7] -I- "Manager::run": Command done
>> 19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
>> 19 Apr, 17:40:11 [0] -F- " NS::main":
>> --------------------------------------
>> 19 Apr, 17:40:11 [0] -F- " NS::main": Starting Network Server...
>> 19 Apr, 17:40:11 [0] -F- " NSR::drop": Already running in an
>> unprivileged account...
>> [.....]
>> 19 Apr, 17:40:11 [0] -F- " NS::main":
>> --------------------------------------
>> 19 Apr, 17:40:21 [2] -I- "Manager::run": Connection from: erf.nikhef.nl
>> 19 Apr, 17:40:22 [2] -I- "Manager::run": Authentication with host
>> erf.nikhef.nl succeeded for *************************
>>
>> Tracing the edg-wl-ns daemon indicates that there is a segmentation
>> violation:
>>
>> [root@boszwijn root]# strace -Ff -p 3999
>> Process 3999 attached - interrupt to quit
>> futex(0x9c6fe9c, FUTEX_WAIT, 1, NULL) = -1 EINTR (Interrupted system
>> call)
>> +++ killed by SIGSEGV +++
>>
>>
>> This problem has been occurring since Tuesday on our three resource
>> brokers that are available for production: bosheks.nikhef.nl,
>> boswachter.nikhef.nl and boszwijn.nikhef.nl. The latter is a new
>> machine that was first installed last Tuesday. The three hosts combine
>> the resource broker with a BDII.
>>
>> We did not perform middleware upgrades on Tuesday. The package versions
>> installed on the RB correspond to those specified in the meta package
>> lcg-RB-3.0.10-1.
>>
>> The fact that only RBs have this problem probably points to some
>> problem in the configuration at our site, but we have run out of ideas
>> about the possible cause.
>>
>> Any help is much appreciated.
>>
>> Thanks,
>> Ronald
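
To check how often each crash follows a new connection, the pattern in
the events.log fragment quoted above (a "Connection from:" line followed
by "Starting Network Server...") can be picked out mechanically. What
follows is only a rough sketch: the line format is inferred from the
quoted fragment, the names parse_line and find_restarts are made up for
illustration, the year is a placeholder because the log does not show
one, and the month abbreviation is assumed to be in English.

```python
import re
from datetime import datetime

# Line layout inferred from the quoted events.log fragment (an assumption;
# the real log may contain variants this pattern does not cover).
LINE_RE = re.compile(
    r'^(?P<stamp>\d{1,2} \w{3}, \d{2}:\d{2}:\d{2}) \[(?P<tid>\d+)\] '
    r'-(?P<level>[A-Z])- "\s*(?P<source>[^"]+)": ?(?P<message>.*)$'
)

# The log carries no year, so a placeholder is prepended (assumption).
PLACEHOLDER_YEAR = "2000"


def parse_line(line):
    """Return (timestamp, source, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(
        PLACEHOLDER_YEAR + " " + m.group("stamp"), "%Y %d %b, %H:%M:%S"
    )
    return stamp, m.group("source"), m.group("message")


def find_restarts(lines):
    """List (last_connection_time, host, restart_time) for each daemon start.

    The first two fields are None when no connection was logged since the
    previous start.
    """
    last_conn = None
    restarts = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # wrapped or unrecognised lines are skipped
        stamp, _source, message = parsed
        if message.startswith("Connection from:"):
            last_conn = (stamp, message.split(":", 1)[1].strip())
        elif message.startswith("Starting Network Server"):
            if last_conn is None:
                restarts.append((None, None, stamp))
            else:
                restarts.append((last_conn[0], last_conn[1], stamp))
            last_conn = None
    return restarts


# Lines taken from the quoted fragment, with the mail wrapping undone.
SAMPLE = """\
19 Apr, 17:38:18 [7] -I- "CFSI::doPurge": Preparing to Purge.
19 Apr, 17:38:18 [7] -I- "Manager::run": Command done
19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
19 Apr, 17:40:11 [0] -F- "  NS::main": Starting Network Server...
19 Apr, 17:40:21 [2] -I- "Manager::run": Connection from: erf.nikhef.nl
"""

if __name__ == "__main__":
    for conn_time, host, restart_time in find_restarts(SAMPLE.splitlines()):
        if conn_time is None:
            print("start at", restart_time.time(), "with no prior connection")
        else:
            gap = (restart_time - conn_time).total_seconds()
            print("start at", restart_time.time(), "after connection from",
                  host, "at", conn_time.time(), "gap", gap, "s")
```

Run against the full log instead of SAMPLE, this would list every daemon
start together with the last connection seen before it, which may help
to see whether particular hosts or times stand out. It does not by
itself prove that a start was preceded by a crash.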