1. Can you please tell us where these servers are physically located?
Do they share the same room?
2. Do they share the same power line?
3. What power protection are you using?
4. Do they share storage?
5. What is the average temperature where they are located?
6. How much memory and swap does each of them have, what is their
load average, and what does vmstat show?
7. Which Linux kernel and distribution are you using? (Sorry for not
investigating this via the gstat app.)
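
If it helps with questions 6 and 7, here is a minimal sketch that collects
the memory, swap, load average and kernel details in one go. It assumes a
Linux host with the usual /proc files; the function names (parse_meminfo,
parse_loadavg, report) are made up for this example and are not part of any
LCG or EDG tool. It does not replace vmstat, whose output I would still
like to see.

```python
from pathlib import Path


def parse_meminfo(text):
    """Return selected /proc/meminfo fields as integers (values are in kB)."""
    wanted = {"MemTotal", "MemFree", "SwapTotal", "SwapFree"}
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in wanted:
            # Lines look like "MemTotal:   1034567 kB"; keep the number only.
            values[key] = int(rest.split()[0])
    return values


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages as floats."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def report():
    """Print memory, swap, load average and kernel version for this host."""
    # Assumption: /proc is mounted and readable, as on a standard Linux box.
    mem = parse_meminfo(Path("/proc/meminfo").read_text())
    load = parse_loadavg(Path("/proc/loadavg").read_text())
    kernel = Path("/proc/version").read_text().strip()
    for key in sorted(mem):
        print(f"{key}: {mem[key]} kB")
    print("loadavg (1/5/15 min): %.2f %.2f %.2f" % load)
    print(f"kernel: {kernel}")
```

Calling report() on each of the three brokers and pasting the output here,
together with a few lines of vmstat taken around the time of a crash, would
be enough to start with.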

Regards.



Ronald Starink wrote:
> Hi all,
> 
> For the past few days, our resource brokers have been suffering from
> frequently crashing network servers. The crashes occur when a new
> connection from the UI is set up, via either an edg-job-submit or an
> edg-job-get-output. They do
> not occur on every connection and are not related to connections by a
> specific user or from a specific user interface.
> 
> The following fragment from /var/edgwl/networkserver/log/events.log
> shows a crash at 17:39:04, when a new connection came in.
> 
> 19 Apr, 17:38:16 [7] -I- "CFSI::doPurge": Preparing to Purge.
> 19 Apr, 17:38:16 [7] -W- "CFSI::doPurge": JobId object for purging
> created: https://bosheks.nikhef.nl:9000/1tRxIPpaaNiR4Z6vxFIZEw
> 19 Apr, 17:38:16 [7] -F- "CFSI::LogPurgeJobN": Logging Purge Request.
> 19 Apr, 17:38:18 [7] -I- "Manager::run": Command done
> 19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
> 19 Apr, 17:40:11 [0] -F- "   NS::main":
> --------------------------------------
> 19 Apr, 17:40:11 [0] -F- "   NS::main": Starting Network Server...
> 19 Apr, 17:40:11 [0] -F- "  NSR::drop": Already running in an
> unprivileged account...
> [.....]
> 19 Apr, 17:40:11 [0] -F- "   NS::main":
> --------------------------------------
> 19 Apr, 17:40:21 [2] -I- "Manager::run": Connection from: erf.nikhef.nl
> 19 Apr, 17:40:22 [2] -I- "Manager::run": Authentication with host
> erf.nikhef.nl succeeded for *************************
> 
> Tracing the edg-wl-ns daemon indicates that there is a segmentation
> violation:
> 
> [root@boszwijn root]# strace -Ff -p 3999
> Process 3999 attached - interrupt to quit
> futex(0x9c6fe9c, FUTEX_WAIT, 1, NULL)   = -1 EINTR (Interrupted system call)
> +++ killed by SIGSEGV +++
> 
> 
> This problem has been occurring since Tuesday on our three resource
> brokers that are available for production: bosheks.nikhef.nl,
> boswachter.nikhef.nl and boszwijn.nikhef.nl. The latter is a new machine
> that was first installed last Tuesday. The three hosts combine the
> resource broker with a BDII.
> 
> We did not perform middleware upgrades on Tuesday. The package versions
> installed on the RB correspond to those specified in the meta package
> lcg-RB-3.0.10-1.
> 
> The fact that only the RBs have this problem probably points to "some"
> problem in the configuration at our site, but we have run out of ideas
> about the possible cause.
> 
> Any help is much appreciated.
> 
> Thanks,
> Ronald