Hi all,
For the past few days, our resource brokers have been suffering from frequently crashing
network servers. The crashes occur when a new connection from the UI is
set up, either via an edg-job-submit or an edg-job-get-output. They do
not occur on every connection and are not related to connections by a
specific user or from a specific user interface.
The following fragment from /var/edgwl/networkserver/log/events.log
shows a crash at 17:39:04, when a new connection came in.
19 Apr, 17:38:16 [7] -I- "CFSI::doPurge": Preparing to Purge.
19 Apr, 17:38:16 [7] -W- "CFSI::doPurge": JobId object for purging
created: https://bosheks.nikhef.nl:9000/1tRxIPpaaNiR4Z6vxFIZEw
19 Apr, 17:38:16 [7] -F- "CFSI::LogPurgeJobN": Logging Purge Request.
19 Apr, 17:38:18 [7] -I- "Manager::run": Command done
19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
19 Apr, 17:40:11 [0] -F- " NS::main":
--------------------------------------
19 Apr, 17:40:11 [0] -F- " NS::main": Starting Network Server...
19 Apr, 17:40:11 [0] -F- " NSR::drop": Already running in an
unprivileged account...
[.....]
19 Apr, 17:40:11 [0] -F- " NS::main":
--------------------------------------
19 Apr, 17:40:21 [2] -I- "Manager::run": Connection from: erf.nikhef.nl
19 Apr, 17:40:22 [2] -I- "Manager::run": Authentication with host
erf.nikhef.nl succeeded for *************************
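The pattern in this fragment (a "Connection from" entry followed directly by a "Starting Network Server..." entry, with no authentication in between) can be searched for across the whole log. Below is a minimal sketch in Python, assuming the entry layout shown in the fragment above and English month abbreviations; the function names parse_entries and find_restarts are made up for illustration and are not part of the EDG software.

```python
import re
from datetime import datetime

# Matches the start of an events.log entry as seen in the fragment above, e.g.
# 19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
ENTRY = re.compile(
    r'^(\d{1,2} \w{3}, \d{2}:\d{2}:\d{2}) \[(\d+)\] -(\w)- "\s*([^"]+)":\s*(.*)$'
)

def parse_entries(lines):
    """Yield (timestamp, function, message) for each line that starts an entry.

    Wrapped continuation lines and separator lines do not match and are skipped.
    """
    for line in lines:
        m = ENTRY.match(line.rstrip("\n"))
        if m:
            # The log has no year; 2000 is only a placeholder for parsing.
            ts = datetime.strptime("2000 " + m.group(1), "%Y %d %b, %H:%M:%S")
            yield ts, m.group(4), m.group(5)

def find_restarts(lines):
    """Return a list of (last_entry_before_restart, restart_timestamp)."""
    restarts = []
    previous = None
    for ts, func, msg in parse_entries(lines):
        if func == "NS::main":
            # Banner lines from NS::main are not remembered as "previous".
            if msg.startswith("Starting Network Server") and previous is not None:
                restarts.append((previous, ts))
            continue
        previous = (ts, func, msg)
    return restarts

if __name__ == "__main__":
    with open("/var/edgwl/networkserver/log/events.log") as log:
        for (ts, func, msg), restart in find_restarts(log):
            gap = int((restart - ts).total_seconds())
            print(f"{ts:%d %b %H:%M:%S} {func}: {msg} -> restart {gap}s later")
```

A restart whose preceding entry is a "Connection from" line would match the crash described here; restarts preceded by other entries may be ordinary daemon restarts.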
Tracing the edg-wl-ns daemon indicates that there is a segmentation
violation:
[root@boszwijn root]# strace -Ff -p 3999
Process 3999 attached - interrupt to quit
futex(0x9c6fe9c, FUTEX_WAIT, 1, NULL) = -1 EINTR (Interrupted system call)
+++ killed by SIGSEGV +++
This problem has been occurring since Tuesday on our three resource brokers that are
available for production: bosheks.nikhef.nl, boswachter.nikhef.nl and
boszwijn.nikhef.nl. The last of these is a new machine that was first installed
last Tuesday. All three hosts combine the resource broker with a BDII.
We did not perform middleware upgrades on Tuesday. The package versions
installed on the RB correspond to those specified in the meta package
lcg-RB-3.0.10-1.
The fact that only RBs have this problem probably points to some
problem in the configuration at our site, but we have run out of ideas
about the possible cause.
Any help is much appreciated.
Thanks,
Ronald