1. Can you please tell me where these servers are physically located? Do they share the same room?
2. Do they share the same power line?
3. What power protection are you using?
4. Do they share storage?
5. What is the average temperature in the place where they are located?
6. How much memory and swap does each of them have, what is their load average, and what does vmstat show?
7. Which Linux kernel/distribution are you using? (Sorry for not looking this up via the gstat app.)

Regards.

Ronald Starink wrote:
> Hi all,
>
> Since a few days, our resource brokers suffer from frequently crashing
> network servers. The crashes occur when a new connection from the UI is
> set up, either via an edg-job-submit or an edg-job-get-output. They do
> not occur on every connection and are not related to connections by a
> specific user or from a specific user interface.
>
> The following fragment from /var/edgwl/networkserver/log/events.log
> shows a crash on 17:39:04 when a new connection came in.
>
> 19 Apr, 17:38:16 [7] -I- "CFSI::doPurge": Preparing to Purge.
> 19 Apr, 17:38:16 [7] -W- "CFSI::doPurge": JobId object for purging
> created: https://bosheks.nikhef.nl:9000/1tRxIPpaaNiR4Z6vxFIZEw
> 19 Apr, 17:38:16 [7] -F- "CFSI::LogPurgeJobN": Logging Purge Request.
> 19 Apr, 17:38:18 [7] -I- "Manager::run": Command done
> 19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
> 19 Apr, 17:40:11 [0] -F- " NS::main":
> --------------------------------------
> 19 Apr, 17:40:11 [0] -F- " NS::main": Starting Network Server...
> 19 Apr, 17:40:11 [0] -F- " NSR::drop": Already running in an
> unprivileged account...
> [.....]
> 19 Apr, 17:40:11 [0] -F- " NS::main":
> --------------------------------------
> 19 Apr, 17:40:21 [2] -I- "Manager::run": Connection from: erf.nikhef.nl
> 19 Apr, 17:40:22 [2] -I- "Manager::run": Authentication with host
> erf.nikhef.nl succeeded for *************************
>
> Tracing the edg-wl-ns daemon indicates that there is a segmentation
> violation:
>
> [root@boszwijn root]# strace -Ff -p 3999
> Process 3999 attached - interrupt to quit
> futex(0x9c6fe9c, FUTEX_WAIT, 1, NULL) = -1 EINTR (Interrupted system call)
> +++ killed by SIGSEGV +++
>
>
> This problem occurs since Tuesday on our three resource brokers that are
> available for production: bosheks.nikhef.nl, boswachter.nikhef.nl and
> boszwijn.nikhef.nl. The latter is a new machine that was first installed
> last Tuesday. The three hosts combine the resource broker with a BDII.
>
> We did not perform middleware upgrades on Tuesday. The package versions
> installed on the RB correspond to those specified in the meta package
> lcg-RB-3.0.10-1.
>
> The fact that only RBs have this problem probably indicates to "some"
> problem in the configuration on our site, but we have run out of ideas
> about the possible cause.
>
> Any help is much appreciated.
>
> Thanks,
> Ronald
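
P.S. For questions 6 and 7 above, something along these lines could be run on each resource broker to collect the figures in one go. It is only a rough sketch: `collect_diag` is a made-up helper name (not part of any LCG/EDG tooling), and it assumes a Linux host with `/proc` mounted; `vmstat` is used only if it happens to be installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of any LCG/EDG tooling): prints kernel,
# distribution, memory/swap, load average and vmstat output for this host.
collect_diag() {
    echo "== host =="
    uname -n

    echo "== kernel =="
    uname -sr

    echo "== distribution =="
    # Release files differ between distributions; print the first one found.
    for f in /etc/redhat-release /etc/os-release /etc/issue; do
        if [ -r "$f" ]; then
            cat "$f"
            break
        fi
    done

    echo "== memory and swap =="
    if [ -r /proc/meminfo ]; then
        grep -E '^(MemTotal|MemFree|SwapTotal|SwapFree):' /proc/meminfo
    fi

    echo "== loadavg =="
    if [ -r /proc/loadavg ]; then
        cat /proc/loadavg
    fi

    echo "== vmstat =="
    # vmstat ships with procps and may be missing on minimal installs.
    if command -v vmstat >/dev/null 2>&1; then
        vmstat 1 3
    else
        echo "vmstat not installed"
    fi
}

collect_diag
```

Saving the output per host (for example by redirecting it to a file named after the machine) would make it easy to compare the three brokers side by side.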