Hi Maxim,

I agree that it makes sense to check potential physical causes for
problems and not just rule them out. Nevertheless, in this case I think
it is highly unlikely that only 3 RBs start giving trouble on the same
day and none of the ~300 other hosts in the same room.

Below are the answers to your questions.

>>> 1. Can you please tell where are these servers physically located:
>>> Do they share the same room ?

Yes.

>>> 2. Do they share the same power line ?

No.

>>> 3. what is the power protection are you using ?

I have no idea.

>>> 4. Do they share storage ?

No.

>>> 5. What is the average temperature in the place they're located at ?

~22 Celsius.

>>> 6. what's the amount of memory and swap on each of them, what's their
>>> loadavg, what does vmstat show ?

             RAM    swap avail/used   load avg.
bosheks:     3 GB   1 GB / 0 GB       0.48 0.60 0.62
boswachter:  5 GB   5 GB / 0 GB       0.27 0.30 0.28
boszwijn:    8 GB   4 GB / 0 GB       0.33 0.26 0.20

>>> 7. which linux kernel/distribution are you using ? ( sorry for not
>>> investigating this via gstat app )

All 3: CentOS 3.7, with kernel 2.4.21-47.ELsmp

Cheers,
Ronald

Maxim Kovgan wrote:
> NO, I am not confused.
> some memory failures are caused by power, those mem fails are causing
> misc ( incl. SIGSEGV ) stuff,
> so... the 1st thing is to see the hardware side is ok.
> then, low level software, and only then I'd look further...
> many places to look for.
>
> And since we still have no clear understanding of what is causing the
> daemons to segfault... we better try any new ideas, ( assuming you've
> googled yourself to the point you know it's not a known issue, and I
> think you're a serious person who does this... )
>
> And, these ideas don't cost much.
>
> Do we agree on this ?
>
> I mean, when something _weird_ and time sporadic is happening, I always
> start diagnosing from the lowest level possible: power, etc.
> From the data presented I have not negated this possibility.
>
> Best regards,
>
> Max.
>
>
> Jeff Templon wrote:
>> Hi Maxim,
>>
>> I think you're following a false lead. The "network server" is a
>> piece of software, one of the daemon pieces of the LCG-RB. It's not a
>> "network server machine" like you are probably thinking. The LCG-RB
>> machine as a piece of hardware works fine, the daemon, called "network
>> server", on this machine, is crashing.
>>
>> JT
>>
>> Maxim Kovgan wrote:
>>> 1. Can you please tell where are these servers physically located:
>>> Do they share the same room ?
>>> 2. Do they share the same power line ?
>>> 3. what is the power protection are you using ?
>>> 4. Do they share storage ?
>>> 5. What is the average temperature in the place they're located at ?
>>> 6. what's the amount of memory and swap on each of them, what's their
>>> loadavg, what does vmstat show ?
>>> 7. which linux kernel/distribution are you using ? ( sorry for not
>>> investigating this via gstat app )
>>>
>>> Regards.
>>>
>>>
>>>
>>> Ronald Starink wrote:
>>>> Hi all,
>>>>
>>>> Since a few days, our resource brokers suffer from frequently crashing
>>>> network servers. The crashes occur when a new connection from the UI is
>>>> set up, either via an edg-job-submit or an edg-job-get-output. They do
>>>> not occur on every connection and are not related to connections by a
>>>> specific user or from a specific user interface.
>>>>
>>>> The following fragment from /var/edgwl/networkserver/log/events.log
>>>> shows a crash on 17:39:04 when a new connection came in.
>>>>
>>>> 19 Apr, 17:38:16 [7] -I- "CFSI::doPurge": Preparing to Purge.
>>>> 19 Apr, 17:38:16 [7] -W- "CFSI::doPurge": JobId object for purging
>>>> created: https://bosheks.nikhef.nl:9000/1tRxIPpaaNiR4Z6vxFIZEw
>>>> 19 Apr, 17:38:16 [7] -F- "CFSI::LogPurgeJobN": Logging Purge Request.
>>>> 19 Apr, 17:38:18 [7] -I- "Manager::run": Command done
>>>> 19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
>>>> 19 Apr, 17:40:11 [0] -F- " NS::main":
>>>> --------------------------------------
>>>> 19 Apr, 17:40:11 [0] -F- " NS::main": Starting Network Server...
>>>> 19 Apr, 17:40:11 [0] -F- " NSR::drop": Already running in an
>>>> unprivileged account...
>>>> [.....]
>>>> 19 Apr, 17:40:11 [0] -F- " NS::main":
>>>> --------------------------------------
>>>> 19 Apr, 17:40:21 [2] -I- "Manager::run": Connection from: erf.nikhef.nl
>>>> 19 Apr, 17:40:22 [2] -I- "Manager::run": Authentication with host
>>>> erf.nikhef.nl succeeded for *************************
>>>>
>>>> Tracing the edg-wl-ns daemon indicates that there is a segmentation
>>>> violation:
>>>>
>>>> [root@boszwijn root]# strace -Ff -p 3999
>>>> Process 3999 attached - interrupt to quit
>>>> futex(0x9c6fe9c, FUTEX_WAIT, 1, NULL) = -1 EINTR (Interrupted
>>>> system call)
>>>> +++ killed by SIGSEGV +++
>>>>
>>>>
>>>> This problem occurs since Tuesday on our three resource brokers that
>>>> are available for production: bosheks.nikhef.nl, boswachter.nikhef.nl
>>>> and boszwijn.nikhef.nl. The latter is a new machine that was first
>>>> installed last Tuesday. The three hosts combine the resource broker
>>>> with a BDII.
>>>>
>>>> We did not perform middleware upgrades on Tuesday. The package versions
>>>> installed on the RB correspond to those specified in the meta package
>>>> lcg-RB-3.0.10-1.
>>>>
>>>> The fact that only RBs have this problem probably points to "some"
>>>> problem in the configuration on our site, but we have run out of ideas
>>>> about the possible cause.
>>>>
>>>> Any help is much appreciated.
>>>>
>>>> Thanks,
>>>> Ronald
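
The pattern in the quoted events.log fragment (a "Connection from:" line
followed by "Starting Network Server...") can be searched for mechanically,
which may help to see whether the restarts cluster around particular hosts
or times. Below is a minimal sketch in Python, not a tested tool. It assumes
that each entry in the real log file sits on a single line (the wrapping in
the quoted fragment is taken to come from the mail client) and that the
layout matches the fragment. The names LINE_RE and find_restarts are made up
for this illustration. A deliberate restart of the daemon would be reported
as well, so the output is only a list of candidates.

```python
import re

# Assumed entry layout, copied from the quoted fragment:
#   19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
LINE_RE = re.compile(
    r'^(?P<stamp>\d{1,2} \w{3}, \d{2}:\d{2}:\d{2}) \[(?P<thread>\d+)\] '
    r'-(?P<level>[A-Z])- "\s*(?P<source>[^"]+)": (?P<message>.*)$'
)


def find_restarts(lines):
    """Pair every daemon start-up with the last connection logged before it.

    Returns a list of (connection_stamp, host, restart_stamp) tuples.
    connection_stamp and host are None if no connection was seen first.
    """
    restarts = []
    last_conn = None
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip anything that does not look like a log entry
        message = match.group("message")
        if message.startswith("Connection from:"):
            host = message.split(":", 1)[1].strip()
            last_conn = (match.group("stamp"), host)
        elif message.startswith("Starting Network Server"):
            stamp, host = last_conn if last_conn else (None, None)
            restarts.append((stamp, host, match.group("stamp")))
            last_conn = None
    return restarts


if __name__ == "__main__":
    # Path as given in the quoted message.
    with open("/var/edgwl/networkserver/log/events.log") as log:
        for conn_stamp, host, restart_stamp in find_restarts(log):
            print(f"restart at {restart_stamp}; "
                  f"last connection: {host} at {conn_stamp}")
```

Run on each of the three RBs, the list of hosts and timestamps could then be
compared across machines.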