Hi Maarten,

[log in to unmask] wrote:
> On Thu, 19 Apr 2007, Ronald Starink wrote:
>
>> Hi all,
>>
>> Since a few days, our resource brokers suffer from frequently crashing
>> network servers. The crashes occur when a new connection from the UI is
>> set up, either via an edg-job-submit or an edg-job-get-output. They do
>> not occur on every connection and are not related to connections by a
>> specific user or from a specific user interface.
>
> But a specific VO or CA? Did you try with an off-site UI as well?

Nothing specific: other than dteam, it was also observed for VOs ops,
biomed and alice. Same for CAs: DNs from the Netherlands, CERN, Israel, ...

>
>> The following fragment from /var/edgwl/networkserver/log/events.log
>> shows a crash on 17:39:04 when a new connection came in.
>>
>> 19 Apr, 17:38:16 [7] -I- "CFSI::doPurge": Preparing to Purge.
>> 19 Apr, 17:38:16 [7] -W- "CFSI::doPurge": JobId object for purging
>> created: https://bosheks.nikhef.nl:9000/1tRxIPpaaNiR4Z6vxFIZEw
>> 19 Apr, 17:38:16 [7] -F- "CFSI::LogPurgeJobN": Logging Purge Request.
>> 19 Apr, 17:38:18 [7] -I- "Manager::run": Command done
>> 19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
>> 19 Apr, 17:40:11 [0] -F- " NS::main":
>> --------------------------------------
>> 19 Apr, 17:40:11 [0] -F- " NS::main": Starting Network Server...
>> 19 Apr, 17:40:11 [0] -F- " NSR::drop": Already running in an
>> unprivileged account...
>> [.....]
>> 19 Apr, 17:40:11 [0] -F- " NS::main":
>> --------------------------------------
>> 19 Apr, 17:40:21 [2] -I- "Manager::run": Connection from: erf.nikhef.nl
>> 19 Apr, 17:40:22 [2] -I- "Manager::run": Authentication with host
>> erf.nikhef.nl succeeded for *************************
>>
>> Tracing the edg-wl-ns daemon indicates that there is a segmentation
>> violation:
>>
>> [root@boszwijn root]# strace -Ff -p 3999
>> Process 3999 attached - interrupt to quit
>> futex(0x9c6fe9c, FUTEX_WAIT, 1, NULL) = -1 EINTR (Interrupted system call)
>> +++ killed by SIGSEGV +++
>>
>>
>> This problem occurs since Tuesday on our three resource brokers that are
>
> Any firewall/router changes?
>

Not that I know of.

>> available for production: bosheks.nikhef.nl, boswachter.nikhef.nl and
>> boszwijn.nikhef.nl. The latter is a new machine that was first installed
>> last Tuesday. The three hosts combine the resource broker with a BDII.
>>
>> We did not perform middleware upgrades on Tuesday. The package versions
>> installed on the RB correspond to those specified in the meta package
>> lcg-RB-3.0.10-1.
>
> It probably is irrelevant to the problem at hand, but there is evidence
> against that last statement. On bosheks.nikhef.nl /var/log shows this:
>
> ----------------------------------------------------------------------
> -rw-r--r-- 1 root 29798 Apr 19 04:02 rpmpkgs
> -rw-r--r-- 1 root 8358 Apr 14 04:02 rpmpkgs.1.gz
> ----------------------------------------------------------------------
>
> I compared the files:
>
> ----------------------------------------------------------------------
> $ zcat rpmpkgs.1.gz | diff - rpmpkgs
> 194c194
> < condor-6.7.10-1.i386.rpm
> ---
> > condor-6.6.6-lcg3_sl3.i386.rpm
> ----------------------------------------------------------------------
>
> The former rpm is part of lcg-RB-3.0.10-1, the latter rpm is not.
> Why was it installed?
> The same code sits in condor-lcgrb-1.0.0-3,
> which is still being used, as /opt shows:
>
> ----------------------------------------------------------------------
> lrwxrwxrwx 1 root 13 Apr 19 11:13 condor -> condor-6.7.10
> ----------------------------------------------------------------------

This was true for bosheks, and it was a mistake that we corrected. But
remember there were 3 RBs for which we found this problem. On boswachter,
there was no change in the packages:

[root@boswachter root]# zcat /var/log/rpmpkgs.1.gz | diff - /var/log/rpmpkgs
[root@boswachter root]#

and our host boszwijn was freshly installed.

Cheers,
Ronald
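
For anyone who wants to check their own RB for the same pattern, here is a
minimal sketch that scans events.log for a "Connection from:" entry that is
followed directly by a "Starting Network Server..." entry, with nothing but
the separator banner in between. It assumes each log record sits on one line
in the format shown in the fragment above (the fragment is wrapped by the
mail client). The function name find_suspect_restarts and the regular
expression are illustrative only, not part of the middleware.

```python
import re

# Assumed record layout, taken from the quoted fragment, e.g.:
# 19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
EVENT = re.compile(
    r'^(?P<stamp>\d+ \w+, [\d:]+) \[\d+\] -\w- "\s*(?P<source>[^"]+)": (?P<msg>.*)$'
)


def find_suspect_restarts(lines):
    """Return (connection_stamp, host, restart_stamp) tuples for every
    server start whose last preceding event was an incoming connection."""
    suspects = []
    pending = None  # (stamp, host) of a connection with no follow-up event yet
    for line in lines:
        match = EVENT.match(line.strip())
        if not match:
            continue  # wrapped or unrecognised line
        stamp, msg = match.group("stamp"), match.group("msg")
        if msg.startswith("-----"):
            continue  # separator banner printed around a server start
        if msg.startswith("Connection from:"):
            pending = (stamp, msg.split(":", 1)[1].strip())
        elif msg.startswith("Starting Network Server"):
            if pending is not None:
                suspects.append((pending[0], pending[1], stamp))
            pending = None
        else:
            pending = None  # the connection was followed by normal activity
    return suspects
```

It could be run against /var/edgwl/networkserver/log/events.log by passing
the open file object as `lines`. A hit only says that the server restarted
right after a connection came in; it does not prove a segmentation fault.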