JISCMail - LCG-ROLLOUT Archives

Hi Maarten,

Maarten wrote:
> On Thu, 19 Apr 2007, Ronald Starink wrote:
> 
>> Hi all,
>>
>> For the past few days, our resource brokers have suffered from frequently crashing
>> network servers. The crashes occur when a new connection from the UI is
>> set up, either via an edg-job-submit or an edg-job-get-output. They do
>> not occur on every connection and are not related to connections by a
>> specific user or from a specific user interface.
> 
> But a specific VO or CA?  Did you try with an off-site UI as well?

Nothing specific: besides dteam, it was also observed for the VOs ops,
biomed and alice. Same for CAs: DNs from the Netherlands, CERN, Israel, ...

> 
>> The following fragment from /var/edgwl/networkserver/log/events.log
>> shows a crash at 17:39:04 when a new connection came in.
>>
>> 19 Apr, 17:38:16 [7] -I- "CFSI::doPurge": Preparing to Purge.
>> 19 Apr, 17:38:16 [7] -W- "CFSI::doPurge": JobId object for purging
>> created: https://bosheks.nikhef.nl:9000/1tRxIPpaaNiR4Z6vxFIZEw
>> 19 Apr, 17:38:16 [7] -F- "CFSI::LogPurgeJobN": Logging Purge Request.
>> 19 Apr, 17:38:18 [7] -I- "Manager::run": Command done
>> 19 Apr, 17:39:04 [9] -I- "Manager::run": Connection from: erf.nikhef.nl
>> 19 Apr, 17:40:11 [0] -F- "   NS::main":
>> --------------------------------------
>> 19 Apr, 17:40:11 [0] -F- "   NS::main": Starting Network Server...
>> 19 Apr, 17:40:11 [0] -F- "  NSR::drop": Already running in an
>> unprivileged account...
>> [.....]
>> 19 Apr, 17:40:11 [0] -F- "   NS::main":
>> --------------------------------------
>> 19 Apr, 17:40:21 [2] -I- "Manager::run": Connection from: erf.nikhef.nl
>> 19 Apr, 17:40:22 [2] -I- "Manager::run": Authentication with host
>> erf.nikhef.nl succeeded for *************************
>>
>> Tracing the edg-wl-ns daemon indicates that there is a segmentation
>> violation:
>>
>> [root@boszwijn root]# strace -Ff -p 3999
>> Process 3999 attached - interrupt to quit
>> futex(0x9c6fe9c, FUTEX_WAIT, 1, NULL)   = -1 EINTR (Interrupted system call)
>> +++ killed by SIGSEGV +++
>>
>>
>> This problem has occurred since Tuesday on our three resource brokers that are
> 
> Any firewall/router changes?
> 

Not that I know of.

>> available for production: bosheks.nikhef.nl, boswachter.nikhef.nl and
>> boszwijn.nikhef.nl. The latter is a new machine that was first installed
>> last Tuesday. The three hosts combine the resource broker with a BDII.
>>
>> We did not perform middleware upgrades on Tuesday. The package versions
>> installed on the RB correspond to those specified in the meta package
>> lcg-RB-3.0.10-1.
> 
> It probably is irrelevant to the problem at hand, but there is evidence
> against that last statement.  On bosheks.nikhef.nl /var/log shows this:
> 
> ----------------------------------------------------------------------
> -rw-r--r--    1 root        29798 Apr 19 04:02 rpmpkgs
> -rw-r--r--    1 root         8358 Apr 14 04:02 rpmpkgs.1.gz
> ----------------------------------------------------------------------
> 
> I compared the files:
> 
> ----------------------------------------------------------------------
> $ zcat rpmpkgs.1.gz | diff - rpmpkgs
> 194c194
> < condor-6.7.10-1.i386.rpm
> ---
>> condor-6.6.6-lcg3_sl3.i386.rpm
> ----------------------------------------------------------------------
> 
> The former rpm is part of lcg-RB-3.0.10-1, the latter rpm is not.
> Why was it installed?  The same code sits in condor-lcgrb-1.0.0-3,
> which is still being used, as /opt shows:
> 
> ----------------------------------------------------------------------
> lrwxrwxrwx    1 root           13 Apr 19 11:13 condor -> condor-6.7.10
> ----------------------------------------------------------------------

This was true for bosheks, and it was a mistake that we corrected. But
remember there were 3 RBs for which we found this problem. On
boswachter, there was no change in the packages:

[root@boswachter root]# zcat /var/log/rpmpkgs.1.gz | diff - /var/log/rpmpkgs
[root@boswachter root]#

and our host boszwijn was freshly installed.
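
For anyone who wants to repeat this check on their own RBs, here is a
minimal sketch of the same comparison wrapped in a shell function. The
function name rpmpkgs_changed is made up for illustration, and it
assumes the package lists sit in the same place as on our hosts
(/var/log/rpmpkgs and the rotated /var/log/rpmpkgs.1.gz).

```shell
# Hypothetical helper: compare a rotated, gzip-compressed package list
# with the current one and print a one-word verdict.
#   $1 = old list (gzip-compressed), $2 = current list (plain text)
rpmpkgs_changed() {
    # gzip -dc does the same job as zcat for .gz files
    gzip -dc "$1" | diff - "$2" >/dev/null
    # $? is the exit status of diff, the last command in the pipeline:
    # 0 = no differences, 1 = differences found, anything else = trouble
    case $? in
        0) echo "unchanged" ;;
        1) echo "changed" ;;
        *) echo "error"; return 2 ;;
    esac
}
```

It would be called as

    rpmpkgs_changed /var/log/rpmpkgs.1.gz /var/log/rpmpkgs

Note that it only says whether the two lists differ; to see which
packages changed, run the plain zcat | diff as above.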

Cheers,
Ronald