On Fri, 28 Jan 2005, Davide Salomoni wrote:

> Hi,
>
> our RB (boswachter.nikhef.nl) became quite unresponsive to job
> submissions lately. /var/log/messages was showing a lot of
>
> Jan 28 09:25:31 boswachter edg-wl-interlogd[4312]: queue_thread:
> event_queue_connect: edg_wll_ssl_connect
> Jan 28 09:25:33 boswachter edg-wl-interlogd[23301]: queue_thread:
> event_queue_connect: edg_wll_ssl_connect
> Jan 28 09:26:31 boswachter edg-wl-interlogd[4312]: queue_thread:
> event_queue_connect: edg_wll_ssl_connect
> Jan 28 09:26:33 boswachter edg-wl-interlogd[23301]: queue_thread:
> event_queue_connect: edg_wll_ssl_connect
> Jan 28 09:26:35 boswachter lcg-bdii:  succeeded
> Jan 28 09:26:35 boswachter last message repeated 2 times
>
> and not much more. edg-job-submit was apparently hanging; strace'ing it
> showed it was blocking on a recv(15, ...), which was linked to a
>
> connect(15, {sin_family=AF_INET, sin_port=htons(7772),
> sin_addr=inet_addr("192.16.186.218")}, 16) = 0
>
> that is, a read call from boswachter.nikhef.nl:7772 (the RB). netstat on
> boswachter showed quite a few of
>
> [root@boswachter root]# netstat | grep 7772
> tcp       99      0 boswachter.nikhef.:7772 ccedgli05.in2p3.f:59386
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 ccedgli05.in2p3.f:59288
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 ccedgli05.in2p3.f:59284
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 ccedgli05.in2p3.f:59285
> CLOSE_WAIT
> tcp       98      0 boswachter.nikhef.:7772 ccedgli05.in2p3.f:59286
> ESTABLISHED
> tcp       99      0 boswachter.nikhef.:7772 ccedgli05.in2p3.f:59287
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 hoeve.nikhef.nl:53058
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 ccedgli05.in2p3.f:59213
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 hoeve.nikhef.nl:53148
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 ccedgli05.in2p3.f:59248
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:43268
> CLOSE_WAIT
> tcp       98      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:44567
> ESTABLISHED
> tcp        7      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:45180
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:43073
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:44973
> CLOSE_WAIT
> tcp        1      0 boswachter.nikhef.:7772 estel.nikhef.nl:51335
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:42380
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:44950
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:43167
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:43239
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:43255
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:43254
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:43002
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:45008
> CLOSE_WAIT
> tcp       99      0 boswachter.nikhef.:7772 tbn04.nikhef.nl:42964
> CLOSE_WAIT
>
> The only entries in /var/edgwl/logmonitor/log/events.log on the RB were
>
> 28 Jan, 09:10:23 -D- MonitorLoop::run(): Examining file:
> CondorG.1095744595.log
> 28 Jan, 09:10:23 -D- MonitorLoop::run(): Examining file:
> CondorG.1098888301.log
> [...]
>
> i.e. no "CondorMonitor::processEvent(...): Got job submit event."
> messages or similar.

Port 7772 is of the Network Server: can you have a look at the contents
of /var/edgwl/networkserver/log/events.log around the time it hung?

Which LCG version are you running?

There was a bug in the NS that might cause the behavior you reported:
if an NS client is killed (e.g. interrupted) at the wrong moment,
the thread handling the request hangs forever; by default the NS has
8 threads to handle requests, so after 8 untimely interruptions the NS
will hang completely.  LCG-2_4_0 fixes this with a 5-minute timeout
per connection (long enough to allow a sandbox transfer over a slow link).
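A quick way to see whether the thread pool is exhausted is to tally the
NS connections by TCP state; a minimal sketch (the :7772 port comes from
the netstat output above, everything else is illustrative):

```shell
# count_by_state: read netstat-style lines on stdin and tally the TCP
# state ($6) of every connection whose local address ($4) ends in :7772.
# A CLOSE_WAIT count near the NS thread-pool size (8 by default) suggests
# all request-handling threads are wedged.
count_by_state() {
    awk '$4 ~ /:7772$/ { n[$6]++ } END { for (s in n) print s, n[s] }'
}

# Typical use on the RB (hypothetical invocation):
#   netstat -tn | count_by_state
```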

> We restarted the WMS daemons and things seem to be better now, i.e.
> edg-job-submit works again, /var/edgwl/logmonitor/log/events.log shows
> meaningful activity, and so on.
>
> Short of restarting the daemons, is there a better way to
> check/debug/fix what's going on when something like this happens?

With "lsof -i :7772" you would find the processes/threads that have
port 7772 open (i.e. all Network Server threads), and then you might
connect to one of them with "gdb", cycle through all the threads and
capture the stack trace of each one, and send us the results.
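The lsof/gdb procedure above can be scripted non-interactively; a sketch,
assuming gdb and lsof are installed on the RB and that the NS process
owns port 7772 (the function name and output file are made up here):

```shell
# capture_backtraces PORT FILE: attach gdb to the first process holding
# the given TCP port and dump a backtrace of every thread into FILE.
capture_backtraces() {
    pid=$(lsof -t -i ":$1" 2>/dev/null | head -1)
    [ -n "$pid" ] || { echo "no process on port $1" >&2; return 1; }
    # -batch runs the command and detaches; the daemon keeps running.
    gdb -p "$pid" -batch -ex 'thread apply all bt' > "$2"
}

# On the RB one would run (hypothetical invocation):
#   capture_backtraces 7772 ns-backtraces.txt
```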

I have attached a small script "chk-wl.sh" with which one can usually
assess the state of the WP1 daemons.  Since the NS, WM, JC and LM
use file locking between them, the output starts with the currently
registered file locks: it should not show any (long-term) contention.
Next it lists one instance of each WP1 program running; NS/WM/JC/LM
are marked with a '<' for easy recognition (in the past it was not
unusual to have one of them missing).  Finally it shows the last 3
lines of the logfiles of the same "fabulous four".
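For readers without the attachment, a rough sketch of the kind of checks
such a script performs — NOT the attached chk-wl.sh; the daemon name
patterns and any paths beyond those quoted above are assumptions:

```shell
# chk_wl: hypothetical health check in the spirit of chk-wl.sh.
chk_wl() {
    logdir=/var/edgwl

    echo "== file locks =="              # long-lived entries hint at contention
    [ -r /proc/locks ] && cat /proc/locks

    echo "== WP1 daemons =="
    for d in networkserver workload_manager jobcontroller logmonitor; do
        pgrep -fl "$d" 2>/dev/null | head -1 | sed 's/^/< /'  # '<' marks each
    done

    echo "== recent log activity =="
    for f in "$logdir"/networkserver/log/events.log \
             "$logdir"/logmonitor/log/events.log; do
        echo "--- $f"
        [ -f "$f" ] && tail -3 "$f" || echo "(not found)"
    done
}
```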