Hi all,
here are two more problems with the LCG-1 RB:

1. The condor_schedd has a huge memory leak (fixed in LCG-2), as shown in this
    example on the CERN RB:

   PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
21169 edguser    9   0 1979M 587M  1404 S     0.0 29.1 101:23 condor_schedd

Depending on the RB's available memory and swap space, one has to restart
the Job Controller once in a while:

    /etc/init.d/edg-wl-jc stop
    /etc/init.d/edg-wl-jc status        # check that it stopped
    /etc/init.d/edg-wl-jc start
    /etc/init.d/edg-wl-jc status        # check that it started

Note: there is a "restart" command for each of the WL services,
and usually it works, but *not* for the NS (also fixed in LCG-2).
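
If you want to automate this, a cron job along these lines could work
(just a sketch: the RSS threshold below is my own guess, tune it to
your RB's memory and swap):

    #!/bin/sh
    # Restart the JC when condor_schedd has grown too big.
    # ps prints the RSS in kB; the threshold below is ~1.5 GB (assumed).
    rss=`ps -C condor_schedd -o rss= | sort -n | tail -1`
    if [ -n "$rss" ] && [ "$rss" -gt 1500000 ]; then
        /etc/init.d/edg-wl-jc stop
        /etc/init.d/edg-wl-jc start
    fi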

2. It is time to restart the NS when it keeps complaining like this:

28 Dec, 14:38:46 - Listener: Exception Caught: Failed to acquire credentials...
28 Dec, 14:54:32 - Listener: Exception Caught: Failed to acquire credentials...
28 Dec, 15:10:17 - Listener: Exception Caught: Failed to acquire credentials...
[etc.]
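
A crude way to detect that condition in a cron job (sketch: the log
path is from memory, and I assume the init script name is analogous
to edg-wl-jc; remember that "restart" does *not* work for the NS,
hence the stop + start):

    #!/bin/sh
    # Restart the NS when its recent log entries are mostly
    # "Failed to acquire credentials" errors.
    NSLOG=/var/edgwl/networkserver/log/events.log    # path assumed
    if [ `tail -20 "$NSLOG" | grep -c 'Failed to acquire credentials'` -ge 10 ]
    then
        /etc/init.d/edg-wl-ns stop
        /etc/init.d/edg-wl-ns start
    fi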

Having hammered the LCG-2 RB pretty hard for 2 weeks, we have not yet
needed to take *any* corrective action...
Cheers,
        Maarten

Maarten Litmaath wrote:

> Hi Martin, et al,
>
> on the LCG-1 system the RB has a few problems that require regular
> manual intervention:
>
> 1. Depending on the amount of activity in the system, a deadlock may
>    occur between various WP1 daemons that all try to access the same
>    shared file.  We have seen these cases:
>
>    - LogMonitor (LM), Workload Manager (WM), Network Server (NS);
>    - Job Controller (JC), LM, WM;
>    - LM, WM.
>
>    I will forward a script "check-RB.pl" that I wrote to detect the
>    deadlock and recover from it "automatically" (I even have an RPM
>    for it, but LCG-2 will be installed early January, and LCG-2 no
>    longer has the deadlocks).  The script will also try to restart
>    any service that is absent.
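> 
>    In the meantime you can check for a deadlock by hand; on our
>    kernels a blocked lock request shows up as a "->" line in
>    /proc/locks, so e.g.
> 
>        grep -c -- '->' /proc/locks
> 
>    should normally print 0; a persistent nonzero count means some
>    daemons are waiting on each other's locks.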
>
> 2. Occasionally the NS becomes "autistic": it is running, but does
>    not accept any requests and does not log any errors; it simply
>    must be restarted.  The only clue is that its logfile does not
>    get updated, so the last entry will appear to be rather old.
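> 
>    That check is easy to script; a sketch (logfile path from memory,
>    and 30 minutes is an arbitrary staleness threshold):
> 
>        # complain when no NS logfile was modified in the last 30 min
>        if [ -z "`find /var/edgwl/networkserver/log -type f -mmin -30`" ]
>        then
>            echo "NS logfile stale - the NS probably needs a restart"
>        fi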
>
> 3. The WM has a file descriptor leak.  Find the PID of the WM process
>    and run "lsof -p $PID | wc -l" once or twice per day: when it
>    reaches ~1024, the WM must be stopped and restarted.  There will
>    be plenty of error messages in its logfile if the restart is not
>    done in time.
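> 
>    Something along these lines can do the counting for you (sketch;
>    I match the process by name, assuming the binary is called
>    edg-wl-workload_manager):
> 
>        PID=`pgrep -f edg-wl-workload_manager | head -1`
>        if [ -n "$PID" ] && [ `lsof -p $PID | wc -l` -ge 1000 ]; then
>            echo "WM is close to the fd limit - restart it now"
>        fi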
>
> I will also forward a small script "chk-wl.sh" that shows the status
> of the various daemons: at the top it shows the status of the file
> locks in the system (usually 3 lines on a quiet system; if there are
> many lines, run "check-RB.pl" to recover from a deadlock); next it
> shows one instance of every WP1 process, marking the "fabulous four"
> (JC/LM/NS/WM) for easy recognition; finally, it shows the last 3
> lines of the logfiles of that foursome.
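> 
> Roughly, chk-wl.sh does the equivalent of this (a sketch only; the
> logfile paths here are from memory):
> 
>     cat /proc/locks                          # file lock status
>     ps -ef | egrep 'edg-wl|condor' | grep -v grep
>     for d in jobcontroller logmonitor networkserver workload_manager
>     do
>         tail -3 /var/edgwl/$d/log/events.log
>     done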
>
> I run that script a couple of times per day on the CERN RB; I run
> "check-RB.pl" as needed.