Hi Martin, et al,
on the LCG-1 system the RB has a few problems that require regular
manual intervention:
1. Depending on the amount of activity in the system, a deadlock may
occur between various WP1 daemons that all try to access the same
shared file. We have seen these cases:
- LogMonitor (LM), Workload Manager (WM), Network Server (NS);
- Job Controller (JC), LM, WM;
- LM, WM.
I will forward a script "check-RB.pl" that I wrote to detect the
deadlock and recover from it "automatically" (I even have an RPM
for it, but LCG-2 will be installed early January, and LCG-2 no
longer has the deadlocks). The script will also try to restart
any service that is not running.
2. Occasionally the NS becomes "autistic": it is running, but does
not accept any requests and does not log any errors; it simply
must be restarted. The only clue is that its logfile does not
get updated, so the last entry will appear to be rather old.
3. The WM has a file descriptor leak. Find the process of the WM
and do an "lsof -p $PID | wc -l" once or twice per day: when it
reaches ~1024, the WM must be stopped and restarted. If the restart
is not done in time, there will be plenty of error messages in its
logfile.
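
The descriptor check in item 3 could be wrapped up roughly as follows.
This is only a sketch, not part of "check-RB.pl" or "chk-wl.sh": the
function names and the default threshold of 1000 (a little under the
~1024 limit) are assumptions, and the PID of the WM has to be supplied
by hand.

```shell
#!/bin/sh
# Sketch only: fd_count, fd_status and the threshold are hypothetical.

# Count the lines lsof reports for a PID (same command as in item 3).
fd_count() {
    lsof -p "$1" 2>/dev/null | wc -l | tr -d ' '
}

# Print RESTART when the count has reached the threshold, OK otherwise.
# $1 = count, $2 = threshold
fd_status() {
    if [ "$1" -ge "$2" ]; then
        echo "RESTART"
    else
        echo "OK"
    fi
}

# Usage: script PID [THRESHOLD]   (threshold defaults to an assumed 1000)
if [ $# -ge 1 ]; then
    n=$(fd_count "$1")
    echo "PID $1: $n lsof entries: $(fd_status "$n" "${2:-1000}")"
fi
```

Run from cron once or twice per day with the PID of the WM, it only
reports; stopping and restarting the WM is still done by hand.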
I will also forward a small script "chk-wl.sh" that shows the status
of the various daemons: at the top it shows the status of the file
locks in the system (usually 3 lines on a quiet system; if there are
many lines, run "check-RB.pl" to recover from a deadlock); next it
shows one instance of every WP1 process, marking the "fabulous four"
(JC/LM/NS/WM) for easy recognition; finally, it shows the last 3
lines of the logfiles of that foursome.
I run that script a couple of times per day on the CERN RB; I run
"check-RB.pl" as needed.
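
For the silent NS of item 2, the "old last entry" clue can be tested
on the modification time of its logfile. The sketch below is an
assumption of mine, not taken from either script: the function name
and the 30-minute default are hypothetical, the logfile path must be
given on the command line, and it relies on "find -mmin", which is
available in GNU find but is not POSIX.

```shell
#!/bin/sh
# Sketch only: log_is_stale and the 30-minute default are hypothetical.

# Succeed if FILE was last modified more than MINUTES minutes ago.
# A missing file is reported as not stale, since find prints nothing.
# $1 = FILE, $2 = MINUTES
log_is_stale() {
    [ -n "$(find "$1" -mmin +"$2" 2>/dev/null)" ]
}

# Usage: script LOGFILE [MINUTES]
if [ $# -ge 1 ]; then
    if log_is_stale "$1" "${2:-30}"; then
        echo "STALE: $1 (NS may need a restart)"
    else
        echo "fresh: $1"
    fi
fi
```

On a quiet RB the NS logfile may legitimately go without updates for a
while, so the number of minutes needs to be tuned to the local load.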