Hi all,

here are two more problems with the LCG-1 RB:

1. The condor_schedd has a huge memory leak (fixed in LCG-2), as shown
   in this example on the CERN RB:

       PID USER    PRI NI SIZE  RSS  SHARE STAT %CPU %MEM TIME   COMMAND
     21169 edguser   9  0 1979M 587M  1404 S     0.0 29.1 101:23 condor_schedd

   Depending on the RB's available memory and swap space, one has to
   restart the Job Controller once in a while:

     /etc/init.d/edg-wl-jc stop
     /etc/init.d/edg-wl-jc status   # check that it stopped
     /etc/init.d/edg-wl-jc start
     /etc/init.d/edg-wl-jc status   # check that it started

   Note: there is a "restart" command for each of the WL services, and
   usually it works, but *not* for the NS (also fixed in LCG-2).

2. It is time to restart the NS when it continuously complains like this:

     28 Dec, 14:38:46 - Listener: Exception Caught: Failed to acquire credentials...
     28 Dec, 14:54:32 - Listener: Exception Caught: Failed to acquire credentials...
     28 Dec, 15:10:17 - Listener: Exception Caught: Failed to acquire credentials...
     [etc.]

Having hammered the LCG-2 RB pretty hard for two weeks, we have still
had no need to take *any* corrective action...

Cheers,
Maarten

Maarten Litmaath wrote:
> Hi Martin, et al,
>
> on the LCG-1 system the RB has a few problems that require regular
> manual intervention:
>
> 1. Depending on the amount of activity in the system, a deadlock may
>    occur between various WP1 daemons that all try to access the same
>    shared file. We have seen these cases:
>
>    - LogMonitor (LM), Workload Manager (WM), Network Server (NS);
>    - Job Controller (JC), LM, WM;
>    - LM, WM.
>
>    I will forward a script "check-RB.pl" that I wrote to detect the
>    deadlock and recover from it "automatically" (I even have an RPM
>    for it, but LCG-2 will be installed in early January, and LCG-2 no
>    longer has the deadlocks). The script will also try to restart
>    any service that is absent.
>
> 2. Occasionally the NS becomes "autistic": it is running, but does
>    not accept any requests and does not log any errors; it simply
>    must be restarted. The only clue is that its logfile does not
>    get updated, so the last entry will appear to be rather old.
>
> 3. The WM has a file descriptor leak. Find the PID of the WM
>    and run "lsof -p $PID | wc -l" once or twice per day: when it
>    reaches ~1024, the WM must be stopped and restarted. There will
>    be plenty of error messages in its logfile if the restart is
>    not done in time.
>
> I will also forward a small script "chk-wl.sh" that shows the status
> of the various daemons: at the top it shows the status of the file
> locks in the system (usually 3 lines on a quiet system; if there are
> many lines, run "check-RB.pl" to recover from a deadlock); next it
> shows one instance of every WP1 process, marking the "fabulous four"
> (JC/LM/NS/WM) for easy recognition; finally, it shows the last 3
> lines of the logfiles of that foursome.
>
> I run that script a couple of times per day on the CERN RB; I run
> "check-RB.pl" as needed.
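For what it's worth, the two periodic checks above (the WM descriptor
count from point 3 and the stale-NS-logfile test from point 2) could be
sketched roughly like this. This is NOT check-RB.pl or chk-wl.sh; the
helper names, thresholds, and the commented-out paths are illustrative
assumptions only:

```shell
#!/bin/sh
# Hedged sketch of the manual RB health checks described above.
# Thresholds and paths are assumptions, not the real scripts' values.

FD_LIMIT=900        # warn somewhat before the ~1024 descriptor ceiling
LOG_MAX_AGE=3600    # seconds without a logfile update before the NS looks stuck

# Count open file descriptors of a process, same idea as
# "lsof -p $PID | wc -l" but via /proc (Linux only).
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Seconds since a file was last modified (GNU stat).
log_age() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1" 2>/dev/null) || mtime=$now
    echo $((now - mtime))
}

# Example usage (PID lookup and logfile path are assumptions):
#   wm_pid=$(pgrep -f workload_manager)
#   [ "$(fd_count "$wm_pid")" -ge "$FD_LIMIT" ] \
#       && echo "WM: fd count high, restart needed"
#   [ "$(log_age /var/log/ns-events.log)" -gt "$LOG_MAX_AGE" ] \
#       && echo "NS: logfile stale, restart needed"
```

Run from cron once or twice per day, this would flag the WM leak before
it hits the descriptor limit and catch a silent NS, instead of relying
on someone remembering to look.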