Hi,

SYSTEMS AFFECTED: lcgrb01.gridpp.rl.ac.uk - RAL - LCG RB

ACTION(S)

Due to the deadlock in
http://marianne.in2p3.fr/datagrid/bugzilla/show_bug.cgi?id=2054
the following was done.

---------------------------
/etc/init.d/edg-wl-jc stop
/etc/init.d/edg-wl-lm stop
/etc/init.d/edg-wl-wm stop
/etc/init.d/edg-wl-ns stop

rm /tmp/workload_manager/input.fl
(Not exactly sure what the consequences of this are but it does the job.)

/etc/init.d/edg-wl-jc start
/etc/init.d/edg-wl-lm start
/etc/init.d/edg-wl-wm start
sleep 60
/etc/init.d/edg-wl-ns start
---------------------------

WP1 has already fixed this for real; the fix will appear with the LCG2 RB.

PROBLEMS RESOLVED

RAL RB operational again.
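For anyone hitting the same deadlock, the above can be run as a single
script (a sketch only: it uses the same init scripts and filelist path
as above, and the reason for the 60 second sleep before edg-wl-ns is my
assumption, mirroring the procedure):

---------------------------
#!/bin/sh
# Workaround for the WM/JC deadlock (Bugzilla bug 2054): stop the
# workload management daemons, clear the workload manager's input
# filelist, then restart.  Run as root on the RB node.

# Stop the daemons in the order used above.
for svc in edg-wl-jc edg-wl-lm edg-wl-wm edg-wl-ns ; do
    /etc/init.d/$svc stop
done

# Remove the input filelist.  As noted above, the exact consequences
# of this are unclear, but it clears the deadlock.
rm -f /tmp/workload_manager/input.fl

# Restart, holding the network server back for a minute - presumably
# so the other daemons are up before new requests arrive.
for svc in edg-wl-jc edg-wl-lm edg-wl-wm ; do
    /etc/init.d/$svc start
done
sleep 60
/etc/init.d/edg-wl-ns start
---------------------------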
On Wed, 10 Dec 2003, Bly, MJ (Martin) wrote:

> The problem has not resolved overnight, but then I didn't expect it to.
>
> I need some answers on this if the sites relying on the RAL RB are
> going to get any sort of service any time soon.
>
> Reinstallation = defeat.
>
> Martin.
> --
> -------------------------------------------------------
> Martin Bly | +44 1235 446981 | [log in to unmask]
> Systems Admin, Tier 1/A Service, RAL PPD CSG
> -------------------------------------------------------
>
> > -----Original Message-----
> > From: Martin Bly [mailto:[log in to unmask]]
> > Sent: Tuesday, December 09, 2003 5:32 PM
> > To: [log in to unmask]
> > Cc: Martin Bly
> > Subject: RAL RB very sick.
> >
> > The RAL RB is somewhat sick, despite having some ports opened in our
> > firewall. The contents of /opt/globus/var/condor/log/Schedlog, for
> > example:
> >
> > 12/9 17:14:14 ******************************************************
> > 12/9 17:14:14 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> > 12/9 17:14:14 ** $CondorVersion: 6.5.3 Jun 16 2003 PRE-RELEASE $
> > 12/9 17:14:14 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
> > 12/9 17:14:14 ** PID = 7302
> > 12/9 17:14:14 ******************************************************
> > 12/9 17:14:14 Using config file: /opt/condor/etc/condor.conf
> > 12/9 17:14:15 DaemonCore: Command Socket at <130.246.183.184:32772>
> > 12/9 17:14:15 "/opt/condor/sbin/condor_shadow -classad" did not produce any output, ignoring
> > 12/9 17:14:15 "/opt/condor/sbin/condor_shadow.pvm -classad" did not produce any output, ignoring
> > 12/9 17:14:15 "/opt/condor/sbin/condor_shadow.std -classad" did not produce any output, ignoring
> > 12/9 17:16:32 Sent ad to central manager for [log in to unmask]
> > 12/9 17:16:32 Removed old scratch dir /tmp/condor_g_scratch.0xb2c9ea0.14366
> > 12/9 17:16:33 Removed old scratch dir /tmp/condor_g_scratch.0xb2ca040.14366
> > 12/9 17:16:33 Started condor_gmanager for owner edguser pid=7615
> > 12/9 17:16:33 Started condor_gmanager for owner edguser pid=7616
> > 12/9 17:16:33 condor_write(): Socket closed when trying to write buffer
> > 12/9 17:16:33 Buf::write(): condor_write() failed
> > 12/9 17:16:33 SECMAN: Error sending response classad!
> > 12/9 17:16:33 DaemonCore: Command received via TCP from host <130.246.183.184:50031>
> > 12/9 17:16:33 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> > 12/9 17:16:34 DaemonCore: Command received via TCP from host <130.246.183.184:50086>
> > 12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> > 12/9 17:16:34 DaemonCore: Command received via TCP from host <130.246.183.184:50087>
> > 12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> > 12/9 17:16:34 DaemonCore: Command received via TCP from host <130.246.183.184:50090>
> > 12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> > 12/9 17:16:35 DaemonCore: Command received via TCP from host <130.246.183.184:50092>
> > 12/9 17:16:35 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> > 12/9 17:16:35 DaemonCore: Command received via TCP from host <130.246.183.184:50093>
> > 12/9 17:16:35 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> > 12/9 17:22:10 Sent ad to central manager for [log in to unmask]
> > 12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
> > 12/9 17:22:10 Buf::write(): condor_write() failed
> > 12/9 17:22:10 SECMAN: Error sending response classad!
> > 12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
> > 12/9 17:22:10 Buf::write(): condor_write() failed
> > 12/9 17:22:10 AUTHENTICATE: handshake failed!
> > 12/9 17:22:10 DaemonCore: Command received via TCP from host <130.246.183.184:50130>
> > 12/9 17:22:10 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> > 12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
> > 12/9 17:22:10 Buf::write(): condor_write() failed
> > 12/9 17:22:10 AUTHENTICATE: handshake failed!
> > 12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
> > 12/9 17:22:10 Buf::write(): condor_write() failed
> > 12/9 17:22:10 AUTHENTICATE: handshake failed!
> > [root@lcgrb01 log]#
> >
> > The edg-wl-bkserverd processes are filling /var/log/messages with:
> >
> > Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7316]: File exists (duplicate event)
> > Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
> > Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
> > Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7307]: No such file or directory (job not registered)
> > Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7309]: No such file or directory (job not registered)
> > Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
> > Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7321]: No such file or directory (job not registered)
> >
> > several times a second.
> >
> > I predict that the system will be full by the morning, and therefore
> > unusable if it isn't already. Clearly I'd like to turn it off and
> > redeploy the box into our main batch system, but that would screw up
> > the LCG, so has anyone got any ideas about this?
> >
> > Martin.

--
Steve Traylen
[log in to unmask]
http://www.gridpp.ac.uk/
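PS: a minimal loop for watching the log growth Martin describes above
(a sketch only; it assumes syslog is writing to /var/log/messages as in
his excerpt):

---------------------------
#!/bin/sh
# Once a minute, print the running count of edg-wl-bkserverd lines in
# /var/log/messages and the free space on the filesystem holding it,
# to see how quickly the disk is actually filling.
while true ; do
    date
    grep -c 'edg-wl-bkserverd' /var/log/messages
    df -h /var/log
    sleep 60
done
---------------------------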