Hi,

SYSTEMS AFFECTED:
lcgrb01.gridpp.rl.ac.uk - RAL - LCG RB

ACTION(S):
Due to the deadlock described in
http://marianne.in2p3.fr/datagrid/bugzilla/show_bug.cgi?id=2054

the following was done:

---------------------------
/etc/init.d/edg-wl-jc stop
/etc/init.d/edg-wl-lm stop
/etc/init.d/edg-wl-wm stop
/etc/init.d/edg-wl-ns stop
rm /tmp/workload_manager/input.fl
(Not exactly sure what the consequences of this are, but it does the job.)
/etc/init.d/edg-wl-jc start
/etc/init.d/edg-wl-lm start
/etc/init.d/edg-wl-wm start
sleep 60
/etc/init.d/edg-wl-ns start
---------------------------
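
For anyone hitting the same deadlock, here is the same sequence wrapped up as
a small script. It is only a sketch: the init scripts and the
/tmp/workload_manager/input.fl path are exactly those listed above, but the
loop over the services and the idea of moving the filelist aside (rather than
an outright rm) are my own assumptions, not something we tested on lcgrb01.

---------------------------
#!/bin/sh
# Sketch of the recovery sequence above for the WMS deadlock on an RB.
# Assumptions: the edg-wl init scripts and the filelist path are as on
# lcgrb01; moving input.fl aside instead of rm-ing it is my own precaution.

FILELIST=/tmp/workload_manager/input.fl

# Stop the WMS daemons (job controller, log monitor, workload manager,
# network server) before touching the filelist.
for svc in edg-wl-jc edg-wl-lm edg-wl-wm edg-wl-ns; do
    /etc/init.d/$svc stop
done

# Move the stuck input filelist aside so it can still be inspected later.
if [ -f "$FILELIST" ]; then
    mv "$FILELIST" "$FILELIST.stuck.`date +%Y%m%d%H%M`"
fi

# Restart everything, giving the workload manager a minute to settle
# before the network server starts accepting jobs again.
/etc/init.d/edg-wl-jc start
/etc/init.d/edg-wl-lm start
/etc/init.d/edg-wl-wm start
sleep 60
/etc/init.d/edg-wl-ns start
---------------------------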

WP1 has already fixed this properly; the fix will appear with the
LCG2 RB.

PROBLEMS RESOLVED:
The RAL RB is operational again.
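
As a quick sanity check afterwards (not part of the fix itself), something
like the following confirms the daemons came back and that the bkserverd
flood Martin describes below has stopped. Assumptions on my part: pgrep is
installed and /var/log/messages is where bkserverd logs, as in his mail.

---------------------------
#!/bin/sh
# Post-restart check: list the running edg-wl daemons and count recent
# bkserverd complaints in syslog.
# Assumptions: pgrep (procps) is available; bkserverd logs to
# /var/log/messages as shown in the quoted report below.

# List whatever edg-wl processes are actually running.
pgrep -fl edg-wl

# Count bkserverd lines in the last 500 syslog entries; after a clean
# restart this should be near zero rather than several per second.
tail -500 /var/log/messages | grep -c edg-wl-bkserverd
---------------------------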


On Wed, 10 Dec 2003, Bly, MJ (Martin) wrote:

> The problem has not resolved overnight, but then I didn't expect it to.
>
> I need some answers on this if the sites relying on the RAL are going to get
> any sort of service any time soon.
>
> Reinstallation = defeat.
>
> Martin.
> --
>   -------------------------------------------------------
>     Martin Bly  |  +44 1235 446981  |  [log in to unmask]
>        Systems Admin, Tier 1/A Service,  RAL PPD CSG
>   -------------------------------------------------------
>
> > -----Original Message-----
> > From: Martin Bly [mailto:[log in to unmask]]
> > Sent: Tuesday, December 09, 2003 5:32 PM
> > To: [log in to unmask]
> > Cc: Martin Bly
> > Subject: RAL RB very sick.
> >
> >
> > The RAL RB is somewhat sick, despite having some ports opened in our
> > firewall. The contents of /opt/globus/var/condor/log/Schedlog, for example:
> >
> > 12/9 17:14:14 ******************************************************
> > 12/9 17:14:14 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> > 12/9 17:14:14 ** $CondorVersion: 6.5.3 Jun 16 2003 PRE-RELEASE $
> > 12/9 17:14:14 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
> > 12/9 17:14:14 ** PID = 7302
> > 12/9 17:14:14 ******************************************************
> > 12/9 17:14:14 Using config file: /opt/condor/etc/condor.conf
> > 12/9 17:14:15 DaemonCore: Command Socket at <130.246.183.184:32772>
> > 12/9 17:14:15 "/opt/condor/sbin/condor_shadow -classad" did
> > not produce any
> > output, ignoring
> > 12/9 17:14:15 "/opt/condor/sbin/condor_shadow.pvm -classad"
> > did not produce
> > any output, ignoring
> > 12/9 17:14:15 "/opt/condor/sbin/condor_shadow.std -classad"
> > did not produce
> > any output, ignoring
> > 12/9 17:16:32 Sent ad to central manager for
> > [log in to unmask]
> > 12/9 17:16:32 Removed old scratch dir
> > /tmp/condor_g_scratch.0xb2c9ea0.14366
> > 12/9 17:16:33 Removed old scratch dir
> > /tmp/condor_g_scratch.0xb2ca040.14366
> > 12/9 17:16:33 Started condor_gmanager for owner edguser pid=7615
> > 12/9 17:16:33 Started condor_gmanager for owner edguser pid=7616
> > 12/9 17:16:33 condor_write(): Socket closed when trying to
> > write buffer
> > 12/9 17:16:33 Buf::write(): condor_write() failed
> > 12/9 17:16:33 SECMAN: Error sending response classad!
> > 12/9 17:16:33 DaemonCore: Command received via TCP from host
> > <130.246.183.184:50031>
> > 12/9 17:16:33 DaemonCore: received command 478 (ACT_ON_JOBS), calling
> > handler (actOnJobs)
> > 12/9 17:16:34 DaemonCore: Command received via TCP from host
> > <130.246.183.184:50086>
> > 12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling
> > handler (actOnJobs)
> > 12/9 17:16:34 DaemonCore: Command received via TCP from host
> > <130.246.183.184:50087>
> > 12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling
> > handler (actOnJobs)
> > 12/9 17:16:34 DaemonCore: Command received via TCP from host
> > <130.246.183.184:50090>
> > 12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling
> > handler (actOnJobs)
> > 12/9 17:16:35 DaemonCore: Command received via TCP from host
> > <130.246.183.184:50092>
> > 12/9 17:16:35 DaemonCore: received command 478 (ACT_ON_JOBS), calling
> > handler (actOnJobs)
> > 12/9 17:16:35 DaemonCore: Command received via TCP from host
> > <130.246.183.184:50093>
> > 12/9 17:16:35 DaemonCore: received command 478 (ACT_ON_JOBS), calling
> > handler (actOnJobs)
> > 12/9 17:22:10 Sent ad to central manager for
> > [log in to unmask]
> > 12/9 17:22:10 condor_write(): Socket closed when trying to
> > write buffer
> > 12/9 17:22:10 Buf::write(): condor_write() failed
> > 12/9 17:22:10 SECMAN: Error sending response classad!
> > 12/9 17:22:10 condor_write(): Socket closed when trying to
> > write buffer
> > 12/9 17:22:10 Buf::write(): condor_write() failed
> > 12/9 17:22:10 AUTHENTICATE: handshake failed!
> > 12/9 17:22:10 DaemonCore: Command received via TCP from host
> > <130.246.183.184:50130>
> > 12/9 17:22:10 DaemonCore: received command 478 (ACT_ON_JOBS), calling
> > handler (actOnJobs)
> > 12/9 17:22:10 condor_write(): Socket closed when trying to
> > write buffer
> > 12/9 17:22:10 Buf::write(): condor_write() failed
> > 12/9 17:22:10 AUTHENTICATE: handshake failed!
> > 12/9 17:22:10 condor_write(): Socket closed when trying to
> > write buffer
> > 12/9 17:22:10 Buf::write(): condor_write() failed
> > 12/9 17:22:10 AUTHENTICATE: handshake failed!
> > [root@lcgrb01 log]#
> >
> > The edg-wl-bkserverd processes are filling /var/log/messages with:
> >
> > Dec  9 17:29:46 lcgrb01 edg-wl-bkserverd[7316]: File exists (duplicate event)
> > Dec  9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
> > Dec  9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
> > Dec  9 17:29:46 lcgrb01 edg-wl-bkserverd[7307]: No such file or directory (job not registered)
> > Dec  9 17:29:46 lcgrb01 edg-wl-bkserverd[7309]: No such file or directory (job not registered)
> > Dec  9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
> > Dec  9 17:29:46 lcgrb01 edg-wl-bkserverd[7321]: No such file or directory (job not registered)
> >
> > several times a second.
> >
> > I predict that the system will be full by the morning and therefore
> > unusable, if it isn't already.  Clearly I'd like to turn it off and
> > redeploy the box into our main batch system, but that would screw up
> > the LCG, so has anyone got any ideas about this?
> >
> > Martin.
> >
>

--
Steve Traylen
[log in to unmask]
http://www.gridpp.ac.uk/