The problem has not resolved itself overnight, but then I didn't expect it to.
I need some answers on this if the sites relying on the RAL RB are going to get
any sort of service any time soon.
Reinstallation = defeat.
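For what it's worth, here is the sort of throwaway check I'd use to see how fast the log flood is eating the disk (a hypothetical helper of my own, not part of any Condor or EDG tooling): sample the file size twice and report bytes per second, then compare against the free space `df -k /var` reports.

```shell
# Throwaway growth-rate check (hypothetical helper, not from any
# LCG/Condor package): sample a file's size twice and print the
# growth in bytes per second.
log_growth() {
    f=$1
    secs=${2:-10}
    before=$(wc -c < "$f")
    sleep "$secs"
    after=$(wc -c < "$f")
    echo $(( (after - before) / secs ))
}
# Usage: log_growth /var/log/messages 10
```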
Martin.
--
-------------------------------------------------------
Martin Bly | +44 1235 446981 | [log in to unmask]
Systems Admin, Tier 1/A Service, RAL PPD CSG
-------------------------------------------------------
> -----Original Message-----
> From: Martin Bly [mailto:[log in to unmask]]
> Sent: Tuesday, December 09, 2003 5:32 PM
> To: [log in to unmask]
> Cc: Martin Bly
> Subject: RAL RB very sick.
>
>
> The RAL RB is somewhat sick, despite having had some ports opened in our
> firewall. The contents of /opt/globus/var/condor/log/Schedlog, for example:
>
> 12/9 17:14:14 ******************************************************
> 12/9 17:14:14 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 12/9 17:14:14 ** $CondorVersion: 6.5.3 Jun 16 2003 PRE-RELEASE $
> 12/9 17:14:14 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
> 12/9 17:14:14 ** PID = 7302
> 12/9 17:14:14 ******************************************************
> 12/9 17:14:14 Using config file: /opt/condor/etc/condor.conf
> 12/9 17:14:15 DaemonCore: Command Socket at <130.246.183.184:32772>
> 12/9 17:14:15 "/opt/condor/sbin/condor_shadow -classad" did not produce any output, ignoring
> 12/9 17:14:15 "/opt/condor/sbin/condor_shadow.pvm -classad" did not produce any output, ignoring
> 12/9 17:14:15 "/opt/condor/sbin/condor_shadow.std -classad" did not produce any output, ignoring
> 12/9 17:16:32 Sent ad to central manager for [log in to unmask]
> 12/9 17:16:32 Removed old scratch dir /tmp/condor_g_scratch.0xb2c9ea0.14366
> 12/9 17:16:33 Removed old scratch dir /tmp/condor_g_scratch.0xb2ca040.14366
> 12/9 17:16:33 Started condor_gmanager for owner edguser pid=7615
> 12/9 17:16:33 Started condor_gmanager for owner edguser pid=7616
> 12/9 17:16:33 condor_write(): Socket closed when trying to write buffer
> 12/9 17:16:33 Buf::write(): condor_write() failed
> 12/9 17:16:33 SECMAN: Error sending response classad!
> 12/9 17:16:33 DaemonCore: Command received via TCP from host <130.246.183.184:50031>
> 12/9 17:16:33 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> 12/9 17:16:34 DaemonCore: Command received via TCP from host <130.246.183.184:50086>
> 12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> 12/9 17:16:34 DaemonCore: Command received via TCP from host <130.246.183.184:50087>
> 12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> 12/9 17:16:34 DaemonCore: Command received via TCP from host <130.246.183.184:50090>
> 12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> 12/9 17:16:35 DaemonCore: Command received via TCP from host <130.246.183.184:50092>
> 12/9 17:16:35 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> 12/9 17:16:35 DaemonCore: Command received via TCP from host <130.246.183.184:50093>
> 12/9 17:16:35 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> 12/9 17:22:10 Sent ad to central manager for [log in to unmask]
> 12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
> 12/9 17:22:10 Buf::write(): condor_write() failed
> 12/9 17:22:10 SECMAN: Error sending response classad!
> 12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
> 12/9 17:22:10 Buf::write(): condor_write() failed
> 12/9 17:22:10 AUTHENTICATE: handshake failed!
> 12/9 17:22:10 DaemonCore: Command received via TCP from host <130.246.183.184:50130>
> 12/9 17:22:10 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> 12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
> 12/9 17:22:10 Buf::write(): condor_write() failed
> 12/9 17:22:10 AUTHENTICATE: handshake failed!
> 12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
> 12/9 17:22:10 Buf::write(): condor_write() failed
> 12/9 17:22:10 AUTHENTICATE: handshake failed!
> [root@lcgrb01 log]#
>
> The edg-wl-bkserverd processes are filling /var/log/messages with:
>
> Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7316]: File exists (duplicate event)
> Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
> Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
> Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7307]: No such file or directory (job not registered)
> Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7309]: No such file or directory (job not registered)
> Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
> Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7321]: No such file or directory (job not registered)
>
> several times a second.
>
> I predict that the disk will be full by the morning, making the system
> unusable if it isn't already. Clearly I'd like to turn it off and redeploy
> the box into our main batch system, but that would screw up the LCG, so has
> anyone got any ideas about this?
>
> Martin.
>