The RAL RB is somewhat sick, despite having some ports opened in our
firewall. The contents of /opt/globus/var/condor/log/Schedlog, for example:
12/9 17:14:14 ******************************************************
12/9 17:14:14 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
12/9 17:14:14 ** $CondorVersion: 6.5.3 Jun 16 2003 PRE-RELEASE $
12/9 17:14:14 ** $CondorPlatform: INTEL-LINUX-GLIBC22 $
12/9 17:14:14 ** PID = 7302
12/9 17:14:14 ******************************************************
12/9 17:14:14 Using config file: /opt/condor/etc/condor.conf
12/9 17:14:15 DaemonCore: Command Socket at <130.246.183.184:32772>
12/9 17:14:15 "/opt/condor/sbin/condor_shadow -classad" did not produce any output, ignoring
12/9 17:14:15 "/opt/condor/sbin/condor_shadow.pvm -classad" did not produce any output, ignoring
12/9 17:14:15 "/opt/condor/sbin/condor_shadow.std -classad" did not produce any output, ignoring
12/9 17:16:32 Sent ad to central manager for [log in to unmask]
12/9 17:16:32 Removed old scratch dir /tmp/condor_g_scratch.0xb2c9ea0.14366
12/9 17:16:33 Removed old scratch dir /tmp/condor_g_scratch.0xb2ca040.14366
12/9 17:16:33 Started condor_gmanager for owner edguser pid=7615
12/9 17:16:33 Started condor_gmanager for owner edguser pid=7616
12/9 17:16:33 condor_write(): Socket closed when trying to write buffer
12/9 17:16:33 Buf::write(): condor_write() failed
12/9 17:16:33 SECMAN: Error sending response classad!
12/9 17:16:33 DaemonCore: Command received via TCP from host <130.246.183.184:50031>
12/9 17:16:33 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
12/9 17:16:34 DaemonCore: Command received via TCP from host <130.246.183.184:50086>
12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
12/9 17:16:34 DaemonCore: Command received via TCP from host <130.246.183.184:50087>
12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
12/9 17:16:34 DaemonCore: Command received via TCP from host <130.246.183.184:50090>
12/9 17:16:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
12/9 17:16:35 DaemonCore: Command received via TCP from host <130.246.183.184:50092>
12/9 17:16:35 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
12/9 17:16:35 DaemonCore: Command received via TCP from host <130.246.183.184:50093>
12/9 17:16:35 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
12/9 17:22:10 Sent ad to central manager for [log in to unmask]
12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
12/9 17:22:10 Buf::write(): condor_write() failed
12/9 17:22:10 SECMAN: Error sending response classad!
12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
12/9 17:22:10 Buf::write(): condor_write() failed
12/9 17:22:10 AUTHENTICATE: handshake failed!
12/9 17:22:10 DaemonCore: Command received via TCP from host <130.246.183.184:50130>
12/9 17:22:10 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
12/9 17:22:10 Buf::write(): condor_write() failed
12/9 17:22:10 AUTHENTICATE: handshake failed!
12/9 17:22:10 condor_write(): Socket closed when trying to write buffer
12/9 17:22:10 Buf::write(): condor_write() failed
12/9 17:22:10 AUTHENTICATE: handshake failed!
[root@lcgrb01 log]#
The edg-wl-bkserverd processes are filling /var/log/messages with:
Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7316]: File exists (duplicate event)
Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7307]: No such file or directory (job not registered)
Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7309]: No such file or directory (job not registered)
Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[8212]: No such file or directory (job not registered)
Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7321]: No such file or directory (job not registered)
several times a second.
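To put a number on that rate, something like the following could tally the
edg-wl-bkserverd messages per second and per message text. It is only a rough
sketch: it assumes each message sits on a single line in the standard syslog
layout (timestamp, host, process[pid]: text), and the names tally and report
are made up for illustration, not part of any EDG or LCG tool.

```python
import re
from collections import Counter

# Assumed layout: "Dec 9 17:29:46 lcgrb01 edg-wl-bkserverd[7316]: message text"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+"
    r"edg-wl-bkserverd\[\d+\]:\s+(?P<msg>.*)$"
)


def tally(lines):
    """Count bkserverd messages per timestamp and per message text."""
    per_second = Counter()
    per_message = Counter()
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m:  # lines from other processes are ignored
            per_second[m.group("stamp")] += 1
            per_message[m.group("msg")] += 1
    return per_second, per_message


def report(path="/var/log/messages"):
    """Print the message counts and the busiest second for one log file."""
    with open(path, errors="replace") as fh:
        per_second, per_message = tally(fh)
    for msg, count in per_message.most_common():
        print(count, msg)
    if per_second:
        print("peak messages in one second:", max(per_second.values()))
```

Running report() as root on the RB would show which of the two messages
dominates and roughly how fast the log is growing.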
I predict that the disk will be full by the morning, and the box therefore
unusable, if it isn't already. Clearly I'd like to turn it off and redeploy
the box into our main batch system, but that would screw up the LCG, so has
anyone got any ideas about this?
Martin.