Hi Folks,

A tale of woe for you...

The pbs_sched scheduler crashed last week without us noticing.  When we came
to investigate on Monday, it restarted OK, but the queue had grown to 80+
jobs, none of which would run - they appeared to start for a bit and then
entered the W state.

Looking at the WNs and the CE, it appeared superficially that they couldn't
talk to each other using the known-hosts mechanism.  However, simple copies
using scp worked both ways from either end.  The log messages from the
failed communications are these:

10/13/2003 17:19:54;0004;
pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;Unable to copy file
globus-cache-export.u6v4sh.gpg from
lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.u6v4sh/globus-cache-export.u6v4sh.gpg
10/13/2003 17:19:54;0004;
pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;lcgce01.gridpp.rl.ac.uk:
Connection refused
10/13/2003 17:19:54;0004;
pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;sh/globus-cache-export.u6v4sh.gpg:
No such file or directory
10/13/2003 17:19:54;0008;   pbs_mom;Req;del_files;cannot stat
globus-cache-export.u6v4sh.gpg
10/13/2003 17:20:51;0080;   pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
[log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg

globus-cache-export.Ruvcei.gpg status=1, try=1
10/13/2003 17:21:23;0080;   pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp
-r
[log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg

globus-cache-export.Ruvcei.gpg status=1, try=2
10/13/2003 17:21:34;0080;   pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
[log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg

globus-cache-export.Ruvcei.gpg status=1, try=3
10/13/2003 17:22:05;0080;   pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp
-r
[log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg

globus-cache-export.Ruvcei.gpg status=1, try=4
10/13/2003 17:22:26;0004;
pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;Unable to copy file
globus-cache-export.Ruvcei.gpg from
lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
10/13/2003 17:22:26;0004;
pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;lcgce01.gridpp.rl.ac.uk:
Connection refused
10/13/2003 17:22:26;0004;
pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;ei/globus-cache-export.Ruvcei.gpg:
No such file or directory
10/13/2003 17:22:26;0008;   pbs_mom;Req;del_files;cannot stat
globus-cache-export.Ruvcei.gpg
10/13/2003 17:47:04;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from
addr 130.246.183.182:1023

The problem appears to be that the job manager system is not creating the
appropriate bundle for the client pbs_mom to copy - the job start fails, and
the copy-back also fails.  How many retries pbs_server/pbs_sched will make
is not clear.
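
A rough sketch of the manual checks this suggests (hostname, path and the
dteam004 account are all copied from the pbs_mom logs above; this is a
hedged illustration, not the exact commands we ran) - to be run on a WN:

```shell
# Hedged diagnostic sketch: CE hostname, staging path and the dteam004
# account are taken from the pbs_mom log excerpts above.
CE=lcgce01.gridpp.rl.ac.uk
SRC=/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg

# 1. Is the CE's sshd reachable at all?  ("Connection refused" in the logs
#    points at the daemon or a firewall, not at host keys.)
#       nc -z -w 5 "$CE" 22
# 2. Repeat the exact copy pbs_mom attempted, as the job's mapped user:
#       su - dteam004 -c "/usr/bin/scp -Br $CE:$SRC ."
# 3. Was the globus-cache-export bundle ever created on the CE at all?
#       ssh "$CE" "ls -ld $(dirname "$SRC")"
echo "checking $CE:$SRC"
```

In our case step 2 worked by hand, which is what made the "Connection
refused" entries so confusing - the failure only appears when pbs_mom does
the copy itself.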

A CE reboot does not clear this problem.

In the absence of any logs which might tell me which CE service config is
b******d, Steve Traylen and I decided to reinstall and reconfigure openpbs
from scratch (terminally removing all queued jobs).  This appears to have
had the desired effect - the CE is now running.

I believe this is the second time I've been forced to resort to this tactic
for this problem - so does anyone know:

a) where to look to find out why openpbs gets so confused, or where the job
manager bit that interfaces to PBS keeps its logs/config data?

b) what causes the problem in the first place?

and

c) why pbs_sched crashes?
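
For (a), and purely as a starting point, the places I'd expect the logs to
be on a stock OpenPBS plus Globus install are listed below.  These paths
are assumptions from default layouts (PBS_HOME is /var/spool/pbs on some
builds, /usr/spool/PBS on others), not verified against our boxes:

```
$PBS_HOME/server_logs/        # pbs_server daily logs
$PBS_HOME/sched_logs/         # pbs_sched logs - any crash message should land here
$PBS_HOME/mom_logs/           # pbs_mom logs, on each WN
$PBS_HOME/mom_priv/config     # per-WN mom config ($clienthost, $usecp, etc.)
~dteam004/gram_job_mgr_*.log  # per-job Globus jobmanager logs, in the user's home
```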

As of 17:08 today, we seem to be running OK.

Martin.

P.S. I'll be at HEPiX next week and on holiday the week after - Steve
Traylen will be looking after our lcg1 nodes.