Hi Folks,

A tale of woe for you...

The pbs_sched scheduler crashed last week without our noticing. When we came to investigate on Monday it restarted OK, but the queue had grown to 80+ jobs, none of which would run: they appeared to run for a bit and then entered the W state. Looking at the WNs and the CE, it appeared superficially that they couldn't talk to each other using the known-hosts mechanism. However, simple copies using scp worked both ways, from either end.

The log messages from the failed communications are these:

10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;Unable to copy file globus-cache-export.u6v4sh.gpg from lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.u6v4sh/globus-cache-export.u6v4sh.gpg
10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;lcgce01.gridpp.rl.ac.uk: Connection refused
10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;sh/globus-cache-export.u6v4sh.gpg: No such file or directory
10/13/2003 17:19:54;0008; pbs_mom;Req;del_files;cannot stat globus-cache-export.u6v4sh.gpg
10/13/2003 17:20:51;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=1
10/13/2003 17:21:23;0080; pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=2
10/13/2003 17:21:34;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=3
10/13/2003 17:22:05;0080; pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=4
10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;Unable to copy file globus-cache-export.Ruvcei.gpg from lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;lcgce01.gridpp.rl.ac.uk: Connection refused
10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;ei/globus-cache-export.Ruvcei.gpg: No such file or directory
10/13/2003 17:22:26;0008; pbs_mom;Req;del_files;cannot stat globus-cache-export.Ruvcei.gpg
10/13/2003 17:47:04;0001; pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 130.246.183.182:1023

The problem appears to be that the job manager system is not creating the appropriate bundle for the client pbs_mom to copy: the job start fails, and the copy-back also fails. How many retries pbs_server/pbs_sched will make is not clear. A CE reboot does not clear this problem.

In the absence of any logs which might tell me which CE service config is b******d, Steve Traylen and I decided to reinstall and reconfigure OpenPBS from scratch (terminally removing all queued jobs). This appears to have had the desired effect - the CE is now running.

I believe this is the second time I've been forced to resort to this tactic for this problem, so does anyone know:

a) where to look to find out why OpenPBS gets so confused, or where the job-manager component that interfaces to PBS keeps its logs/config data?
b) what causes the problem in the first place?
c) why pbs_sched crashes?

As of 17:08 today, we seem to be running OK.

Martin.

P.S. I'll be at HEPiX next week and on holiday the week after - Steve Traylen will be looking after our lcg1 nodes.
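P.P.S. For anyone grepping similar mom_logs: the entries above are semicolon-delimited (date; event mask; daemon; section; object; text). A throwaway parser along these lines (my own sketch, not anything shipped with OpenPBS) makes it easy to tally the sys_copy attempts per staged file - from the excerpt above, pbs_mom alternates /usr/bin/scp and /usr/sbin/pbs_rcp and appears to give up after try=4.

```python
# Throwaway sketch (not part of OpenPBS): split pbs_mom log lines of the form
#   date;event-mask;daemon;section;object;text
# and tally failed sys_copy attempts per destination file, so the retry
# behaviour is visible at a glance.
from collections import Counter

def parse_mom_line(line):
    # pbs_mom lines have six semicolon-delimited fields; the text field
    # may itself contain colons, so limit the split to five cuts.
    parts = line.split(";", 5)
    return parts if len(parts) == 6 else None

def count_copy_retries(lines):
    tries = Counter()
    for line in lines:
        rec = parse_mom_line(line)
        if rec and rec[4] == "sys_copy" and "status=1" in rec[5]:
            # the destination filename is the token just before "status=1,"
            tokens = rec[5].split()
            tries[tokens[-3]] += 1
    return tries
```

Fed the excerpt above, it reports four failed attempts for globus-cache-export.Ruvcei.gpg before pbs_mom logs "Unable to copy file".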