Hi Martin.

I'll tell you what I understood of this syndrome. Maarten and David may
add more details as they have investigated the issue in depth.

The PBS scheduler problem is apparently a known one: on some occasions
the scheduler either crashes or, worse, keeps running in a "confused"
state. In both cases restarting it is supposed to put things back on
track. I understand that this specific syndrome might be related to the
way we are using it: this will need more investigation if we decide to
keep using PBS as the default batch system for small sites.

The fact that restarting it was not sufficient to fix your system is due
to a second problem: once in a while the qstat command that the
jobmanager issues to check whether a job is still queued or running
fails. This condition was not handled correctly, so when it happened the
gatekeeper assumed the job had finished and cleared the corresponding
gass_cache area.
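
To make that concrete, here is a minimal sketch (Python, purely
illustrative: the real jobmanager scripts are different, and the retry
parameters and the error-message check are my own assumptions) of how a
poll can avoid mistaking a transient qstat failure for job completion:

    import subprocess
    import time

    def poll_job_state(job_id, retries=3, delay=30):
        """Return the PBS state letter (Q, R, W, ...) for job_id, or
        None only when we are sure the job is gone.

        Key point: a transient qstat failure must NOT be read as
        "job finished" - that is exactly the bug described above."""
        for attempt in range(retries):
            proc = subprocess.run(["qstat", "-f", job_id],
                                  capture_output=True, text=True)
            if proc.returncode == 0:
                # Parse the "job_state = X" line from qstat -f output.
                for line in proc.stdout.splitlines():
                    if "job_state" in line:
                        return line.split("=", 1)[1].strip()
            # OpenPBS reports "Unknown Job Id" when the job really no
            # longer exists; anything else is treated as transient.
            if "Unknown Job Id" in proc.stderr:
                return None
            time.sleep(delay)
        # Still failing after all retries: refuse to guess, so the
        # caller does not wrongly clear the gass_cache area.
        raise RuntimeError("qstat for %s kept failing; state unknown"
                           % job_id)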

As the job was in fact still queued, when the PBS server eventually sent
it to the WN, pbs_mom would look for the corresponding gass_cache files
and, as they were no longer there, would fail to start the job. Now
comes the nasty bit: PBS has an internal mechanism, probably put in
place to cope with slow shared filesystems, which assumes that missing
files are simply late in reaching the WN, so it queues the job again,
thus creating the infinite loop you observed.
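
In pseudocode the behaviour amounts to something like the toy model
below (my own reconstruction, not pbs_mom source; the requeue cap at the
end is exactly what seems to be missing here):

    class Job:
        def __init__(self, name):
            self.name = name
            self.requeues = 0

    def stage_in(job):
        # Stand-in for pbs_mom's file-delivery step; in the broken
        # scenario the gass_cache files are gone, so it always fails.
        return False

    def dispatch(job, max_requeues=3):
        """Toy model of the requeue loop described above."""
        while True:
            if stage_in(job):
                print("running", job.name)
                return
            # pbs_mom assumes the files are merely slow to arrive over
            # a shared filesystem and requeues the job; without a cap,
            # this is the infinite loop you observed.
            if job.requeues >= max_requeues:
                print("aborting %s: stage-in failed repeatedly"
                      % job.name)
                return
            job.requeues += 1

    dispatch(Job("globus-cache-export.u6v4sh"))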

As you noticed, I was using the past tense in the description: the new
release (which I am testing right now at CERN) is supposed to fix at
least the qstat issue, so in principle, even if the scheduler can still
misbehave, restarting it should really be enough.

BTW, when this happened at CERN last week, I just qdel'ed the looping
jobs and restarted the services: apparently this was enough to clean the
system.
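
Something along the lines of this hypothetical helper (assuming a stock
OpenPBS install with the usual init scripts; the W-state filter matches
what you reported) is all it took:

    import subprocess

    def clean_looping_jobs():
        # qdel every job stuck in the W (waiting) state...
        out = subprocess.run(["qstat"], capture_output=True,
                             text=True).stdout
        for line in out.splitlines():
            fields = line.split()
            # default qstat columns: Job id, Name, User, Time Use, S, Queue
            if len(fields) >= 6 and fields[4] == "W":
                subprocess.run(["qdel", fields[0]])
        # ...then bounce the PBS daemons running on the CE.
        for svc in ("pbs_server", "pbs_sched"):
            subprocess.run(["service", svc, "restart"])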

Cheers

                 Emanuele


Martin Bly wrote:
> Hi Folks,
>
> A tale of woe for you...
>
> The pbs_sched scheduler crashed last week without us noticing.  When we came
> to investigate on Monday, it restarted OK, but the queue of jobs had reached
> 80+, none of which would work - they appeared to run for a bit and then
> entered the W state.
>
> Looking at the WNs and the CE, it appeared superficially that they couldn't
> talk to each other using the known hosts mechanism.  However, simple copies
> using scp worked both ways from either end.  The log messages from the
> failed communications are these:
>
> 10/13/2003 17:19:54;0004;
> pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;Unable to copy file
> globus-cache-export.u6v4sh.gpg from
> lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.u6v4sh/globus-cache-export.u6v4sh.gpg
> 10/13/2003 17:19:54;0004;
> pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;lcgce01.gridpp.rl.ac.uk:
> Connection refused
> 10/13/2003 17:19:54;0004;
> pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;sh/globus-cache-export.u6v4sh.gpg:
> No such file or directory
> 10/13/2003 17:19:54;0008;   pbs_mom;Req;del_files;cannot stat
> globus-cache-export.u6v4sh.gpg
> 10/13/2003 17:20:51;0080;   pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
> [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
>
> globus-cache-export.Ruvcei.gpg status=1, try=1
> 10/13/2003 17:21:23;0080;   pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp
> -r
> [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
>
> globus-cache-export.Ruvcei.gpg status=1, try=2
> 10/13/2003 17:21:34;0080;   pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
> [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
>
> globus-cache-export.Ruvcei.gpg status=1, try=3
> 10/13/2003 17:22:05;0080;   pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp
> -r
> [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
>
> globus-cache-export.Ruvcei.gpg status=1, try=4
> 10/13/2003 17:22:26;0004;
> pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;Unable to copy file
> globus-cache-export.Ruvcei.gpg from
> lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
> 10/13/2003 17:22:26;0004;
> pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;lcgce01.gridpp.rl.ac.uk:
> Connection refused
> 10/13/2003 17:22:26;0004;
> pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;ei/globus-cache-export.Ruvcei.gpg:
> No such file or directory
> 10/13/2003 17:22:26;0008;   pbs_mom;Req;del_files;cannot stat
> globus-cache-export.Ruvcei.gpg
> 10/13/2003 17:47:04;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from
> addr 130.246.183.182:1023
>
> The problem appears to be that the job manager system is not creating the
> appropriate bundle for the client pbs_mom to copy - the job start fails, and
> the copy-back also fails.  How many retries pbs_server/pbs_sched will make
> is not clear.
>
> A CE reboot does not clear this problem.
>
> In the absence of any logs which might tell me which CE service config is
> b******d, Steve Traylen and I decided to reinstall and reconfigure openpbs
> from scratch (terminally removing all queued jobs).  This appears to have had
> the desired effect - the CE is now running.
>
> I believe this is the second time I've been forced to resort to this tactic
> for this problem - so does anyone know:
>
> a) where to look to find out why openpbs gets so confused, or where the job
> manager bit that interfaces to PBS keeps its logs/config data?
>
> b) what causes the problem in the first place?
>
> and
>
> c) why pbs_sched crashes?
>
> As of 17:08 today, we seem to be running OK.
>
> Martin.
>
> P.S. I'll be at HEPiX next week and on holiday the week after - Steve
> Traylen will be looking after our lcg1 nodes.


--
/------------------- Emanuele Leonardi -------------------\
| eMail: [log in to unmask] - Tel.: +41-22-7674066 |
|  IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23  |
\---------------------------------------------------------/