To answer John's point about openpbs running on the main farm: we use the
Maui scheduler, which replaces pbs_sched. It has its own problems, but
this doesn't seem to be one of them.
Martin.
--
-------------------------------------------------------
Martin Bly | +44 1235 446981 | [log in to unmask]
Systems Admin, Tier 1/A Service, RAL PPD CSG
-------------------------------------------------------
> -----Original Message-----
> From: Gordon, JC (John) [mailto:[log in to unmask]]
> Sent: Wednesday, October 15, 2003 6:07 PM
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] RAL status
>
>
> Emanuele, Martin, the position Ian laid out at GDB last week was that it
> was up to sites which batch system they use, but LCG would be adding
> support for some popular ones (like LSF and Condor). At RAL we have been
> using PBS as our main batch system for some time and have no short-term
> intention to change. I was not aware that we had experienced such
> problems (Andrew may correct me), so I am curious why LCG1 sees them.
> Are we running a different version, or configuring/using it in a
> different way? Perhaps we should be changing to the professional
> version.
>
> John
>
> -----Original Message-----
> From: Emanuele LEONARDI [mailto:[log in to unmask]]
> Sent: 15 October 2003 17:53
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] RAL status
>
>
> Hi Martin.
>
> I'll tell you what I understood of this syndrome. Maarten and David may
> add more details, as they have investigated the issue in depth.
>
> The PBS scheduler problem is apparently a known one: on some occasions
> the scheduler either crashes or, worse, keeps running but in a
> "confused" mode. In both cases restarting it is supposed to put things
> back on track. I understand that this specific syndrome might be related
> to the way we are using it; this will need more investigation if we
> decide to keep using PBS as the default batch system for small sites.
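>
> As a stopgap while this is investigated, a watchdog along these lines,
> run from cron on the CE, would at least catch the outright crashes. This
> is only a sketch - the init script name and log file are assumptions and
> may differ per installation - and it cannot detect the "confused but
> still running" mode:
>
>   #!/bin/sh
>   # Restart pbs_sched if it has died (crash case only).
>   if ! pidof pbs_sched >/dev/null; then
>       echo "`date`: pbs_sched died, restarting" >> /var/log/pbs_sched-watchdog.log
>       /etc/rc.d/init.d/pbs_sched restart
>   fi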
>
> The fact that restarting it was not sufficient to fix your system is due
> to a second problem: once in a while the qstat command issued by the
> jobmanager to check whether the job is still queued/running fails. This
> condition was not handled correctly, so when this happened the
> gatekeeper thought the job was finished and cleared the corresponding
> gass_cache area.
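>
> To illustrate the idea behind that fix - this is not the actual
> jobmanager code, just a sketch of the defensive behaviour you want - the
> trick is to retry qstat a few times before concluding that the job has
> really left the queue:
>
>   #!/bin/sh
>   # Usage: still_known <jobid>
>   # Exits 0 if the PBS server still knows about the job, 1 otherwise.
>   jobid=$1
>   for try in 1 2 3; do
>       qstat "$jobid" >/dev/null 2>&1 && exit 0   # job still queued/running
>       sleep 30                                   # qstat may have failed transiently
>   done
>   exit 1   # only now treat the job as gone and clean up
>
> Retrying masks the transient qstat failures that were making the
> gatekeeper clear the gass_cache area too early.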
>
> As the job was in fact still queued, when the PBS server eventually sent
> it to the WN, pbs_mom would look for the corresponding gass_cache files
> and, as they were not there any more, would fail to start the job. Now
> comes the nasty bit: PBS has an internal mechanism, probably put in
> place to handle slow shared filesystems, which assumes that the missing
> files are due to some slowness in passing the files to the WN, so it
> queues the job again, thus creating the infinite loop you observed.
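>
> A quick way to see whether a node is caught in this loop is to count the
> repeated copy failures in the pbs_mom log. The log path below is the
> usual OpenPBS default and may well differ at your site:
>
>   # On a WN: how many times has each gass_cache file failed to copy today?
>   grep 'Unable to copy file' /var/spool/pbs/mom_logs/`date +%Y%m%d` \
>       | awk -F';' '{print $5}' | sort | uniq -c | sort -rn
>
> The same file name turning up over and over is the signature of a job
> being requeued indefinitely.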
>
> As you noticed, I was using the past tense in that description: the new
> release (which I am testing right now at CERN) is supposed to fix at
> least the qstat issue, so in principle, even if the scheduler still has
> problems, restarting it should really solve them.
>
> BTW, when this happened at CERN last week, I just qdel'ed the looping
> jobs and restarted the services: apparently this was enough to clean up
> the system.
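>
> In shell terms the cleanup amounted to roughly the following - the job
> selection and init script names are assumptions to adapt to whatever is
> looping at your site:
>
>   # Delete the jobs stuck in the loop (in Martin's case they sat in the
>   # W state, so qselect can pick them out)
>   qdel `qselect -s W`
>
>   # Then restart the batch services on the CE
>   /etc/rc.d/init.d/pbs_server restart
>   /etc/rc.d/init.d/pbs_sched restart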
>
> Cheers
>
> Emanuele
>
>
> Martin Bly wrote:
> > Hi Folks,
> >
> > A tale of woe for you...
> >
> > The pbs_sched scheduler crashed last week without us noticing. When we
> > came to investigate on Monday, it restarted OK, but the queue of jobs
> > had reached 80+. None of them would work: they appeared to run for a
> > bit and then entered the W state.
> >
> > Looking at the WNs and the CE, it appeared superficially that they
> > couldn't talk to each other using the known-hosts mechanism. However,
> > simple copies using scp worked both ways from either end. The log
> > messages from the failed communications are these:
> >
> > 10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;Unable to copy file globus-cache-export.u6v4sh.gpg from lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.u6v4sh/globus-cache-export.u6v4sh.gpg
> > 10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;lcgce01.gridpp.rl.ac.uk: Connection refused
> > 10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;sh/globus-cache-export.u6v4sh.gpg: No such file or directory
> > 10/13/2003 17:19:54;0008; pbs_mom;Req;del_files;cannot stat globus-cache-export.u6v4sh.gpg
> > 10/13/2003 17:20:51;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=1
> > 10/13/2003 17:21:23;0080; pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=2
> > 10/13/2003 17:21:34;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=3
> > 10/13/2003 17:22:05;0080; pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=4
> > 10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;Unable to copy file globus-cache-export.Ruvcei.gpg from lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
> > 10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;lcgce01.gridpp.rl.ac.uk: Connection refused
> > 10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;ei/globus-cache-export.Ruvcei.gpg: No such file or directory
> > 10/13/2003 17:22:26;0008; pbs_mom;Req;del_files;cannot stat globus-cache-export.Ruvcei.gpg
> > 10/13/2003 17:47:04;0001; pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 130.246.183.182:1023
> >
> > The problem appears to be that the job manager system is not creating
> > the appropriate bundle for the client pbs_mom to copy: the job start
> > fails, and the copy-back also fails. How many retries
> > pbs_server/pbs_sched will make is not clear.
> >
> > A CE reboot does not clear this problem.
> >
> > In the absence of any logs which might tell me which CE service config
> > is b******d, Steve Traylen and I decided to reinstall and reconfigure
> > openpbs from scratch (terminally removing all queued jobs). This
> > appears to have had the desired effect: the CE is now running.
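> >
> > Roughly speaking, a reset of this kind looks like the following. The
> > paths and init script names are assumptions for a stock OpenPBS
> > install, and it destroys every queued job, so it really is a last
> > resort:
> >
> >   /etc/rc.d/init.d/pbs_server stop
> >   /etc/rc.d/init.d/pbs_sched stop
> >   rm -rf /var/spool/pbs/server_priv/jobs/*   # throw away all queued jobs
> >   pbs_server -t create                       # re-initialise the server database
> >   # ...then re-create the queues with qmgr and restart pbs_sched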
> >
> > I believe this is the second time I've been forced to resort to this
> > tactic for this problem, so does anyone know:
> >
> > a) where to look to find out why openpbs gets so confused, or where
> > the job manager bit that interfaces to PBS keeps its logs/config data?
> >
> > b) what causes the problem in the first place?
> >
> > and
> >
> > c) why pbs_sched crashes?
> >
> > As of 17:08 today, we seem to be running OK.
> >
> > Martin.
> >
> > P.S. I'll be at HEPiX next week and on holiday the week after - Steve
> > Traylen will be looking after our lcg1 nodes.
>
>
> --
> /------------------- Emanuele Leonardi -------------------\
> | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
> | IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
> \---------------------------------------------------------/
>