Dear Ian,
I would like to try the JLAB scheduler for PBS. Do you know where I
can get it?
Regards,
Jorge Gomes
--
LIP - Laboratório de Instrumentação e Física Experimental de Partículas
Av. Elias Garcia 14, 1º ; 1000-149 Lisboa ; Portugal
Tel: (+351) 217973880 Fax: (+351) 217934631 Gsm: (+351) 939580212
Quoting Ian Bird <[log in to unmask]>:
> I'm not sure if people are aware of this, but at JLAB we developed a
> scheduler for PBS that works like the LSF scheduler with hierarchical
> fairshares. This is much better than Maui in typical HEP environments with
> heterogeneous clusters. It should be available to any site that wants it.
>
> Ian
>
> -----Original Message-----
> From: Bly, MJ (Martin) [mailto:[log in to unmask]]
> Sent: Thu 16-Oct-03 9:42
> To: [log in to unmask]
> Cc:
> Subject: Re: [LCG-ROLLOUT] RAL status
>
>
>
> Emanuele,
>
> It is the openpbs system with the Maui scheduler - with some patches to
> make it schedule on CPU time rather than wall time, since we have a
> heterogeneous set of boxes running the batch service. Maui is a more
> sophisticated scheduling engine, giving us finer control, but it has its
> problems.
>
> Martin.
> --
> -------------------------------------------------------
> Martin Bly | +44 1235 446981 | [log in to unmask]
> Systems Admin, Tier 1/A Service, RAL PPD CSG
> -------------------------------------------------------
>
> > -----Original Message-----
> > From: Emanuele LEONARDI [mailto:[log in to unmask]]
> > Sent: Wednesday, October 15, 2003 6:34 PM
> > To: [log in to unmask]
> > Subject: Re: [LCG-ROLLOUT] RAL status
> >
> >
> > Hi John.
> >
> > I was talking about the default batch system which is used in the
> > example installation we provide. As we distribute it to everybody, we
> > have to use a free one and OpenPBS was the simplest (but not the only)
> > choice.
> >
> > Any site is of course free to choose whichever batch system they want
> > as long as they know how to interface it to globus. Some examples for
> > LSF and Condor are indeed already available and at CERN the batch
> > system of choice will be LSF, already in use on lxbatch.
> >
> > In any case, if you are using OpenPBS on a large scale at RAL, then it
> > would be very interesting to see your configuration for the scheduler.
> > But I guess you got the Professional version there.
> >
> > Cheers
> >
> > Emanuele
> >
> > Gordon, JC (John) wrote:
> > > Emanuele, Martin, the position Ian laid out at GDB last week was that
> > > it was up to sites which batch system they use, but LCG would be
> > > adding support for some popular ones (like LSF, Condor). At RAL we
> > > have been using PBS as our main batch system for some time and have
> > > no short-term intention to change. I was not aware that we
> > > experienced such problems (Andrew may correct me), so I am curious
> > > why LCG1 sees them. Are we running a different version, or
> > > configuring/using it in a different way? Perhaps we should be
> > > changing to the professional version.
> > >
> > > John
> > >
> > > -----Original Message-----
> > > From: Emanuele LEONARDI [mailto:[log in to unmask]]
> > > Sent: 15 October 2003 17:53
> > > To: [log in to unmask]
> > > Subject: Re: [LCG-ROLLOUT] RAL status
> > >
> > >
> > > Hi Martin.
> > >
> > > I'll tell you what I understood of this syndrome. Maarten and David
> > > may add more details, as they have investigated the issue in depth.
> > >
> > > The pbs scheduler problem is apparently a known one: on some
> > > occasions the scheduler either crashes or, worse, keeps running but
> > > in a "confused" mode. In both cases restarting it is supposed to put
> > > things back on track. I understand that this specific syndrome might
> > > be related to the way we are using it: this will need more
> > > investigation if we decide to keep using PBS as the default batch
> > > system for small sites.
> > >
> > > The fact that restarting it was not sufficient to fix your system is
> > > due to a second problem: once in a while the qstat command issued by
> > > the jobmanager to check whether the job is still queued/running
> > > fails. This condition was not correctly handled, so when it happened
> > > the gatekeeper thought the job was finished and cleared the
> > > corresponding gass_cache area.
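
A minimal sketch of the handling described above, in Python (the names and logic are illustrative assumptions, not the actual Globus jobmanager code): a failed qstat invocation must be treated as "state unknown", keeping the gass_cache and polling again later; only a successful qstat that no longer lists the job means it has finished.

```python
import subprocess

DONE, RUNNING, UNKNOWN = "done", "running", "unknown"

def classify(returncode, stdout):
    """Interpret one qstat poll of a batch job.

    The bug described above amounts to treating a failed qstat the
    same as 'job no longer listed', which made the gatekeeper purge
    the job's gass_cache area while the job was still queued."""
    if returncode != 0:
        return UNKNOWN   # qstat itself failed: learn nothing, keep the cache
    if stdout.strip():
        return RUNNING   # the server still lists the job
    return DONE          # qstat succeeded and the job is gone

def poll_job(job_id):
    """Ask the PBS server about job_id (illustrative wrapper)."""
    proc = subprocess.run(["qstat", job_id], capture_output=True, text=True)
    return classify(proc.returncode, proc.stdout)
```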
> > >
> > > As the job was in fact still queued, when the pbs server eventually
> > > sent it to the WN, pbs_mom would look for the corresponding
> > > gass_cache files and, as they were no longer there, it would fail to
> > > start the job. Now comes the nasty bit: pbs has an internal
> > > mechanism, probably put in place to handle slow shared filesystems,
> > > which assumes that the missing files are due to some slowness in
> > > passing the files to the WN, so it queues the job again, thus
> > > creating the infinite loop you observed.
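
That requeue mechanism can be modelled in a few lines of Python (purely illustrative; the names are invented and this is not PBS source - the four tries match the try=1..4 visible in Martin's log below): pbs_mom retries the stage-in on the assumption that a shared filesystem is merely slow, and requeues the job when the files never appear, which loops forever once the gass_cache files have been deleted for good.

```python
def stage_in(files_present, max_tries=4):
    """Illustrative model of pbs_mom's stage-in retries."""
    for _ in range(max_tries):
        if files_present():
            return "started"   # files arrived, the job can run
    # The files never showed up: assume the filesystem was slow and
    # requeue the job - an infinite loop if they were deleted for good.
    return "requeued"
```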
> > >
> > > As you noticed, I was using the past tense in the description: the
> > > new release (that I am testing right now at CERN) is supposed to fix
> > > at least the qstat issue, so that in principle, even if the scheduler
> > > can still have problems, restarting it should really solve the
> > > problem.
> > >
> > > BTW, when this happened at CERN last week, I just qdel'ed the
> > > looping jobs and restarted the services: apparently this was enough
> > > to clean the system.
> > >
> > > Cheers
> > >
> > > Emanuele
> > >
> > >
> > > Martin Bly wrote:
> > >
> > >>Hi Folks,
> > >>
> > >>A tale of woe for you...
> > >>
> > >>The pbs_sched scheduler crashed last week without us noticing. When
> > >>we came to investigate on Monday, it restarted OK, but the queue of
> > >>jobs had reached 80+, none of which would work - they appeared to run
> > >>for a bit and then entered the W state.
> > >>
> > >>Looking at the WNs and the CE, it appeared superficially that they
> > >>couldn't talk to each other using the known hosts mechanism. However,
> > >>simple copies using scp worked both ways from either end. The log
> > >>messages from the failed communications are these:
> > >>
> > >>10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;Unable to copy file globus-cache-export.u6v4sh.gpg from lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.u6v4sh/globus-cache-export.u6v4sh.gpg
> > >>10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;lcgce01.gridpp.rl.ac.uk: Connection refused
> > >>10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;sh/globus-cache-export.u6v4sh.gpg: No such file or directory
> > >>10/13/2003 17:19:54;0008; pbs_mom;Req;del_files;cannot stat globus-cache-export.u6v4sh.gpg
> > >>10/13/2003 17:20:51;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=1
> > >>10/13/2003 17:21:23;0080; pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=2
> > >>10/13/2003 17:21:34;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=3
> > >>10/13/2003 17:22:05;0080; pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=4
> > >>10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;Unable to copy file globus-cache-export.Ruvcei.gpg from lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
> > >>10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;lcgce01.gridpp.rl.ac.uk: Connection refused
> > >>10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;ei/globus-cache-export.Ruvcei.gpg: No such file or directory
> > >>10/13/2003 17:22:26;0008; pbs_mom;Req;del_files;cannot stat globus-cache-export.Ruvcei.gpg
> > >>10/13/2003 17:47:04;0001; pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 130.246.183.182:1023
> > >>
> > >>The problem appears to be that the job manager system is not
> > >>creating the appropriate bundle for the client pbs_mom to copy - the
> > >>job start fails, and the copy-back also fails. How many retries
> > >>pbs_server/pbs_sched will make is not clear.
> > >>
> > >>A CE reboot does not clear this problem.
> > >>
> > >>In the absence of any logs which might tell me which CE service
> > >>config is b******d, Steve Traylen and I decided to reinstall and
> > >>reconfigure openpbs from scratch (terminally removing all queued
> > >>jobs). This appears to have had the desired effect - the CE is now
> > >>running.
> > >>
> > >>I believe this is the second time I've been forced to resort to this
> > >>tactic for this problem - so does anyone know:
> > >>
> > >>a) where to look to find out why openpbs gets so confused, or where
> > >>the job manager bit that interfaces to PBS keeps its logs/config
> > >>data?
> > >>
> > >>b) what causes the problem in the first place?
> > >>
> > >>and
> > >>
> > >>c) why pbs_sched crashes?
> > >>
> > >>As of 17:08 today, we seem to be running OK.
> > >>
> > >>Martin.
> > >>
> > >>P.S. I'll be at HEPiX next week and on holiday the week after - Steve
> > >>Traylen will be looking after our lcg1 nodes.
> > >
> > >
> > >
> > > --
> > > /------------------- Emanuele Leonardi -------------------\
> > > | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
> > > | IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
> > > \---------------------------------------------------------/
> >
> >
> > --
> > /------------------- Emanuele Leonardi -------------------\
> > | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
> > | IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
> > \---------------------------------------------------------/
> >
>
>
>