Dear Ian,
I would like to try the JLAB scheduler for PBS. Do you know where I
can get it?
Regards,
Jorge Gomes
--
LIP - Laboratório de Instrumentação e Física Experimental de Partículas
Av. Elias Garcia 14, 1º ; 1000-149 Lisboa ; Portugal
Tel: (+351) 217973880 Fax: (+351) 217934631 Gsm: (+351) 939580212
Quoting Ian Bird <[log in to unmask]>:
> I'm not sure if people are aware of this, but at JLAB we developed a
> scheduler for PBS that works like the LSF scheduler with hierarchical
> fairshares. This is much better than Maui in typical HEP environments with
> heterogeneous clusters. It should be available to any site that wants it.
>
> Ian
>
> -----Original Message-----
> From: Bly, MJ (Martin) [mailto:[log in to unmask]]
> Sent: Thu 16-Oct-03 9:42
> To: [log in to unmask]
> Cc:
> Subject: Re: [LCG-ROLLOUT] RAL status
>
>
>
> Emanuele,
>
> It is the openpbs system with the Maui scheduler - with some patches to
> make it schedule on CPU time rather than wall time, since we have a
> heterogeneous set of boxes running the batch service. Maui is a more
> sophisticated scheduling engine, giving us finer control, but it has its
> problems.
>
> Martin.
> --
> -------------------------------------------------------
> Martin Bly | +44 1235 446981 | [log in to unmask]
> Systems Admin, Tier 1/A Service, RAL PPD CSG
> -------------------------------------------------------
>
> > -----Original Message-----
> > From: Emanuele LEONARDI [mailto:[log in to unmask]]
> > Sent: Wednesday, October 15, 2003 6:34 PM
> > To: [log in to unmask]
> > Subject: Re: [LCG-ROLLOUT] RAL status
> >
> >
> > Hi John.
> >
> > I was talking about the default batch system which is used in the
> > example installation we provide. As we distribute it to everybody, we
> > have to use a free one and OpenPBS was the simplest (but not the only)
> > choice.
> >
> > Any site is of course free to choose whichever batch system they want
> > as long as they know how to interface it to globus. Some examples for
> > LSF and Condor are indeed already available and at CERN the batch
> > system of choice will be LSF, already in use on lxbatch.
> >
> > In any case, if you are using OpenPBS on a large scale at RAL, then it
> > would be very interesting to see your configuration for the scheduler.
> > But I guess you got the Professional version there.
> >
> > Cheers
> >
> > Emanuele
> >
> > Gordon, JC (John) wrote:
> > > Emanuele, Martin, the position Ian laid out at GDB last week was that
> > > it was up to sites which batch system they use, but LCG would be
> > > adding support for some popular ones (like LSF, Condor). At RAL we
> > > have been using PBS as our main batch system for some time and have
> > > no short-term intention to change. I was not aware that we
> > > experienced such problems (Andrew may correct me), so I am curious
> > > why LCG1 sees them. Are we running a different version, or
> > > configuring/using it in a different way? Perhaps we should be
> > > changing to the professional version.
> > >
> > > John
> > >
> > > -----Original Message-----
> > > From: Emanuele LEONARDI [mailto:[log in to unmask]]
> > > Sent: 15 October 2003 17:53
> > > To: [log in to unmask]
> > > Subject: Re: [LCG-ROLLOUT] RAL status
> > >
> > >
> > > Hi Martin.
> > >
> > > I'll tell you what I understood of this syndrome. Maarten and David
> > > may add more details, as they have investigated the issue in depth.
> > >
> > > The pbs scheduler problem is apparently a known one: on some
> > > occasions the scheduler either crashes or, worse, keeps running but
> > > in a "confused" mode. In both cases restarting it is supposed to put
> > > things back on track. I understand that this specific syndrome might
> > > be related to the way we are using it: this will need more
> > > investigation if we decide to keep using PBS as the default batch
> > > system for small sites.
> > >
> > > The fact that restarting it was not sufficient to fix your system is
> > > due to a second problem: once in a while the qstat command issued by
> > > the jobmanager to check whether the job is still queued/running
> > > fails. This condition was not correctly handled, so when it happened
> > > the gatekeeper thought the job was finished and cleared the
> > > corresponding gass_cache area.
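
A minimal sketch of the handling described above, in Python (the names and logic are illustrative assumptions, not the actual Globus jobmanager code): a failed qstat invocation must be treated as "state unknown", keeping the gass_cache and polling again later; only a successful qstat that no longer lists the job means it has finished.

```python
import subprocess

DONE, RUNNING, UNKNOWN = "done", "running", "unknown"

def classify(returncode, stdout):
    """Interpret one qstat poll of a batch job.

    The bug described above amounts to treating a failed qstat the
    same as 'job no longer listed', which made the gatekeeper purge
    the job's gass_cache area while the job was still queued."""
    if returncode != 0:
        return UNKNOWN   # qstat itself failed: learn nothing, keep the cache
    if stdout.strip():
        return RUNNING   # the server still lists the job
    return DONE          # qstat succeeded and the job is gone

def poll_job(job_id):
    """Ask the PBS server about job_id (illustrative wrapper)."""
    proc = subprocess.run(["qstat", job_id], capture_output=True, text=True)
    return classify(proc.returncode, proc.stdout)
```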
> > >
> > > As the job was in fact still queued, when the pbs server eventually
> > > sent it to the WN, pbs_mom would look for the corresponding
> > > gass_cache files and, as they were no longer there, it would fail to
> > > start the job. Now comes the nasty bit: pbs has an internal
> > > mechanism, probably put in place to handle slow shared filesystems,
> > > which assumes that the missing files are due to some slowness in
> > > passing the files to the WN, so it queues the job again, thus
> > > creating the infinite loop you observed.
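
That requeue mechanism can be modelled in a few lines of Python (purely illustrative; the names are invented and this is not PBS source - the four tries match the try=1..4 visible in Martin's log below): pbs_mom retries the stage-in on the assumption that a shared filesystem is merely slow, and requeues the job when the files never appear, which loops forever once the gass_cache files have been deleted for good.

```python
def stage_in(files_present, max_tries=4):
    """Illustrative model of pbs_mom's stage-in retries."""
    for _ in range(max_tries):
        if files_present():
            return "started"   # files arrived, the job can run
    # The files never showed up: assume the filesystem was slow and
    # requeue the job - an infinite loop if they were deleted for good.
    return "requeued"
```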
> > >
> > > As you noticed, I was using the past tense in the description: the
> > > new release (that I am testing right now at CERN) is supposed to fix
> > > at least the qstat issue, so that in principle, even if the scheduler
> > > can still have problems, restarting it should really solve the
> > > problem.
> > >
> > > BTW, when this happened at CERN last week, I just qdel'ed the
> > > looping jobs and restarted the services: apparently this was enough
> > > to clean the system.
> > >
> > > Cheers
> > >
> > > Emanuele
> > >
> > >
> > > Martin Bly wrote:
> > >
> > >>Hi Folks,
> > >>
> > >>A tale of woe for you...
> > >>
> > >>The pbs_sched scheduler crashed last week without us noticing. When
> > >>we came to investigate on Monday, it restarted OK, but the queue of
> > >>jobs had reached 80+, none of which would work - they appeared to run
> > >>for a bit and then entered the W state.
> > >>
> > >>Looking at the WNs and the CE, it appeared superficially that they
> > >>couldn't talk to each other using the known hosts mechanism. However,
> > >>simple copies using scp worked both ways from either end. The log
> > >>messages from the failed communications are these:
> > >>
> > >>10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;Unable to copy file globus-cache-export.u6v4sh.gpg from lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.u6v4sh/globus-cache-export.u6v4sh.gpg
> > >>10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;lcgce01.gridpp.rl.ac.uk: Connection refused
> > >>10/13/2003 17:19:54;0004; pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;sh/globus-cache-export.u6v4sh.gpg: No such file or directory
> > >>10/13/2003 17:19:54;0008; pbs_mom;Req;del_files;cannot stat globus-cache-export.u6v4sh.gpg
> > >>10/13/2003 17:20:51;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=1
> > >>10/13/2003 17:21:23;0080; pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=2
> > >>10/13/2003 17:21:34;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=3
> > >>10/13/2003 17:22:05;0080; pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r [log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg globus-cache-export.Ruvcei.gpg status=1, try=4
> > >>10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;Unable to copy file globus-cache-export.Ruvcei.gpg from lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
> > >>10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;lcgce01.gridpp.rl.ac.uk: Connection refused
> > >>10/13/2003 17:22:26;0004; pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;ei/globus-cache-export.Ruvcei.gpg: No such file or directory
> > >>10/13/2003 17:22:26;0008; pbs_mom;Req;del_files;cannot stat globus-cache-export.Ruvcei.gpg
> > >>10/13/2003 17:47:04;0001; pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 130.246.183.182:1023
> > >>
> > >>The problem appears to be that the job manager system is not
> > >>creating the appropriate bundle for the client pbs_mom to copy - the
> > >>job start fails, and the copy-back also fails. How many retries
> > >>pbs_server/pbs_sched will make is not clear.
> > >>
> > >>A CE reboot does not clear this problem.
> > >>
> > >>In the absence of any logs which might tell me which CE service
> > >>config is b******d, Steve Traylen and I decided to reinstall and
> > >>reconfigure openpbs from scratch (terminally removing all queued
> > >>jobs). This appears to have had the desired effect - the CE is now
> > >>running.
> > >>
> > >>I believe this is the second time I've been forced to resort to this
> > >>tactic for this problem - so does anyone know:
> > >>
> > >>a) where to look to find out why openpbs gets so confused, or where
> > >>the job manager bit that interfaces to PBS keeps its logs/config
> > >>data?
> > >>
> > >>b) what causes the problem in the first place?
> > >>
> > >>and
> > >>
> > >>c) why pbs_sched crashes?
> > >>
> > >>As of 17:08 today, we seem to be running OK.
> > >>
> > >>Martin.
> > >>
> > >>P.S. I'll be at HEPiX next week and on holiday the week after - Steve
> > >>Traylen will be looking after our lcg1 nodes.
> > >
> > >
> > >
> > > --
> > > /------------------- Emanuele Leonardi -------------------\
> > > | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
> > > | IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
> > > \---------------------------------------------------------/
> >
> >
> > --
> > /------------------- Emanuele Leonardi -------------------\
> > | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
> > | IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
> > \---------------------------------------------------------/
> >
>
>
>