I'm not sure if people are aware of this, but at JLAB we developed a scheduler for PBS that works like the LSF scheduler with hierarchical fairshares.  This is much better than Maui in typical HEP environments with heterogeneous clusters.  It should be available to any site that wants it.
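For readers unfamiliar with the term, here is a minimal conceptual sketch of hierarchical fairshare - not the JLab scheduler itself, just the idea: each group owns a share of the cluster, users split their group's share, and the most under-served group/user pair is dispatched first. All names and numbers below are invented.

# Conceptual sketch only - not the JLab scheduler. Groups own shares of the
# cluster, users split their group's share, and the scheduler favours the
# most under-served (group, user) pair.
shares = {"atlas": {"share": 0.6, "users": {"alice": 0.5, "bob": 0.5}},
          "cms":   {"share": 0.4, "users": {"carol": 1.0}}}
usage = {("atlas", "alice"): 120.0, ("atlas", "bob"): 20.0, ("cms", "carol"): 60.0}
total = sum(usage.values())

def deficit(group, user):
    """Entitled fraction of the cluster minus the fraction actually consumed."""
    entitled = shares[group]["share"] * shares[group]["users"][user]
    consumed = usage[(group, user)] / total if total else 0.0
    return entitled - consumed

# Dispatch order: largest deficit (most under-served) first.
order = sorted(usage, key=lambda gu: deficit(*gu), reverse=True)
print(order)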
 
Ian

	-----Original Message----- 
	From: Bly, MJ (Martin) [mailto:[log in to unmask]] 
	Sent: Thu 16-Oct-03 9:42 
	To: [log in to unmask] 
	Cc: 
	Subject: Re: [LCG-ROLLOUT] RAL status
	
	

	Emanuele,
	
	It is the OpenPBS system with the Maui scheduler - with some patches to make
	it schedule on CPU time rather than wall time, since we have a heterogeneous
	set of boxes running the batch service.  Maui is a more sophisticated
	scheduling engine that gives us finer control, but it has its problems.
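To illustrate why CPU time is the fairer accounting unit on a heterogeneous farm, here is a sketch of the idea only - not the RAL patch to OpenPBS/Maui; the job records are invented:

# Illustration only: charge usage by consumed CPU time rather than wall time,
# so a job that lands on a slow node is not accounted as "using more" cluster.
# A real site would take the records from PBS accounting logs or qstat output.

def hhmmss_to_seconds(value):
    """Convert an 'HH:MM:SS' string, as PBS reports cput/walltime, to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def usage_by_user(jobs, charge="cput"):
    """Sum per-user usage, charging the chosen resource (CPU time by default)."""
    totals = {}
    for job in jobs:
        totals[job["owner"]] = totals.get(job["owner"], 0) + hhmmss_to_seconds(job[charge])
    return totals

jobs = [
    {"owner": "dteam004", "cput": "02:05:00", "walltime": "05:00:00"},  # slow box
    {"owner": "dteam004", "cput": "02:00:00", "walltime": "02:15:00"},  # fast box
]
print(usage_by_user(jobs))               # similar CPU cost on both boxes
print(usage_by_user(jobs, "walltime"))   # wall time would penalise the slow box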
	
	Martin.
	--
	  -------------------------------------------------------
	    Martin Bly  |  +44 1235 446981  |  [log in to unmask]
	       Systems Admin, Tier 1/A Service,  RAL PPD CSG
	  -------------------------------------------------------
	
	> -----Original Message-----
	> From: Emanuele LEONARDI [mailto:[log in to unmask]]
	> Sent: Wednesday, October 15, 2003 6:34 PM
	> To: [log in to unmask]
	> Subject: Re: [LCG-ROLLOUT] RAL status
	>
	>
	> Hi John.
	>
	> I was talking about the default batch system which is used in the
	> example installation we provide. As we distribute it to everybody, we
	> have to use a free one and OpenPBS was the simplest (but not the only)
	> choice.
	>
	> Any site is of course free to choose whichever batch system
	> they want as
	> long as they know how to interface it to globus. Some examples for LSF
	> and Condor are indeed already available and at CERN the batch
	> system of
	> choice will be LSF, already in use on lxbatch.
	>
	> In any case, if you are using OpenPBS on a large scale at RAL, it
	> would be very interesting to see your configuration for the scheduler.
	> But I guess you have the Professional version there.
	>
	> Cheers
	>
	>                 Emanuele
	>
	> Gordon, JC (John) wrote:
	> > Emanuele, Martin, the position Ian laid out at GDB last week was that
	> > it was up to sites which batch system they use, but LCG would be adding
	> > support for some popular ones (like LSF, Condor). At RAL we have been
	> > using PBS as our main batch system for some time and have no short-term
	> > intention to change. I was not aware that we experienced such problems
	> > (Andrew may correct me), so I am curious why LCG1 sees them. Are we
	> > running a different version or configuring/using it in a different way?
	> > Perhaps we should be changing to the professional version.
	> >
	> > John
	> >
	> > -----Original Message-----
	> > From: Emanuele LEONARDI [mailto:[log in to unmask]]
	> > Sent: 15 October 2003 17:53
	> > To: [log in to unmask]
	> > Subject: Re: [LCG-ROLLOUT] RAL status
	> >
	> >
	> > Hi Martin.
	> >
	> > I'll tell you what I understood of this syndrome. Maarten and David
	> > may add more details, as they have investigated the issue in depth.
	> >
	> > The PBS scheduler problem is apparently a known one: on some occasions
	> > the scheduler either crashes or, worse, keeps running but in a
	> > "confused" mode. In both cases restarting it is supposed to put things
	> > back on track. I understand that this specific syndrome might be
	> > related to the way we are using it: this will need more investigation
	> > if we decide to keep using PBS as the default batch system for small
	> > sites.
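As a rough illustration of the "restart it" workaround described above (not part of LCG or OpenPBS; the init-script path and the use of pidof are assumptions about the local setup), a cron-driven watchdog could look like this:

#!/usr/bin/env python
# Watchdog sketch: restart pbs_sched if no such process is running.
# The init-script path and the availability of pidof are assumptions.
import subprocess

INIT_SCRIPT = "/etc/init.d/pbs_sched"   # assumed location, adjust per site

def scheduler_running():
    """True if a pbs_sched process exists (pidof exits 0 when it finds one)."""
    return subprocess.call(["pidof", "pbs_sched"],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

if __name__ == "__main__":
    if not scheduler_running():
        print("pbs_sched not running - restarting it")  # cron mails this to the admin
        subprocess.call([INIT_SCRIPT, "restart"])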
	> >
	> > The fact that restarting it was not sufficient to fix your system is
	> > due to a second problem: once in a while the qstat command issued by
	> > the jobmanager to check whether the job is still queued/running fails.
	> > This condition was not handled correctly, so when this happened the
	> > gatekeeper thought the job had finished and cleared the corresponding
	> > gass_cache area.
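The actual jobmanager fix is in the new LCG release mentioned below; purely as a sketch of the idea, the check could retry qstat and only declare the job gone when qstat answers and no longer knows the job id (the "Unknown Job Id" string is an assumption about OpenPBS's error text):

# Sketch only: treat a qstat failure as transient unless qstat itself reports
# that the job id is unknown.
import subprocess
import time

def job_is_known(job_id, retries=3, delay=30):
    """Return True if PBS still knows job_id, retrying transient qstat failures."""
    for _ in range(retries):
        result = subprocess.run(["qstat", job_id], capture_output=True, text=True)
        if result.returncode == 0:
            return True                       # job is queued, running or held
        if "Unknown Job Id" in result.stderr:
            return False                      # PBS answered: the job really is gone
        time.sleep(delay)                     # qstat itself failed; try again later
    return True   # still unsure after retries - keep the gass_cache, do not clean up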
	> >
	> > As the job was in fact still queued, when the PBS server eventually
	> > sent it to the WN, pbs_mom would look for the corresponding gass_cache
	> > files and, as they were not there anymore, it would fail to start the
	> > job. Now comes the nasty bit: PBS has an internal mechanism, probably
	> > put in place to handle slow shared filesystems, which assumes that the
	> > missing files are due to some slowness in passing the files to the WN,
	> > so it queues the job again, thus creating the infinite loop you
	> > observed.
	> >
	> > As you noticed, I was using the past tense in the description: the new
	> > release (which I am testing right now at CERN) is supposed to fix at
	> > least the qstat issue, so that in principle, even if the scheduler can
	> > still have problems, restarting it should really solve them.
	> >
	> > BTW, when this happened at CERN last week, I just qdel'ed the looping
	> > jobs and restarted the services: apparently this was enough to clean
	> > up the system.
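A rough sketch of that clean-up, assuming the looping jobs are the ones stuck in the W state and that the PBS daemons are driven by init scripts (both assumptions about the local setup):

# Clean-up sketch: qdel jobs stuck in the W state, then restart the PBS daemons.
# The qstat column layout and the init-script paths are assumptions.
import subprocess

def waiting_jobs():
    """Return ids of jobs that plain `qstat` reports in the W state."""
    output = subprocess.run(["qstat"], capture_output=True, text=True).stdout
    jobs = []
    for line in output.splitlines()[2:]:           # skip the two header lines
        fields = line.split()
        if len(fields) >= 6 and fields[4] == "W":  # 5th column is the job state
            jobs.append(fields[0])
    return jobs

if __name__ == "__main__":
    for job_id in waiting_jobs():
        subprocess.call(["qdel", job_id])
    for init_script in ("/etc/init.d/pbs_server", "/etc/init.d/pbs_sched"):
        subprocess.call([init_script, "restart"])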
	> >
	> > Cheers
	> >
	> >                  Emanuele
	> >
	> >
	> > Martin Bly wrote:
	> >
	> >>Hi Folks,
	> >>
	> >>A tale of woe for you...
	> >>
	> >>The pbs_sched scheduler crashed last week without us noticing.  When
	> >>we came to investigate on Monday, it restarted OK, but the queue of
	> >>jobs had reached 80+, none of which would work - they appeared to run
	> >>for a bit then entered the W state.
	> >>
	> >>Looking at the WNs and the CE, it appeared superficially that they
	> >>couldn't talk to each other using the known hosts mechanism.  However,
	> >>simple copies using scp worked both ways from either end.  The log
	> >>messages from the failed communications are these:
	> >>
	> >>10/13/2003 17:19:54;0004;
	> >>pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;Unable to copy file
	> >>globus-cache-export.u6v4sh.gpg from
	> >>lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.u6v4sh/globus-cache-export.u6v4sh.gpg
	> >>10/13/2003 17:19:54;0004;
	> >>pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;lcgce01.gridpp.rl.ac.uk:
	> >>Connection refused
	> >>10/13/2003 17:19:54;0004;
	> >>pbs_mom;Fil;globus-cache-export.u6v4sh.gpg;sh/globus-cache-export.u6v4sh.gpg:
	> >>No such file or directory
	> >>10/13/2003 17:19:54;0008;   pbs_mom;Req;del_files;cannot stat
	> >>globus-cache-export.u6v4sh.gpg
	> >>10/13/2003 17:20:51;0080;   pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
	> >>[log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
	> >>globus-cache-export.Ruvcei.gpg status=1, try=1
	> >>10/13/2003 17:21:23;0080;   pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r
	> >>[log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
	> >>globus-cache-export.Ruvcei.gpg status=1, try=2
	> >>10/13/2003 17:21:34;0080;   pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
	> >>[log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
	> >>globus-cache-export.Ruvcei.gpg status=1, try=3
	> >>10/13/2003 17:22:05;0080;   pbs_mom;Fil;sys_copy;command: /usr/sbin/pbs_rcp -r
	> >>[log in to unmask]:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
	> >>globus-cache-export.Ruvcei.gpg status=1, try=4
	> >>10/13/2003 17:22:26;0004;
	> >>pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;Unable to copy file
	> >>globus-cache-export.Ruvcei.gpg from
	> >>lcgce01.gridpp.rl.ac.uk:/home/dteam004/globus-cache-export.Ruvcei/globus-cache-export.Ruvcei.gpg
	> >>10/13/2003 17:22:26;0004;
	> >>pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;lcgce01.gridpp.rl.ac.uk:
	> >>Connection refused
	> >>10/13/2003 17:22:26;0004;
	> >>pbs_mom;Fil;globus-cache-export.Ruvcei.gpg;ei/globus-cache-export.Ruvcei.gpg:
	> >>No such file or directory
	> >>10/13/2003 17:22:26;0008;   pbs_mom;Req;del_files;cannot stat
	> >>globus-cache-export.Ruvcei.gpg
	> >>10/13/2003 17:47:04;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from
	> >>addr 130.246.183.182:1023
	> >>
	> >>The problem appears to be that the job manager system is not creating
	> >>the appropriate bundle for the client pbs_mom to copy - the job start
	> >>fails, and the copy-back also fails.  How many retries
	> >>pbs_server/pbs_sched will make is not clear.
	> >>
	> >>A CE reboot does not clear this problem.
	> >>
	> >>In the absence of any logs which might tell me which CE service config
	> >>is b******d, Steve Traylen and I decided to reinstall and reconfigure
	> >>openpbs from scratch (terminally removing all queued jobs).  This
	> >>appears to have had the desired effect - the CE is now running.
	> >>
	> >>I believe this is the second time I've been forced to resort to this
	> >>tactic for this problem - so does anyone know:
	> >>
	> >>a) where to look to find out why openpbs gets so confused, or where
	> >>the job manager bit that interfaces to PBS keeps its logs/config data?
	> >>
	> >>b) what causes the problem in the first place?
	> >>
	> >>and
	> >>
	> >>c) why pbs_sched crashes?
	> >>
	> >>As of 17:08 today, we seem to be running OK.
	> >>
	> >>Martin.
	> >>
	> >>P.S. I'll be at HEPiX next week and on holiday the week after - Steve
	> >>Traylen will be looking after our lcg1 nodes.
	> >
	> >
	> >
	> > --
	> > /------------------- Emanuele Leonardi -------------------\
	> > | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
	> > |  IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23  |
	> > \---------------------------------------------------------/
	>
	>
	> --
	> /------------------- Emanuele Leonardi -------------------\
	> | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
	> |  IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23  |
	> \---------------------------------------------------------/
	>