On Thu, 12 May 2005, Burke, S (Stephen) wrote:
> Testbed Support for GridPP member institutes
> > [mailto:[log in to unmask]] On Behalf Of Henry Nebrensky said:
> > Or to put it another way, we have ~600 100-hour jobs queued up
> > for a single worker node.
>
> Rod Walker recently reported something similar - who are the users?
Ricardo Graciani - i.e. LHCb jobs.
> > This does raise a few minor questions, like doesn't the RB keep track
> > of where it has recently sent jobs, in order to make sensible choices
> > about where to send later ones... (a 3-year ETT can't be good!)
>
> No, the RB doesn't keep track of anything. If people use the default ERT
> ranking it should indeed stop jobs piling up, although there are still a
> lot of problems with the way it's calculated. However, the Rank is set
> by the user and if they pick something different there is nothing to
> stop all jobs going to the same place.
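(For completeness: the default rank, if I'm remembering the JDL correctly, is
along the lines of

  Rank = - other.GlueCEStateEstimatedResponseTime;

i.e. send the job to the CE publishing the shortest estimated response time -
so a stale ERT like ours feeds straight into the matchmaking.)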
>
> > But my actual questions are:
> > 1) Just how much caching is in the GRIS-GIIS-BDII system?
>
> It shouldn't be more than a few minutes - although you should add PBS
> itself into the list of things which cause delays. Try asking Laurence
> if you're seeing very long delays.
It looks like it was well over 24 hours - the current job arrived Wednesday
evening and only started running at 9 a.m. on Thursday.
So we seem to have been publishing stale info for quite a long while...
I'd suspect the CE GRIS of falling over, but it restarted without
complaining.
Unhelpfully, restarting the lcg-bdii service also wipes out its log file.
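(Next time I should compare what the GRIS is actually publishing with what PBS
says, with something like

  ldapsearch -x -H ldap://dgc-grid-35.brunel.ac.uk:2135 -b "mds-vo-name=local,o=grid" \
      '(GlueCEUniqueID=*)' GlueCEStateEstimatedResponseTime GlueCEStateWaitingJobs GlueCEStateRunningJobs

against the CE GRIS - 2135 being the usual GRIS port, attribute names quoted
from memory - and the equivalent query against the BDII on port 2170, with the
appropriate base DN, to see how far behind it is.)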
> > We have a second node free which is supposed to be dedicated to the
> > short queue, so that the monitoring jobs can get in past any jobs on the
> > production node, and indeed things like the SFT do prefer to use that
> > short queue node. How can I find out what they're stuck waiting for?
>
> You could try qstat -f on the job IDs.
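(For the jobs here that's qstat -f 2337.dgc-grid-35.brunel.ac.uk and so on,
run on the CE.)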
Well, the running job has
Resource_List.cput = 80:00:00
Resource_List.neednodes = 1#qlong
Resource_List.nodect = 1
Resource_List.nodes = 1#qlong
...
comment = Job started on Thu May 12 at 09:34
etime = Wed May 11 16:42:16 2005
while the production one stuck behind it (2337) has
Resource_List.cput = 80:00:00
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 100:00:00
...
comment = Not Running: Not enough of the right type of nodes are available
etime = Wed May 11 16:43:16 2005
Which makes sense. The first stuck short job has
Resource_List.cput = 00:59:59
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 02:00:00
...
comment = Not Running: Draining system to allow starving job to run
etime = Thu May 12 17:58:48 2005
as do those behind it (including my test job, which arrived from the RB
~5 hours after being submitted... and ~4 hours after the cancel request...).
Hmm: the other infinite queue jobs also have
comment = Not Running: Draining system to allow starving job to run
The PBS scheduler log has
05/13/2005 00:42:57;0040; pbs_sched;Job;3002.dgc-grid-35.brunel.ac.uk;Draining system to allow 2337.dgc-grid-35.brunel.ac.uk to run
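(That "starving job" behaviour looks like the stock pbs_sched one: in
sched_config, under sched_priv on the CE, there should be lines along the
lines of

  help_starving_jobs:   true    ALL
  max_starve:           24:00:00

so once 2337 had been queued for longer than max_starve, the scheduler stopped
starting anything else until it could run - which would fit the timing above.
Option names and the 24-hour default are from memory, so treat as approximate.)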
Could 2337 be some MPI thing? But it only wants one node and other short
queue jobs have run since the current long one started. The only thing I
can think of is that the PBS server/scheduler on the CE has managed to
"forget" that the short queue node services the short queue...
Are pbs_sched/pbs_server safe to restart while things are running?
Henry
--
Dr. Henry Nebrensky [log in to unmask]
http://people.brunel.ac.uk/~eesrjjn
"The opossum is a very sophisticated animal.
It doesn't even get up until 5 or 6 p.m."