[root@dgc-grid-35 root]# qstat -q

server: dgc-grid-35.brunel.ac.uk

Queue            Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
short              --   00:59:59 02:00:00  --    0   5 --  E R
long               --   12:00:00 24:00:00  --    0   0 --  D S
infinite           --   80:00:00 100:00:0  --    1 581 --  D R
                                               --- ---
                                                 1 586
Or to put it another way, we have ~600 100-hour jobs queued up for a
single worker node.
This does raise a few minor questions, such as: doesn't the RB keep
track of where it has recently sent jobs, so that it can make sensible
choices about where to send later ones? (A three-year ETT can't be
good!)
But my actual questions are:
1) Just how much caching is there in the GRIS-GIIS-BDII system? I only
noticed a problem when I had trouble submitting a local test job and
found the GIIS monitor listing us as empty. It then took another half
hour, including restarts of the CE GRIS and the GIIS/BDII (both
apparently running fine - no error messages when stopping them), before
anything off-site seemed to realise we're swamped - if I hadn't already
set the queue to drain, goodness knows how many more jobs we'd have
sucked in during that time. And closer inspection suggests the jobs
have been queueing up since last night (i.e. ~24 hours ago)...
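For concreteness, the sort of check I have in mind is a straight LDAP
query against the CE GRIS and the BDII, something like the following
(ports and base DN quoted from memory, so treat them as approximate):

  ldapsearch -x -H ldap://dgc-grid-35.brunel.ac.uk:2135 \
      -b "mds-vo-name=local,o=grid" | grep -i GlueCEState

  ldapsearch -x -H ldap://dgc-grid-35.brunel.ac.uk:2170 \
      -b "mds-vo-name=local,o=grid" | grep -i GlueCEState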
2) Why have we got 5 short-queue jobs stuck there???
[root@dgc-grid-35 root]# pbsnodes -a
dgc-grid-36.brunel.ac.uk
     state = free
     np = 1
     speed = 0
     properties = lcgpro,qshort
     ntype = cluster

dgc-grid-37.brunel.ac.uk
     state = job-exclusive
     np = 1
     speed = 0
     properties = lcgpro,qshort,qlong
     ntype = cluster
     jobs = 0/2336.dgc-grid-35.brunel.ac.uk
Qmgr: list queue short
Queue short
    queue_type = Execution
    total_jobs = 5
    state_count = Transit:0 Queued:5 Held:0 Waiting:0 Running:0 Exiting:0
    resources_max.cput = 00:59:59
    resources_max.walltime = 02:00:00
    resources_assigned.nodect = 0
    required_property = qshort
    enabled = True
    started = True

Qmgr: list queue infinite
Queue infinite
    queue_type = Execution
    total_jobs = 582
    state_count = Transit:0 Queued:581 Held:0 Waiting:0 Running:1 Exiting:0
    resources_max.cput = 80:00:00
    resources_max.walltime = 100:00:00
    resources_assigned.nodect = 1
    required_property = qlong
    enabled = False
    started = True
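(The "enabled = False" on the infinite queue above is the drain I
mentioned - i.e. the usual qmgr incantation, something like

  qmgr -c "set queue infinite enabled = false"

so nothing new gets accepted into the queue, while anything already
queued or running carries on as before.)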
We have a second node free that is supposed to be dedicated to the
short queue, so that monitoring jobs can get in past any jobs on the
production node - and indeed things like the SFTs do prefer to use that
short-queue node. How can I find out what these five jobs are stuck
waiting for?
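In case it helps to be concrete, I'm hoping for something along the
lines of

  qstat -f <jobid>    # full job attributes, inc. any scheduler "comment"
  checkjob <jobid>    # Maui's view of why the job isn't starting
  showq               # Maui's overall job listing

where <jobid> is one of the five stuck jobs (and the Maui commands only
apply if Maui really is the one doing the deciding) - but I'm not sure
which of these, if any, will actually say anything useful here.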
Thanks
Henry
--
Dr. Henry Nebrensky [log in to unmask]
http://people.brunel.ac.uk/~eesrjjn
"The opossum is a very sophisticated animal.
It doesn't even get up until 5 or 6 p.m."