PS: Rob Fay, who is off today, knows a lot more than I do about this 
issue. I'll talk to him when he's back.

Steve


On 05/31/2012 10:31 AM, Stephen Jones wrote:
> Matt,
>
> this is just a hunch. When you have "runnable" single jobs, are there 
> unrunnable whole-node-jobs in the queue in front of them?
>
> Reason for asking: Maui pops from the job queue only until it hits the 
> first unrunnable (for whatever reason) job. So it never looks deeper 
> into the queue beyond the first unrunnable job - there may be 
> runnable jobs in the queue but Maui would never reach them. Instead, 
> it applies some tetris-style "backfilling" algorithm (which is 
> broken).
>
> The problem "may" be down to this scenario (it's only an idea). 
> Say the queue is sorted as follows: W1,W2,W3,S1,S2,S3 (i.e. 
> three whole-node jobs, then three single jobs) and let us say his 
> cluster has two whole-worker-nodes (WN1,WN2) and two 
> single-worker-nodes (SN1,SN2). On a scheduler cycle, W1 and W2 would 
> be dispatched to WN1 and WN2, leaving the queue as W3,S1,S2,S3. Maui 
> cannot schedule the next job (W3) as no node can take it. Maui does 
> not look deeper into the job queue, as stated above. So, even though 
> S1, S2 and S3 "could" be scheduled, they are not. Instead, the 
> "backfilling" algorithm is invoked, which is supposed to "gap fill" 
> the other jobs. As I said, I am reliably told it is broken in some 
> way. I don't know how, but it leaves queued jobs just sitting there 
> even when slots exist to run them.
>
> Summary: you'll only get the single-jobs to run when there are no 
> unrunnable whole-node-jobs in front of them. To test, kill the 
> unrunnable whole-node-jobs in front of the queued single-jobs - you 
> will then see the single-jobs start.
>
> Please let me know if this is the issue. I don't know of a fix yet. 
> I've been looking for a good excuse to fix this issue by rolling our 
> own Maui.
>
> Steve
>
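
The scenario in the quoted message can be sketched as a toy simulation. This is only an illustration of the hunch, not Maui's code: the function name strict_fifo_cycle is made up, each node is assumed to take exactly one job, and backfill is assumed to do nothing at all (the quoted message says only that it is "broken in some way").

```python
from collections import deque


def strict_fifo_cycle(queue, free_nodes):
    """One scheduler cycle that stops at the first unrunnable job.

    queue:      deque of (job_name, kind), kind is "whole" or "single"
    free_nodes: dict mapping kind -> list of free node names
    Returns the list of (job_name, node_name) pairs started this cycle.
    Hypothetical model: one job per node, no backfill.
    """
    started = []
    while queue:
        job, kind = queue[0]
        if not free_nodes[kind]:
            # Head of the queue cannot run: stop, never look deeper.
            break
        node = free_nodes[kind].pop(0)
        queue.popleft()
        started.append((job, node))
    return started


# Queue and nodes as in the example: W1,W2,W3,S1,S2,S3 on WN1,WN2,SN1,SN2.
queue = deque([("W1", "whole"), ("W2", "whole"), ("W3", "whole"),
               ("S1", "single"), ("S2", "single"), ("S3", "single")])
free = {"whole": ["WN1", "WN2"], "single": ["SN1", "SN2"]}

started = strict_fifo_cycle(queue, free)
print(started)                  # W1 and W2 start on WN1 and WN2
print([j for j, _ in queue])    # W3 blocks S1, S2, S3 although SN1, SN2 are free

# The suggested test: remove the unrunnable whole-node job at the head.
queue.popleft()
started_after = strict_fifo_cycle(queue, free)
print(started_after)            # S1 and S2 now start on SN1 and SN2
```

In this model the single jobs only start once W3 is removed from the head of the queue, which matches the test suggested in the summary. S3 still waits, but only because both single nodes are then busy.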


-- 
Steve Jones
System Administrator                    office: 220
High Energy Physics Division            tel (int): 42334
Oliver Lodge Laboratory                 tel (ext): +44 (0)151 794 2334
University of Liverpool                 http://www.liv.ac.uk/physics/hep/