Hi,
Back in the day, when I was experimenting with whole node jobs, we hit a similar-ish problem: with whole node jobs queued, normal jobs wouldn't start. We solved it by setting MAXIJOB on our whole node queue to 1 and increasing the global RESERVATIONDEPTH to 2, so there was always at least 1 priority reservation for a normal job.
http://www.gridpp.rl.ac.uk/blog/2011/03/29/whole-node-jobs-implementation-and-aftermath/
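If it helps, the relevant maui.cfg lines looked something like this (from memory, so treat it as a sketch - "mcore" stands in for whatever your whole node queue is called):

  CLASSCFG[mcore]     MAXIJOB=1  # consider at most 1 idle whole node job at a time
  RESERVATIONDEPTH[0] 2          # reserve for the top 2 priority jobs, so at least 1 is a normal job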
Derek
On 30 May 2012, at 12:36, Stuart Purdie wrote:
> Some thoughts inline
>
> On 30 May 2012, at 11:50, Matt Doidge wrote:
>
>> Hello all,
>> A few weeks back we split off a portion of one of our queues for some
>> ATLAS whole-node job tests on our Torque/Maui cluster. The split went
>> smoothly, and the "multicore" queue is working well. But for some
>> reason Maui has become lazy at scheduling jobs to the remaining
>> "single-core job" nodes, only keeping 2/3 of them full at any one time.
>
>> We're left
>> with ~100 free job slots at any one time and lots of waiting jobs.
>> Scheduling still occurs (all jobs seem to run eventually), but this is a
>> lot of cores going idle for no reason!
>
> How does that number of free job slots compare to the number of jobs or cores running/queueing for the multicore case?
>
>> Jobs can be forced to run on
>> these slots, but that's no way to run a batch system. After staring at
>> this until my eyes bled, I thought I'd ask my peers for help (after
>> Google forsook me).
>>
>> The technique we used to split the queues was simply to edit our Torque
>> "nodes" file so that some nodes had the "mult" feature while the rest
>> were given the "sing" feature.
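>> i.e. nodes file entries along these lines (node names made up):
>>
>>   node001 np=8 sing
>>   node051 np=8 mult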
>
> You might consider experimenting with using partitions instead of tagging nodes to split off the mcore jobs, and then putting a partition request on the queue by default.
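> Roughly, in maui.cfg (a sketch, with made-up node/partition names - check it against your setup before trusting it):
>
>   NODECFG[node051] PARTITION=mcore
>   NODECFG[node052] PARTITION=mcore
>   CLASSCFG[mcore]  PLIST=mcore PDEF=mcore
>
> Nodes without an explicit PARTITION stay in DEFAULT, so the normal queues carry on as before.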
>
> We do something akin to this to keep analysis jobs off the oldest generation of worker nodes, for example.
>
> One advantage of partitions is that they appear to be implemented a bit more simply internally, so they perform better. It's possible that it's just taking Maui a long time to process each node with the 'features' approach. Partitions also play nicer with reservations (see below).
>
> You could also consider whether you _need_ to dedicate specific nodes to the multicore jobs (it depends on hardware generations, of course). We don't; we just have a maximum limit on the total, which means that Atlas can end up with a different number of cores at some times than at others - not necessarily a bad thing.
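> (For the cap itself, I mean something like CLASSCFG[mcore] MAXPROC=128 in maui.cfg - the number is made up, and whether you cap processors or jobs is up to you.)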
>
>> Then we gave our current, normal queue
>> the requirement to use nodes with the "sing" feature (using qmgr,
>> resources_default.neednodes = sing), and created a new queue "mcore"
>> that required nodes with the "mult" feature.
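>> i.e. in qmgr, something like ("grid" is a stand-in for our normal queue's name):
>>
>>   set queue grid resources_default.neednodes = sing
>>   set queue mcore resources_default.neednodes = mult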
>
> You might want to look at the diagnose -r output.
>
> Multicore jobs work by Maui placing reservations on nodes for each job and then scheduling around them. It sounds like some of these reservations might be 'leaking' onto undesired nodes. Look for 'job' reservations and see if they're ending up where you expect.
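> Something like
>
>   diagnose -r | grep -i job
>
> and then comparing the node lists against your mult/sing split should show up any strays.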
>
> Also, check the RESERVATIONDEPTH - I can't recall the specific number, but you want something like twice the number of jobs per core, plus 1 per standing reservation, to ensure that there's space for Maui to store them all. When we hit that limit, though, there were specific errors in the Maui log.
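> As a worked example of that rule of thumb: with 8 job slots per node and 2 standing reservations, 2 x 8 + 2 = 18, so something like RESERVATIONDEPTH[0] 18 in maui.cfg (numbers made up - plug in your own).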
>
> The showq -i output can be instructive. If you find that the multicore jobs are _always_ at the top of the queue, then Maui tries extra hard to get them to run. It might be that reducing their priority just a little makes things work out better (perhaps a smaller basic priority, and a larger QUEUETIMEWEIGHT or XFACTORWEIGHT, to give stronger FIFO behaviour, i.e. simpler scheduling).
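> As a sketch in maui.cfg (all numbers made up):
>
>   CLASSCFG[mcore]  PRIORITY=10   # smaller basic priority
>   QUEUETIMEWEIGHT  10            # reward time spent in the queue
>   XFACTORWEIGHT    10
>
> so the multicore jobs still climb the queue, just less aggressively.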
>
> There are no solid answers in there, but hopefully it points at some areas you've not looked at yet (and in one of them lurks the answer).
--
Derek Ross ([log in to unmask])
Scientific Computing Technology Group
e-Science, STFC Rutherford Appleton Laboratory
+44 (0) 1235 445651