On Fri, Aug 24, 2012 at 12:39:37PM +0200, German Gomez-Herrero wrote:
> I am not an SGE expert but I don't think using h_rt was the right thing to do in
> the first place. It has the obvious disadvantage of killing processes if the
> h_rt estimates are wrong, in a way that can be difficult to track for a novice
> user. On the other hand the approach of using h_rt to influence SGE's queue
> sorting strategy is doomed to fail in many cases. For instance, in our case we
> don't set any h_rt in any of our queues but we set a much lower number of slots
> in our long.q so that we always have enough computing power available for
> shorter (usually more urgent) jobs. In this scenario setting h_rt in fsl_sub
> has no advantage (or has it?): since all queues fulfill the job requirements
> in terms of h_rt, any queue will be as good as any other for any job. This
> might in fact be counterproductive. I have not checked this so I am
> not sure, but I would guess that a very short job might end up in our
> verylong.q, just because short.q has a higher load at a given moment. As far as
> I know, once a queue fulfills the resource requirements, such requirements
> have no effect whatsoever on queue sorting, and the sorting is decided based on
> simple load considerations.
The scenario that you are describing (without enforced limits) can work
for you, and indeed h_rt has no positive effect. However, your scenario
relies on the assumption of cooperative users. A single misbehaving user
(one who submits lots of long jobs to a "short" queue) jeopardizes your cluster.
In the places where I have worked this happens automatically once you
have more than a handful of users, or at least two that do not meet at
the same coffee machine ;-)
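
The matching argument can be sketched as a toy model in Python. This is not
SGE's scheduler: the Queue class, the pick_queue function, and all load
numbers are made up for illustration. The only assumption, taken from the
description quoted above, is that a runtime request merely filters out
queues whose limit is too small, after which the least loaded queue wins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Queue:
    name: str
    h_rt_limit: Optional[int]  # queue runtime limit in seconds; None = no limit
    load: float                # hypothetical load figure, lower is better


def pick_queue(queues: List[Queue], requested_h_rt: int) -> Optional[Queue]:
    """Toy model: filter by runtime limit, then take the least loaded queue."""
    candidates = [
        q for q in queues
        if q.h_rt_limit is None or requested_h_rt <= q.h_rt_limit
    ]
    if not candidates:
        return None  # no queue can satisfy the request
    return min(candidates, key=lambda q: q.load)


# Without limits the request filters nothing, so load alone decides and a
# 10-minute job lands in verylong.q.
no_limits = [Queue("short.q", None, 0.9), Queue("verylong.q", None, 0.1)]
print(pick_queue(no_limits, 600).name)

# With a one-hour limit on short.q, a 10-hour request cannot match short.q,
# no matter how idle it is.
with_limits = [Queue("short.q", 3600, 0.1), Queue("verylong.q", None, 0.9)]
print(pick_queue(with_limits, 36000).name)
```

In this model the request only matters once at least one queue carries a
limit, which is the case the default setup is aimed at.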
> At this moment we have only two machines (one of them a server with 24 cores)
> in our grid but we are planning to connect many more in the future. Our
> mini-grid is shared by 15 people or so and that is why we use various queues to
> use our resources more effectively. But anyways, this is a pointless
> discussion. We have been using SGE for a long time now and we are quite
> satisfied. Condor looks like a great complement and we will surely give it a
> try.
Hmm, not sure what you are saying. The default setup of the Debian FSL
package is designed to work with as many SGE deployments as possible
without having to edit anything. It doesn't work for your particular case -- that
isn't nice, but you are in no worse position than with the original
fsl_sub from FMRIB -- you have to edit the file. Even if you tailor your
SGE instance to exactly mimic the one in Oxford, you still want to have
SGE's email sent to your domain.
> However, if neurodebian wants to promote Condor, then why modify the
> way fsl_sub submits jobs to SGE?
We do not decide on behalf of our users what they should consider the
best tool for a job (although we do have opinions). We try to offer as
many choices as possible. Our job is to integrate software into a larger
distribution ecosystem, and I tend to believe that the _slightly_
modified job submission of FSL led to improved integration with the rest
of Debian (such as Debian's SGE package). In three years you are the first
to report a problem with this setup.
But back to the actual problem: Does anyone SGE-savvy know whether SGE
treats s_rt like h_rt for resource matching? If that is the case, we
could maybe avoid adding the multiplier mechanism that Mark suggested
in another email in this thread.
Michael
--
Michael Hanke
http://mih.voxindeserto.de