Hi,
Here is what I have now:
if ($QueuedJobs > 0) {
    my $TCPU = ( $MaxRunningJobs < $TotalCPU ) ? $MaxRunningJobs : $TotalCPU;
    $MaxTime = (($TotalJobs * $WallTime) - $UsedTime) / $TCPU;
    if ( $MaxTime < 0 ) {
        $MaxTime = 99999999;
    }
} else {
    $MaxTime = 0;
}
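
For anyone who wants to try the numbers outside the plugin, here is a minimal standalone sketch of the same calculation. The sub name max_response_time and its argument order are made up for illustration, and the guard against zero usable CPUs is an extra assumption that is not in the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper; the name and argument list are illustrative only.
sub max_response_time {
    my ($queued, $total_jobs, $max_running, $total_cpu, $wall_time, $used_time) = @_;

    # Nothing queued: report a maximum time of zero, as in the snippet above.
    return 0 if $queued <= 0;

    # CPUs usable by this queue: the smaller of the queue limit and the cluster size.
    my $tcpu = ($max_running < $total_cpu) ? $max_running : $total_cpu;

    # Assumption (not in the original): with no usable CPUs, return the same
    # large sentinel instead of dividing by zero.
    return 99999999 if $tcpu <= 0;

    my $max_time = (($total_jobs * $wall_time) - $used_time) / $tcpu;

    # A negative value means the inputs were inconsistent; fall back to the sentinel.
    return ($max_time < 0) ? 99999999 : $max_time;
}

# Made-up numbers: 10 running + 5 queued jobs, 10 CPUs, 3600 s wall-clock limit,
# 18000 s already consumed by the running jobs.
print max_response_time(5, 15, 10, 10, 3600, 18000), "\n";
```

With these made-up inputs the result is (15 * 3600 - 18000) / 10 = 3600 seconds.
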
David McBride wrote:
> On Fri, 2005-12-16 at 11:04 +0100, Jeff Templon wrote:
>
> Hi Jeff,
>
>
>>my $TCPU = ( $MaxRunningJobs < $TotalCPU )? $MaxRunningJobs : $TotalCPU;
>>$MaxTime=(($TotalJobs * $WallTime) - $UsedTime) / $TCPU;
>>if ( $MaxTime < 0){
>> $MaxTime=99999999;
>> }
>
>
> Okay, this code would, when tidied up, look equivalent to this:
>
> # -------8<------------------------------------------------------------
>
> # Work out how many CPUs are available to service jobs in this queue.
> my $avail_cpus;
> if ($MaxRunningJobs < $TotalCPU) {
>     # Only $MaxRunningJobs may run on this queue, despite the fact
>     # we have more CPUs than jobs available.
>     $avail_cpus = $MaxRunningJobs;
> }
> else {
>     # The total number of CPUs on the cluster is our limiting
>     # factor.
>     $avail_cpus = $TotalCPU;
> }
>
> # Calculate the worst-case response time. In this simple model, every
> # job is expected to run right up to its wall_clock limit.
> # Once we have the maximum serial-runtime of *all* of the jobs
> # currently queued or running, we divide that by the number of CPUs
> # available to service jobs in this queue.
>
> my $wrt; # Our calculated worst-case-response time.
>
> # We assume that each job will take $WallTime to run. However,
> # we need to subtract the current total accumulated runtime of those
> # jobs already running to account for the fact that they have already
> # used some of their allotted time.
> my $serial_runtime = ($TotalJobs * $WallTime) - $UsedTime;
>
> # Now that we have the serial runtime of all currently running and
> # queuing jobs, we divide that by the number of available CPUs.
> $wrt = $serial_runtime / $avail_cpus;
>
> # Fudge: If, somehow, we've miscalculated and gotten a negative
> # worst-case response time, reset $wrt to A Large Number.
> if ($wrt < 0) { $wrt = 99999999 };
>
> # Our estimated response time is simply half our worst-case response
> # time.
>
> # -------8<------------------------------------------------------------
>
> (I haven't tested any of the above, there might be minor syntactical
> errors.)
>
>
>>I am going to make the change to QueuedJobs by hand here until I hear
>>something different. Hmm, on second thought that is even worse, since
>>MaxTime will be less than zero, so all ERTs will be huge.
>
>
> You don't want to do that. The code is trying to work out the
> worst-case runtime of every single job, queued and running. The upper
> limit on all of these jobs is clearly the wall-clock time limit.
>
> So, you need to work out how long the *current* jobs can run for -- ie:
>
> ($RunningJobs * $WallTime) - $UsedTime
>
> [ Where $RunningJobs == the number of jobs currently running on the
> cluster. This variable may not actually exist in the code, I made it
> up.]
>
> .. and you need to work out how long the queued jobs can run for -- ie:
>
> ($QueuedJobs * $WallTime)
>
> The worst-case response time is the sum of these two values.
>
> With a little bit of math:
>
> [1] $RunningJobs + $QueuedJobs = $TotalJobs
>
> [2] (($RunningJobs * $WallTime) - $UsedTime) +
> ($QueuedJobs * $WallTime) = $wrt
>
> [1,2] (($RunningJobs + $QueuedJobs) * $WallTime ) - $UsedTime = $wrt
>
> = ( $TotalJobs * $WallTime) - $UsedTime
>
> ... which is what the PBS code above calculates. If you change
> $TotalJobs to $QueuedJobs in the above calculation then the math just
> breaks. (As you observed, you'll usually get a negative $wrt.)
>
> If you removed the $UsedTime and changed $TotalJobs to $QueuedJobs, then
> you would effectively be ignoring the CPU time that has yet to be
> consumed by the jobs currently running on the cluster. That's probably
> not what you want.
>
> (Yes, I understand this too much. I had to understand WTF the above was
> doing when I came to implement my own lcg-info-dynamic-sge!)
>
> Cheers,
> David
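
To check the algebra in [1] and [2] with concrete numbers, here is a small sketch. All values are made up, $RunningJobs is the same invented variable as in the quoted text, and the totals are serial runtimes before any division by the CPU count.

```perl
use strict;
use warnings;

# Made-up example values, not taken from any real cluster.
my $RunningJobs = 10;
my $QueuedJobs  = 5;
my $TotalJobs   = $RunningJobs + $QueuedJobs;    # relation [1]
my $WallTime    = 3600;     # per-job wall-clock limit, in seconds
my $UsedTime    = 30000;    # time already consumed by the running jobs

# Remaining work split by job state, as in [2].
my $split = (($RunningJobs * $WallTime) - $UsedTime) + ($QueuedJobs * $WallTime);

# The combined form obtained from [1] and [2].
my $combined = ($TotalJobs * $WallTime) - $UsedTime;

# The variant with $TotalJobs replaced by $QueuedJobs.
my $queued_only = ($QueuedJobs * $WallTime) - $UsedTime;

print "split=$split combined=$combined queued_only=$queued_only\n";
```

With these values the split and combined forms both come to 24000 seconds, while the $QueuedJobs-only variant comes to -12000, which is the negative result described above.
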