All,
I've got the timeleft script in front of me, and I can see running jobs
with plenty of logging turned on.
I can step through this output tomorrow, compare the result of
TimeLeft.py with
the truth, and see if it tallies.
Hopefully, I can say something definitive soon.
Cheers,
Steve
On 12/03/2014 03:31 PM, Raja Nandakumar wrote:
> https://github.com/DIRACGrid/DIRAC/blob/5de889d3ed87259cf5de19b246dbee858a41896a/Core/Utilities/TimeLeft/TimeLeft.py
>
>
> Cheers,
> Raja.
>
> On 03/12/14 16:13, Stephen Jones wrote:
>> Vladimir, Mark, (Raja),
>>
>>
>> Note: Posting to TB_SUPPORT for general information
>>
>>
>> I have a theory about the behaviour of our CREAM/Torque cluster,
>> with respect to pillhb jobs that get timed out (setting aside
>> ARC/Condor).
>>
>> In Torque, jobs are killed when they run for more than the maximum
>> "walltime", which is set to 48 hours here. But the cputime and
>> walltime are scaled (see here:
>> https://www.gridpp.ac.uk/wiki/Publishing_tutorial). We calculated a
>> set of factors for the various CPUs at our site, e.g.
>>
>> BASELINE 1.0 ( abstract node type used for comparison/measurement)
>>
>> L5420 0.896
>>
>> E5620 1.205
>>
>> As a result, there are two types of "walltime". There is
>> REAL_walltime, which is the same as reading the clock on the wall,
>> and there is SCALED_walltime, which is the REAL_walltime multiplied
>> by the scaling factor. The killing of jobs is based on the
>> SCALED_walltime. Examples (assuming a SCALED_walltime limit of 48 hours):
>>
>> Job on L5420: REAL_walltime = 48 / 0.896, which is 53 hours, 34
>> minutes of real time.
>>
>> Job on E5620: REAL_walltime = 48 / 1.205, which is 39 hours, 50
>> minutes of real time.
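The arithmetic above can be sketched as a few lines of Python (the node
names and factors are the site-specific values quoted in this mail; the
function name is just for illustration, not anything from TimeLeft.py):

```python
# Sketch of the real-time budget calculation described above: the batch
# system enforces a fixed SCALED_walltime limit, so the real wall-clock
# budget on a node is that limit divided by the node's scaling factor.

SCALED_LIMIT_HOURS = 48.0

# Scaling factors measured at our site (BASELINE = 1.0)
FACTORS = {"L5420": 0.896, "E5620": 1.205}

def real_walltime_limit(node_type):
    """Return the real wall-clock budget (in hours) for a node type."""
    return SCALED_LIMIT_HOURS / FACTORS[node_type]

for node in ("L5420", "E5620"):
    hours = real_walltime_limit(node)
    h, m = int(hours), round((hours - int(hours)) * 60)
    print(f"{node}: {h}h {m}m of real time")  # 53h 34m and 39h 50m
```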
>>
>> So jobs can be killed before or after 48 hours of real time,
>> depending on the node type. A job on an L5420 would have 53+ hours,
>> but less than 40 hours on an E5620. There are no E5620 nodes in the
>> cluster now, but we do have records dating back to when there were.
>> Looking at that period, no pillhb jobs were killed off on L5420
>> nodes, while many were killed on the other node types, corroborating
>> this theory. This finding suggests the jobs are not applying the
>> scaling factor when they calculate how long they have left, so some
>> run out of time. On nodes where REAL_walltime exceeds SCALED_walltime
>> (i.e. L5420), the jobs always had enough time to finish. On nodes
>> where REAL_walltime is less than SCALED_walltime, the jobs sometimes
>> get killed, depending on how close they get to the limit.
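To make the suspected failure mode concrete, here is a hypothetical
sketch (I don't yet know what TimeLeft.py actually does, so both
functions and their names are illustrative, not its real API): a job
that measures its remaining budget from the raw wall clock, without the
scaling factor, overestimates its time left on fast nodes (factor > 1).

```python
# Hypothetical illustration of the suspected bug. The batch system kills
# jobs at a SCALED_walltime limit, so a correct estimate must convert
# real hours used into scaled hours before subtracting from the limit.

SCALED_LIMIT = 48.0  # hours

def time_left_naive(real_hours_used):
    # Suspected (wrong) behaviour: compare raw real time to the scaled limit.
    return SCALED_LIMIT - real_hours_used

def time_left_scaled(real_hours_used, factor):
    # Scaling-aware behaviour: convert real time to scaled time first.
    return SCALED_LIMIT - real_hours_used * factor

# After 40 real hours on an E5620 (factor 1.205):
print(time_left_naive(40.0))          # naive view: 8 hours remain
print(time_left_scaled(40.0, 1.205))  # scaled view: about -0.2, already over
```

On an L5420 (factor 0.896) the naive estimate errs the safe way, which
would explain why no pillhb jobs died on those nodes.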
>>
>> Please let me know how pillhb jobs compute "time left" so we can be
>> sure. I need to eliminate that before I can do much more on this.
>>
>> Cheers,
>>
>>
>> Steve
>>
>>
--
Steve Jones [log in to unmask]
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/