Hi Steve, Jeff,
Steve Traylen wrote:
> What are you waiting for me for, must have missed that one?
>
> I think from Martin's description these jobs at RAL are just
> processes that are left over after the batch job has finished in
> Torque? I think David G wrote some scripts for killing off these
> rogue processes at the end of jobs.
Actually, these processes are not 'rogue', since they are still part of
the process tree proper of the original pbs_mom-started user job. But by
forking enough, they manage to escape the CPU time accounting.
Maybe the condor_master is not wait(2)-ing on its children, so their
CPU time never makes it into the tms_cutime sum from times(2).
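To illustrate (just a minimal stand-alone demo in C, nothing taken from
the Condor or Torque sources): a child's CPU time only turns up in the
parent's tms_cutime once the parent has actually wait(2)-ed on it.

/* demo: un-reaped children contribute nothing to tms_cutime */
#include <stdio.h>
#include <sys/times.h>
#include <sys/wait.h>
#include <unistd.h>

static void burn_cpu(void)
{
    volatile unsigned long x = 0;
    unsigned long i;
    for (i = 0; i < 200000000UL; i++)
        x += i;
}

int main(void)
{
    struct tms t;
    long hz = sysconf(_SC_CLK_TCK);
    pid_t pid = fork();

    if (pid == 0) {              /* child: burn some CPU, then exit */
        burn_cpu();
        _exit(0);
    }

    sleep(5);                    /* give the child time to run */
    times(&t);                   /* child not reaped yet ... */
    printf("before wait: cutime = %.2f s\n", (double)t.tms_cutime / hz);

    waitpid(pid, NULL, 0);       /* reap the child */
    times(&t);                   /* ... now its CPU time shows up */
    printf("after  wait: cutime = %.2f s\n", (double)t.tms_cutime / hz);
    return 0;
}

Until that wait() happens the child's CPU time simply never appears in
the parent's times() figures, which would be consistent with what
Torque ends up (not) accounting.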
But my standard reaper will leave these jobs alone, as they are still
part of the process tree of pbs_mom and not really "daemonized" (i.e. they
are not children of init).
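Roughly, the distinction can be made by walking the PPid chain in /proc
and seeing whether you reach pbs_mom before you reach init. A sketch in
C (this is not the actual reaper script; parent_of()/is_daemonized()
and the pids in main() are made up for illustration):

#include <stdio.h>

/* Parent pid of 'pid', from field 4 of /proc/<pid>/stat; -1 on error. */
static int parent_of(int pid)
{
    char path[64];
    int ppid = -1;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%d/stat", pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%*d (%*[^)]) %*c %d", &ppid) != 1)
        ppid = -1;
    fclose(f);
    return ppid;
}

/* "Daemonized" here: the ancestor chain reaches init (pid 1) without
 * ever passing through the pbs_mom pid. */
static int is_daemonized(int pid, int pbs_mom_pid)
{
    int p;
    for (p = parent_of(pid); p > 1; p = parent_of(p))
        if (p == pbs_mom_pid)
            return 0;            /* still inside pbs_mom's tree */
    return 1;                    /* reached init (or lost the chain) */
}

int main(void)
{
    /* made-up example: job process 21442, pbs_mom running as pid 4321 */
    printf("daemonized: %d\n", is_daemonized(21442, 4321));
    return 0;
}

A process that still hangs off pbs_mom is left alone; only the truly
re-parented ones would be candidates for reaping.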
Aren't children and their behaviour wonderful? ;-)
Cheers,
DavidG.
>
> Steve
>
> On Oct 26, 2006, at 10:23 AM, Jeff Templon wrote:
>
>> Hi,
>>
>> we have seen them, but they are associated with proper ATLAS jobs so
>> are not draining our farm. what may be fooling you is the metric you
>> use. indeed, if you use CPU time as the primary metric, these jobs
>> will appear to have drained your entire farm. for some
>> reason, the CPU time used by these jobs does not get properly
>> accounted for by Torque.
>>
>> On the other hand, the wall time *does* get accounted for. This is
>> one reason why I keep pleading for wall time being the primary
>> accounting metric.
>>
>> I asked the Traylenator to look into why the CPU time isn't getting
>> caught by Torque, haven't heard back from him yet. Other volunteers
>> are welcome. I 'spect Mr. Walker will pipe up and say
>> something soon.
>
>
>> My take: we need to figure out why torque doesn't catch the CPU time,
>> and we need to account wall time, otherwise I think these jobs are fine.
>>
>> JT
>>
>> Bly, MJ (Martin) wrote:
>>
>>> Hi all,
>>> We have some WNs here that appear to be running agents for the Condor
>>> system, trying to do work on our WNs in opposition to the Torque/Maui
>>> batch/scheduling system and unknown to it:
>>> atlassgm 16125     1  0 Oct20 ? 00:01:08 condor_master -f
>>> atlassgm  2652 16125  0 Oct22 ? 00:04:25 condor_startd -f
>>> atlassgm 20839  2652  0 Oct25 ? 00:00:48 condor_starter -f higgs05.cs.wisc.edu
>>> atlassgm 20845 20839  0 Oct25 ? 00:00:00 /bin/sh --login /pool/4006441.csflnx353.rl.ac.uk/execute.130.246.180.112-16125/dir_20839/condor_exec.ex
>>> atlassgm 21442 20845 92 Oct25 ? 22:09:41 ./2Qgen
>>> In the above case, jobid 4006441 has been and gone in the batch system.
>>> The big problem appears to be that this is causing grief to Maui which
>>> is refusing to schedule any legitimate work, thus draining the whole
>>> farm.
>>> Anyone else seen this?
>>> This is causing a big hassle: we are terminating all such processing in
>>> order to get our capacity back online.
>>> Martin
>>> Tier1 Systems.
>
>
--
David Groep
** National Institute for Nuclear and High Energy Physics, PDP/Grid group **
** Room: H1.56 Phone: +31 20 5922179, PObox 41882, NL-1009DB Amsterdam NL **