Hi,
We have seen them, but they are associated with proper ATLAS jobs, so
they are not draining our farm. What may be fooling you is the metric
you use. Indeed, if you use CPU time as the primary metric, these jobs
will appear to have drained your entire farm. For some reason, the CPU
time used by these jobs does not get properly accounted for by Torque.
On the other hand, the wall time *does* get accounted for. This is one
reason why I keep pleading for wall time being the primary accounting
metric.
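To illustrate the difference between the two metrics, here is a minimal
Python sketch. The record layout, the function name and all the numbers
are made up for illustration; they are not taken from Torque or from
either site's accounting data. It assumes a two-slot farm observed for one
hour, where one job reports almost no CPU time to the batch system even
though it held its slot for the full hour.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical accounting record for one batch job."""
    job_id: str
    cpu_seconds: float   # CPU time the batch system managed to attribute
    wall_seconds: float  # time the job actually occupied its slot


def utilisation(records, slots, window_seconds, metric):
    """Fraction of farm capacity used in the window, by the chosen metric."""
    used = sum(getattr(r, metric) for r in records)
    return used / (slots * window_seconds)


# Illustrative numbers only.
records = [
    JobRecord("ordinary-job", cpu_seconds=3400.0, wall_seconds=3600.0),
    # The payload's CPU time is not attributed to this job by the batch system.
    JobRecord("pilot-job", cpu_seconds=60.0, wall_seconds=3600.0),
]

by_cpu = utilisation(records, slots=2, window_seconds=3600, metric="cpu_seconds")
by_wall = utilisation(records, slots=2, window_seconds=3600, metric="wall_seconds")

print(f"utilisation by CPU time:  {by_cpu:.0%}")
print(f"utilisation by wall time: {by_wall:.0%}")
```

With these invented figures the farm looks roughly half idle by CPU time
while being fully occupied by wall time.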
I asked the Traylenator to look into why the CPU time isn't getting
caught by Torque, but I haven't heard back from him yet. Other volunteers
are welcome. I 'spect Mr. Walker will pipe up and say something soon.
My take: we need to figure out why Torque doesn't catch the CPU time,
and we need to account for wall time; otherwise I think these jobs are fine.
JT
Bly, MJ (Martin) wrote:
> Hi all,
>
> We have some WNs here that appear to be running agents for the Condor
> system, trying to do work on our WNs in opposition to the Torque/Maui
> batch/scheduling system and unknown to it:
>
> atlassgm 16125 1 0 Oct20 ? 00:01:08 condor_master -f
> atlassgm 2652 16125 0 Oct22 ? 00:04:25 condor_startd -f
> atlassgm 20839 2652 0 Oct25 ? 00:00:48 condor_starter -f
> higgs05.cs.wisc.edu
> atlassgm 20845 20839 0 Oct25 ? 00:00:00 /bin/sh --login
> /pool/4006441.csflnx353.rl.ac.uk/execute.130.246.180.112-16125/dir_20839
> /condor_exec.ex
> atlassgm 21442 20845 92 Oct25 ? 22:09:41 ./2Qgen
>
> In the above case, jobid 4006441 has been and gone in the batch system.
>
>
> The big problem appears to be that this is causing grief to Maui which
> is refusing to schedule any legitimate work, thus draining the whole
> farm.
>
> Anyone else seen this?
>
> This is causing a big hassle: we are terminating all such processing in
> order to get our capacity back online.
>
> Martin
> Tier1 Systems.