Hi Winnie,

With regard to misbehaving CMS jobs, there was a long thread before Christmas starting here:
https://hypernews.cern.ch/HyperNews/CMS/get/comp-ops/3346.html

I don't think it was fully resolved, as there seemed to be more than one issue.

regards,
Daniela

On 4 January 2017 at 12:17, Winnie Lacesso wrote:
Dear All

Starting in December, IIRC (it was not happening before), we're seeing pilot
jobs that end up with 2 or more (up to 7) completely separate "job threads",
usually with different pool accounts owning the processes that use lots of
CPU. On a WN with (e.g.) 8 job slots, that can mean the load is anywhere
from 15 to 40 instead of 8.

Question1: Aren't properly configured cgroups supposed to prevent this? Dr
Kreczko thinks so & says he believes cgroups are properly configured at
RAL but must not be at Bristol. It's been on my to-do for a while to look
into this - I vaguely grok cgroups but don't know how to check/test if
they are configured properly. Can anyone say how to check/test if cgroups
are / aren't properly configured & is this done on the CE or WN or both or
what?
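
One quick thing that can be looked at on the WN side is which cgroup a
running job process actually sits in, since Linux exposes that in
/proc/<pid>/cgroup. The following is only a rough sketch, not a definitive
test of the batch system configuration: the helper names are made up for
illustration, the "/htcondor/slot1" path in the usage note is a hypothetical
example, and treating "process is in the root cgroup" as "not confined" is
an assumption, not something confirmed for any particular site.

```python
from pathlib import Path


def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into {controller: path}.

    Each line has the form 'hierarchy-ID:controller-list:cgroup-path'.
    On cgroup v2 the controller list is empty, so it is stored under ''.
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only; the path may contain ':'.
        _hier_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            result[controller] = path
    return result


def cgroups_of(pid):
    """Read and parse the cgroup membership of a live process (Linux only)."""
    return parse_proc_cgroup(Path(f"/proc/{pid}/cgroup").read_text())


def looks_confined(cgroups, controller="cpu"):
    """Heuristic (an assumption): a process in the root cgroup '/' for the
    given controller is not being confined by that controller."""
    return cgroups.get(controller, "/") != "/"
```

For example, looks_confined(cgroups_of(1007021)) would report whether that
cmsRun process is outside the root cpu cgroup. A line such as
"4:cpu,cpuacct:/htcondor/slot1" would count as confined, whereas
"4:cpu,cpuacct:/" would not.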

2. Example of too many jobs/threads, edited:
root@sm23> pstree -lp 685580

condor_pid_ns_i(685580)-+-condor_exec.exe(1006801)---python2(1006870)-+-bash(1006981)---cmsRun(1007021)

This is one thread/job, & top shows the process looking normal, using as much
CPU as it can:

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
1007021 cmsprd    30  10 1228m 437m 1892 R 100.0  2.7   2032:56 cmsRun

Then there's another:
`-condor_startd(691563)---condor_starter(3754765)---glexec(3756553)---SNIP--bash(3757189)---cmsRun(3757299)---{cmsRun}(3757710)

This is another thread/job & top shows the process looking normal:
    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
3757299 cms092    30  10 1662m 980m  39m R 97.4  6.2 629:23.31 cmsRun

I've seen up to 7 "threads" from one job.
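
To count these without reading pstree output by hand, something like the
sketch below could list every cmsRun process underneath one pilot, together
with the account that owns it. It is only an illustration: it assumes the
text comes from `ps -eo pid,ppid,user,comm`, and the function names are
invented here, not part of any existing tool.

```python
from collections import defaultdict


def parse_ps(text):
    """Parse `ps -eo pid,ppid,user,comm` output into a list of dicts."""
    procs = []
    for line in text.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, user, comm = parts
        procs.append({"pid": int(pid), "ppid": int(ppid),
                      "user": user, "comm": comm.strip()})
    return procs


def descendants(procs, root_pid):
    """Return every process below root_pid in the parent/child tree."""
    children = defaultdict(list)
    for p in procs:
        children[p["ppid"]].append(p)
    found, stack, seen = [], [root_pid], {root_pid}
    while stack:
        for child in children[stack.pop()]:
            if child["pid"] in seen:  # guard against odd ppid loops
                continue
            seen.add(child["pid"])
            found.append(child)
            stack.append(child["pid"])
    return found


def payloads_under(procs, root_pid, name="cmsRun"):
    """List (pid, user) for each process called `name` below root_pid."""
    return [(p["pid"], p["user"])
            for p in descendants(procs, root_pid) if p["comm"] == name]
```

Fed the output of `ps -eo pid,ppid,user,comm` on the WN,
payloads_under(parse_ps(output), 685580) would list the cmsRun processes
below that pilot. More than one entry, or more than one distinct user, is
the multi-"thread" pattern described here.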

In this case there are not too many jobs & the WN is not overloaded, but the
usual case is that these jobs end up battling lots of other similarly
many-threaded jobs & not getting much CPU at all, e.g.

 650507 cmsprd    30  10 62412  18m 4472 R 62.4  0.1   0:00.32 python
 650497 cmsprd    30  10 52180  26m 9916 R 48.7  0.2   0:00.34 cc1
 650504 cmsprd    30  10 39352  13m 2356 R 37.0  0.1   0:00.24 python
1780455 cms934    30  10 2319m 1.5g  22m R 34.7  9.6 147:18.41 cmsRun
2038697 cmsprd    30  10 60648  16m 4536 R 30.9  0.1   0:00.54 python

So this makes all the jobs inefficient, which is stupid.

Question2: Is anyone else seeing this? Are ATLAS jobs doing this too?
Why are pilot jobs suddenly starting to do this weirdness?

Interspersed with those are lots & lots of perfectly single-threaded
well-behaved jobs, so it's just SOME ... jobs/threads whatever are
becoming multiheaded monsters!

Very Grateful for advice!



--
Sent from the pit of despair

-----------------------------------------------------------
HEP Group/Physics Dep
Imperial College
London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/