Hi Winnie,

wrt misbehaving CMS jobs, there was a long thread before Christmas, starting here:
https://hypernews.cern.ch/HyperNews/CMS/get/comp-ops/3346.html

I don't think it was fully resolved as there seemed to be more than one issue.

regards,

Daniela

On 4 January 2017 at 12:17, Winnie Lacesso wrote:

Dear All

Starting IIRC in Dec (not happening before IIRC) we're seeing pilot jobs
that end up with 2 or more (3...7) completely separate "job threads", usually
with different pool accounts owning the processes using ++CPU. On a WN
with (eg) 8 jobslots, that can mean the load, instead of 8, is anywhere
from 15 to 40.

Question1: Aren't properly configured cgroups supposed to prevent this? Dr
Kreczko thinks so & says he believes cgroups are properly configured at
RAL but must not be at Bristol. It's been on my to-do for a while to look
into this - I vaguely grok cgroups but don't know how to check/test if
they are configured properly. Can anyone say how to check/test if cgroups
are / aren't properly configured & is this done on the CE or WN or both or
what?
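
One low-tech way to at least see which cgroup a given process sits in is to read /proc/<pid>/cgroup on the WN. The following is only a minimal sketch under assumptions, not a definitive test of a site's configuration: it assumes cgroup v1 style lines ("id:controllers:path"), and the helper names (parse_proc_cgroup, unconfined_controllers, report) are made up for illustration. A process whose path is "/" for the cpu or memory controller is sitting in the root cgroup, i.e. it has not been placed in a per-job group.

```python
def parse_proc_cgroup(text):
    """Parse the content of /proc/<pid>/cgroup into {controller: path}."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-id:controller-list:path".
        # Split at most twice so a ':' inside the path is preserved.
        _hierarchy_id, controllers, path = line.split(":", 2)
        for name in controllers.split(","):
            result[name] = path
    return result


def unconfined_controllers(mapping, wanted=("cpu", "cpuacct", "memory")):
    """Return the wanted controllers for which the process is in the
    root cgroup ("/") or for which no entry was found at all."""
    return [c for c in wanted if mapping.get(c, "/") == "/"]


def report(pid="self"):
    """Print the cgroup membership of one process (hypothetical helper)."""
    with open("/proc/%s/cgroup" % pid) as handle:
        mapping = parse_proc_cgroup(handle.read())
    for controller, path in sorted(mapping.items()):
        print("%-12s %s" % (controller, path))
    print("root cgroup for:", unconfined_controllers(mapping) or "none")
```

Calling report("1007021") as root on the WN would then list the groups for that cmsRun process. This only shows membership; it says nothing about whether the limits set on the group are sensible.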

2. Example of too many jobs/threads, edited:
root@sm23> pstree -lp 685580

condor_pid_ns_i(685580)-+-condor_exec.exe(1006801)---python2(1006870)-+-bash(1006981)---cmsRun(1007021)

This is one thread/job & top shows the process looking normal using as much
CPU as it can:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1007021 cmsprd 30 10 1228m 437m 1892 R 100.0 2.7 2032:56 cmsRun

Then there's another:
`-condor_startd(691563)---condor_starter(3754765)---glexec(3756553)---SNIP--bash(3757189)---cmsRun(3757299)---{cmsRun}(3757710)

This is another thread/job & top shows the process looking normal:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3757299 cms092 30 10 1662m 980m 39m R 97.4 6.2 629:23.31 cmsRun

I've seen up to 7 "threads" from one job.
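
To count these without reading pstree by eye, the same information can be pulled from ps. This is a rough sketch under assumptions: it expects the text produced by `ps -eo pid,ppid,user,pcpu,comm --no-headers`, and the function names are invented for illustration. It walks the parent/child links below one pilot PID and lists every process with a given command name (cmsRun here) together with its owner.

```python
def parse_ps(text):
    """Parse `ps -eo pid,ppid,user,pcpu,comm --no-headers` output
    into {pid: {"ppid", "user", "pcpu", "comm"}}."""
    procs = {}
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        pid, ppid, user, pcpu, comm = parts
        procs[int(pid)] = {
            "ppid": int(ppid),
            "user": user,
            "pcpu": float(pcpu),
            "comm": comm.strip(),
        }
    return procs


def descendants(procs, root):
    """Return all PIDs below `root` in the process tree."""
    children = {}
    for pid, info in procs.items():
        children.setdefault(info["ppid"], []).append(pid)
    found, stack = [], [root]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found


def payloads_under(procs, root, name="cmsRun"):
    """List (pid, user) for every process called `name` below `root`."""
    return sorted(
        (pid, procs[pid]["user"])
        for pid in descendants(procs, root)
        if procs[pid]["comm"] == name
    )
```

More than one entry returned for a single pilot PID, especially with different owners, would correspond to the multi-headed case described here.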

In this case there are not too many jobs & the WN is not overloaded, but the
usual case is that these jobs end up battling lots of other similarly
many-threaded jobs & not getting much CPU at all, eg

650507 cmsprd 30 10 62412 18m 4472 R 62.4 0.1 0:00.32 python
650497 cmsprd 30 10 52180 26m 9916 R 48.7 0.2 0:00.34 cc1
650504 cmsprd 30 10 39352 13m 2356 R 37.0 0.1 0:00.24 python
1780455 cms934 30 10 2319m 1.5g 22m R 34.7 9.6 147:18.41 cmsRun
2038697 cmsprd 30 10 60648 16m 4536 R 30.9 0.1 0:00.54 python

So this makes all the jobs inefficient, which is stupid.

Question2: Is anyone else seeing this? Are ATLAS jobs doing this too?
Why are pilot jobs suddenly starting to do this weirdness?

Interspersed with those are lots & lots of perfectly single-threaded
well-behaved jobs, so it's just SOME ... jobs/threads whatever are
becoming multiheaded monsters!

Very Grateful for advice!

Sent from the pit of despair

-----------------------------------------------------------
HEP Group/Physics Dep
Imperial College
London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/