Hi Winnie,

wrt misbehaving CMS jobs, there was a long thread before Christmas
starting here:
https://hypernews.cern.ch/HyperNews/CMS/get/comp-ops/3346.html

I don't think it was fully resolved as there seemed to be more than one
issue.
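
On Question 1 below (how to check whether cgroups are actually in
effect), here is a minimal sketch rather than a definitive test. It
assumes a Linux WN where /proc/<pid>/cgroup is readable, and it only
reports which cgroup each process belongs to; it does not check the
limits configured on that cgroup. The function names and the sample
paths are hypothetical, chosen for illustration. Run it on the WN with
the PIDs of the cmsRun processes from pstree: if the "cpu" entry is
just "/", that process sits in the root cgroup, which would suggest it
is not being confined.

```python
import sys
from pathlib import Path


def parse_cgroup(text):
    """Map each controller name to its cgroup path.

    `text` is the content of /proc/<pid>/cgroup, where every line has the
    form "hierarchy-ID:controller-list:cgroup-path".
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        _hierarchy, controllers, path = line.split(":", 2)
        for name in controllers.split(","):
            # name is "" for the cgroup v2 unified hierarchy ("0::/path")
            result[name] = path
    return result


def cgroups_of(pid):
    """Read and parse /proc/<pid>/cgroup for a running process."""
    return parse_cgroup(Path(f"/proc/{pid}/cgroup").read_text())


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_cgroup.py 1007021 3757299
    for pid in sys.argv[1:]:
        try:
            info = cgroups_of(pid)
        except OSError as err:
            print(f"{pid}: cannot read cgroup info ({err})")
            continue
        print(f"{pid}: cpu={info.get('cpu')} memory={info.get('memory')}")
```

Two cmsRun processes from the same slot showing different cgroup paths,
or "/" for either, would be a hint worth following up; identical
per-slot paths only show that the processes were placed in a cgroup,
not that the CPU share on it is set sensibly.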

regards,
Daniela

On 4 January 2017 at 12:17, Winnie Lacesso <[log in to unmask]>
wrote:

> Dear All
>
> Starting IIRC in Dec (not happening before IIRC) we're seeing pilot jobs
> that end up with 2 or more (3...7) completely separate "job threads", with
> usually different pool accounts owning the process using ++CPU. On a WN
> with (eg) 8 jobslots, that can mean their load instead of 8 is anywhere
> from 15 to 40.
>
> Question1: Aren't properly configured cgroups supposed to prevent this? Dr
> Kreczko thinks so & says he believes cgroups are properly configured at
> RAL but must not be at Bristol. It's been on my to-do for a while to look
> into this - I vaguely grok cgroups but don't know how to check/test if
> they are configured properly. Can anyone say how to check/test if cgroups
> are / aren't properly configured & is this done on the CE or WN or both or
> what?
>
> 2. Example of too many jobs/threads, edited:
> root@sm23> pstree -lp 685580
>
> condor_pid_ns_i(685580)-+-condor_exec.exe(1006801)---python2(1006870)-+-bash(1006981)---cmsRun(1007021)
>
> This is one thread/job & top shows the process looking normal using as much
> CPU as it can:
>
>     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 1007021 cmsprd    30  10 1228m 437m 1892 R 100.0  2.7   2032:56 cmsRun
>
> Then there's another:
> `-condor_startd(691563)---condor_starter(3754765)---glexec(3756553)---SNIP--bash(3757189)---cmsRun(3757299)---{cmsRun}(3757710)
>
> This is another thread/job & top shows the process looking normal:
>     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 3757299 cms092    30  10 1662m 980m  39m R 97.4  6.2 629:23.31 cmsRun
>
> I've seen up to 7 "threads" from one job.
>
> In this case there are not too many jobs & the WN is not overloaded, but the
> usual case is that these jobs end up battling lots of other similarly
> many-threaded jobs & not getting much CPU at all, eg
>
>  650507 cmsprd    30  10 62412  18m 4472 R 62.4  0.1   0:00.32 python
>  650497 cmsprd    30  10 52180  26m 9916 R 48.7  0.2   0:00.34 cc1
>  650504 cmsprd    30  10 39352  13m 2356 R 37.0  0.1   0:00.24 python
> 1780455 cms934    30  10 2319m 1.5g  22m R 34.7  9.6 147:18.41 cmsRun
> 2038697 cmsprd    30  10 60648  16m 4536 R 30.9  0.1   0:00.54 python
>
> So this makes all the jobs inefficient, which is stupid.
>
> Question2: is anyone else seeing this, are atlas jobs doing this too?
> Why are pilot jobs suddenly starting to do this weirdness?
>
> Interspersed with those are lots & lots of perfectly single-threaded
> well-behaved jobs, so it's just SOME ... jobs/threads whatever are
> becoming multiheaded monsters!
>
> Very Grateful for advice!
>



-- 
Sent from the pit of despair

-----------------------------------------------------------
[log in to unmask]
HEP Group/Physics Dep
Imperial College
London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/