Hi Winnie,

wrt misbehaving CMS jobs there was a long thread before Christmas starting here:
https://hypernews.cern.ch/HyperNews/CMS/get/comp-ops/3346.html
I don't think it was fully resolved as there seemed to be more than one issue.

Regards,
Daniela

On 4 January 2017 at 12:17, Winnie Lacesso wrote:
> Dear All
>
> Starting IIRC in Dec (not happening before IIRC) we're seeing pilot jobs
> that end up with 2 or more (3...7) completely separate "job threads", with
> usually different pool accounts owning the process using ++CPU. On a WN
> with (eg) 8 jobslots, that can mean their load instead of 8 is anywhere
> from 15 to 40.
>
> Question1: Aren't properly configured cgroups supposed to prevent this? Dr
> Kreczko thinks so & says he believes cgroups are properly configured at
> RAL but must not be at Bristol. It's been on my to-do for a while to look
> into this - I vaguely grok cgroups but don't know how to check/test if
> they are configured properly. Can anyone say how to check/test if cgroups
> are / aren't properly configured & is this done on the CE or WN or both or
> what?
>
> 2. Example of too many jobs/threads, edited:
> root@sm23> pstree -lp 685580
>
> condor_pid_ns_i(685580)-+-condor_exec.exe(1006801)---python2(1006870)-+-bash(1006981)---cmsRun(1007021)
>
> This is one thread/job & top shows the process looking normal using as much
> CPU as it can:
>
>     PID USER   PR NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
> 1007021 cmsprd 30 10 1228m 437m 1892 R 100.0  2.7   2032:56 cmsRun
>
> Then there's another:
> `-condor_startd(691563)---condor_starter(3754765)---glexec(3756553)---SNIP--bash(3757189)---cmsRun(3757299)---{cmsRun}(3757710)
>
> This is another thread/job & top shows the process looking normal:
>     PID USER   PR NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
> 3757299 cms092 30 10 1662m 980m  39m R  97.4  6.2 629:23.31 cmsRun
>
> I've seen up to 7 "threads" from one job.
>
> In this case there are not too many jobs & the WN is not overloaded, but
> the usual case is that these jobs end up battling lots of other similarly
> many-threaded jobs & not getting much CPU at all, eg
>
>  650507 cmsprd 30 10 62412  18m 4472 R  62.4  0.1   0:00.32 python
>  650497 cmsprd 30 10 52180  26m 9916 R  48.7  0.2   0:00.34 cc1
>  650504 cmsprd 30 10 39352  13m 2356 R  37.0  0.1   0:00.24 python
> 1780455 cms934 30 10 2319m 1.5g  22m R  34.7  9.6 147:18.41 cmsRun
> 2038697 cmsprd 30 10 60648  16m 4536 R  30.9  0.1   0:00.54 python
>
> So this makes all the jobs inefficient, which is stupid.
>
> Question2: is anyone else seeing this, are atlas jobs doing this too?
> Why are pilot jobs suddenly starting to do this weirdness?
>
> Interspersed with those are lots & lots of perfectly single-threaded
> well-behaved jobs, so it's just SOME ... jobs/threads whatever are
> becoming multiheaded monsters!
>
> Very Grateful for advice!
>

--
Sent from the pit of despair

-----------------------------------------------------------
HEP Group/Physics Dep
Imperial College London, SW7 2BW
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/
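
On the quoted question of how to check whether cgroups are in effect: on Linux, the cgroup membership of a process can be read from /proc/<pid>/cgroup on the node where that process runs, one line per hierarchy in the form "hierarchy-ID:controller-list:cgroup-path". The following is only a minimal Python sketch of reading that file. The function names are made up for illustration and are not part of HTCondor or any grid tooling, and it says nothing about whether the limits set on a cgroup are sensible.

```python
from pathlib import Path


def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of dicts."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hierarchy_id),
            # The controller list is comma-separated and may be empty.
            "controllers": [c for c in controllers.split(",") if c],
            "path": path,
        })
    return entries


def cgroups_for_pid(pid):
    """Read and parse /proc/<pid>/cgroup for a running process (Linux only)."""
    return parse_proc_cgroup(Path(f"/proc/{pid}/cgroup").read_text())


def in_root_cgroup_only(entries):
    """True if every hierarchy reports path "/", i.e. no sub-cgroup at all."""
    return all(entry["path"] == "/" for entry in entries)
```

Calling cgroups_for_pid() on the worker node with the PID of one of the cmsRun processes would show which cgroup, if any, the job has been placed in. If in_root_cgroup_only() returns True for a job process, that suggests the batch system is not putting jobs into per-job cgroups on that node, but this is a membership check only and would need confirming against the site's batch system configuration.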