Dear All
Starting in December, IIRC (it wasn't happening before then), we're seeing
pilot jobs that end up with 2 or more (sometimes 3 to 7) completely separate
"job threads", usually with a different pool account owning each process
that is using a lot of CPU. On a WN with (e.g.) 8 job slots, that can mean
its load is anywhere from 15 to 40 instead of 8.
Question 1: Aren't properly configured cgroups supposed to prevent this? Dr
Kreczko thinks so & says he believes cgroups are properly configured at
RAL but must not be at Bristol. It's been on my to-do list for a while to
look into this. I vaguely grok cgroups but don't know how to check/test
whether they are configured properly. Can anyone say how to check/test
this, & whether it is done on the CE, the WN, or both?
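One partial check I'm aware of, which only shows which cgroups a process sits in and not whether the batch system's limits are set sensibly: on Linux, /proc/<pid>/cgroup lists a process's cgroup membership, one "hierarchy-ID:controller-list:path" line per hierarchy. A minimal Python sketch of reading it follows; the function names are my own, and looks_confined is only a rough heuristic, not a real test of the configuration:

```python
from pathlib import Path


def parse_cgroup_lines(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-ID:controller-list:cgroup-path";
        # split at most twice because the path itself may contain ':'.
        hierarchy_id, controllers, path = line.split(":", 2)
        controller_list = controllers.split(",") if controllers else []
        entries.append((hierarchy_id, controller_list, path))
    return entries


def cgroups_of(pid):
    """Return the cgroup membership of a running process (Linux only)."""
    return parse_cgroup_lines(Path(f"/proc/{pid}/cgroup").read_text())


def looks_confined(entries):
    """Rough heuristic: True if any cgroup path is something other than the root '/'."""
    return any(path != "/" for _, _, path in entries)
```

Calling cgroups_of() on the WN for each cmsRun PID in the example below would show whether the separate "threads" share a cgroup. If every path comes back as "/", the process is in the root cgroup; a non-root path does not by itself prove that a per-job limit is applied.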
2. Example of too many jobs/threads, edited:
root@sm23> pstree -lp 685580
condor_pid_ns_i(685580)-+-condor_exec.exe(1006801)---python2(1006870)-+-bash(1006981)---cmsRun(1007021)
This is one thread/job & top shows the process looking normal, using as much
CPU as it can:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1007021 cmsprd 30 10 1228m 437m 1892 R 100.0 2.7 2032:56 cmsRun
Then there's another:
`-condor_startd(691563)---condor_starter(3754765)---glexec(3756553)---SNIP--bash(3757189)---cmsRun(3757299)---{cmsRun}(3757710)
This is another thread/job & top shows the process looking normal:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3757299 cms092 30 10 1662m 980m 39m R 97.4 6.2 629:23.31 cmsRun
I've seen up to 7 "threads" from one job.
In this case there are not too many jobs & the WN is not overloaded, but the
usual case is that these jobs end up battling lots of other similarly
many-threaded jobs & not getting much CPU at all, e.g.
650507 cmsprd 30 10 62412 18m 4472 R 62.4 0.1 0:00.32 python
650497 cmsprd 30 10 52180 26m 9916 R 48.7 0.2 0:00.34 cc1
650504 cmsprd 30 10 39352 13m 2356 R 37.0 0.1 0:00.24 python
1780455 cms934 30 10 2319m 1.5g 22m R 34.7 9.6 147:18.41 cmsRun
2038697 cmsprd 30 10 60648 16m 4536 R 30.9 0.1 0:00.54 python
So this makes all the jobs inefficient, which is stupid.
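To count the busy processes hanging off one pilot without eyeballing pstree, something like the following could work. It is a sketch under assumptions: it parses the text output of `ps -eo pid,ppid,user,pcpu,comm --no-headers`, the function names and the 50% threshold are my own, and note that ps reports %CPU averaged over the process lifetime rather than the instantaneous figure top shows.

```python
from collections import defaultdict


def parse_ps(text):
    """Parse `ps -eo pid,ppid,user,pcpu,comm --no-headers` output into a dict by PID."""
    procs = {}
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        pid, ppid, user, pcpu, comm = fields
        procs[int(pid)] = {
            "ppid": int(ppid),
            "user": user,
            "pcpu": float(pcpu),
            "comm": comm.strip(),
        }
    return procs


def busy_descendants(procs, root_pid, min_cpu=50.0):
    """Return sorted (pid, user, pcpu, comm) for descendants of root_pid at or above min_cpu."""
    # Build a parent -> children index so the tree can be walked downwards.
    children = defaultdict(list)
    for pid, info in procs.items():
        children[info["ppid"]].append(pid)

    busy, stack, seen = [], [root_pid], set()
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue  # guard against odd parent loops
        seen.add(pid)
        for child in children.get(pid, []):
            info = procs[child]
            if info["pcpu"] >= min_cpu:
                busy.append((child, info["user"], info["pcpu"], info["comm"]))
            stack.append(child)
    return sorted(busy)
```

With the pilot's top-level PID as root_pid (685580 in the example above), more than one entry in the result, especially under different pool accounts, would flag one of these multi-headed jobs.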
Question 2: Is anyone else seeing this? Are ATLAS jobs doing this too?
Why have pilot jobs suddenly started doing this?
Interspersed with those are lots & lots of perfectly well-behaved
single-threaded jobs, so it's just SOME ... jobs/threads/whatever that are
becoming multi-headed monsters!
Very grateful for advice!