Hi Winnie,
There are a few simple checks you can do to confirm that HTCondor is actually using cgroups. First, check that a cgroup is being created for each job: run "lscgroup" and you should see something like [1].
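Before that, it's worth confirming the startd is configured to use cgroups at all. A rough sketch, assuming condor_config_val and the BASE_CGROUP knob behave as on our nodes (the real binary is stubbed out with a shell function here so the snippet can be tried anywhere; delete the stub to query the actual configuration on a WN):

```shell
# Stand-in for the real condor_config_val binary on a worker node;
# remove this function definition to query the live HTCondor config.
condor_config_val() { echo "htcondor"; }

# BASE_CGROUP names the hierarchy HTCondor parents per-job cgroups under;
# an empty value means per-job cgroups are not in use.
base=$(condor_config_val BASE_CGROUP)
if [ -n "$base" ]; then
  echo "per-job cgroups enabled under /$base"
else
  echo "per-job cgroups disabled"
fi
```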
Next, check that CPU shares are being set correctly for each job by looking at the value of "cpu.shares". For a single-core job:
[root@lcg1984 ~]# cat /cgroup/cpu/htcondor/condor_pool_condor_slot1_2\@lcg1984.gridpp.rl.ac.uk/cpu.shares
100
and for a multicore job (8 cores in this case):
[root@lcg1984 ~]# cat /cgroup/cpu/htcondor/condor_pool_condor_slot1_3\@lcg1984.gridpp.rl.ac.uk/cpu.shares
800
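From the two values above my inference (not something I've checked against the docs) is that the shares scale at 100 per core, so an N-core slot should show 100*N:

```shell
# Inferred rule from the examples above: cpu.shares = 100 * (cores in slot).
cores=8
expected_shares=$((100 * cores))
echo "$expected_shares"   # for an 8-core slot this prints 800
```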
Regards,
Andrew.
[1]
[root@lcg1984 ~]# lscgroup
cpuset:/
cpu:/
cpu:/htcondor
cpu:[log in to unmask]
...
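As a further sanity check you can compare the number of per-job cgroups against the number of running jobs. A sketch using captured sample lines in place of real lscgroup output (the paths are the ones shown earlier in this mail; on a WN you would pipe lscgroup itself, and compare against e.g. the count of condor_starter processes):

```shell
# Sample lines standing in for real `lscgroup` output on a WN.
sample='cpu:/
cpu:/htcondor
cpu:/htcondor/condor_pool_condor_slot1_2@lcg1984.gridpp.rl.ac.uk
cpu:/htcondor/condor_pool_condor_slot1_3@lcg1984.gridpp.rl.ac.uk'

# One cgroup per job: count the entries below the htcondor hierarchy.
njobs=$(printf '%s\n' "$sample" | grep -c '^cpu:/htcondor/condor_')
echo "$njobs"   # 2 jobs in this sample
```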
________________________________________
From: Testbed Support for GridPP member institutes [[log in to unmask]] on behalf of Winnie Lacesso [[log in to unmask]]
Sent: Wednesday, January 04, 2017 12:17 PM
To: [log in to unmask]
Subject: pilot jobs -> 2 (up to 7) jobs, but not multicore? cgroups???
Dear All
Starting (IIRC) in December we've been seeing pilot jobs that end up with
2 or more (3...7) completely separate "job threads", usually with
different pool accounts owning the processes and burning lots of CPU. On a
WN with (eg) 8 job slots, that can mean the load is anywhere from 15 to 40
instead of 8.
Question 1: Aren't properly configured cgroups supposed to prevent this?
Dr Kreczko thinks so; he believes cgroups are properly configured at RAL
but must not be at Bristol. It's been on my to-do list for a while to look
into this - I vaguely grok cgroups but don't know how to check/test
whether they are configured properly. Can anyone say how to check/test
whether cgroups are / aren't properly configured, and is this done on the
CE or the WN, or both?
2. Example of too many jobs/threads, edited:
root@sm23> pstree -lp 685580
condor_pid_ns_i(685580)-+-condor_exec.exe(1006801)---python2(1006870)-+-bash(1006981)---cmsRun(1007021)
This is one thread/job, and top shows the process looking normal, using as
much CPU as it can:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1007021 cmsprd 30 10 1228m 437m 1892 R 100.0 2.7 2032:56 cmsRun
Then there's another:
`-condor_startd(691563)---condor_starter(3754765)---glexec(3756553)---SNIP--bash(3757189)---cmsRun(3757299)---{cmsRun}(3757710)
This is another thread/job & top shows the process looking normal:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3757299 cms092 30 10 1662m 980m 39m R 97.4 6.2 629:23.31 cmsRun
I've seen up to 7 "threads" from one job.
In this case there are not too many jobs & the WN is not overloaded, but
the usual case is that these jobs end up battling lots of other similarly
many-threaded jobs & not getting much CPU at all, eg
650507 cmsprd 30 10 62412 18m 4472 R 62.4 0.1 0:00.32 python
650497 cmsprd 30 10 52180 26m 9916 R 48.7 0.2 0:00.34 cc1
650504 cmsprd 30 10 39352 13m 2356 R 37.0 0.1 0:00.24 python
1780455 cms934 30 10 2319m 1.5g 22m R 34.7 9.6 147:18.41 cmsRun
2038697 cmsprd 30 10 60648 16m 4536 R 30.9 0.1 0:00.54 python
So this makes all the jobs inefficient, which is stupid.
Question 2: is anyone else seeing this? Are ATLAS jobs doing this too?
Why have pilot jobs suddenly started doing this weirdness?
Interspersed with those are lots & lots of perfectly single-threaded,
well-behaved jobs, so it's just SOME jobs/threads (whatever they are) that
are becoming multiheaded monsters!
Very Grateful for advice!