Hi Alessandra,
Quick unrelated note: if you use "-match 1", condor_history will stop looking after its first match; otherwise it searches through its entire history to the end, which can be very slow if you have a lot of entries.
To be honest, though, as nice as condor_history is, grep'ing the history file is usually _much_ faster :)
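For instance, a minimal sketch of the grep approach (the real file lives wherever the HISTORY configuration knob points, often /var/lib/condor/spool/history; a mock file stands in here so the snippet runs anywhere):

```shell
#!/bin/sh
# Sketch: grep an HTCondor history file directly instead of using
# condor_history. The path below is a mock; on a worker node use the
# file named by the HISTORY knob in the condor configuration.
HISTORY=$(mktemp)
cat > "$HISTORY" <<'EOF'
ClusterId = 66469
Owner = "atlpil017"
RequestMemory = 2000
ResidentSetSize_RAW = 34723028
EOF
# Pull out just the attributes of interest for the job.
MATCHES=$(grep -E '^(ClusterId|RequestMemory|ResidentSetSize_RAW) ' "$HISTORY")
echo "$MATCHES"
rm -f "$HISTORY"
```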
Thanks,
Gareth
On 17/10/2017, 17:44, "Testbed Support for GridPP member institutes on behalf of Alessandra Forti" <[log in to unmask] on behalf of [log in to unmask]> wrote:
Hi,
OK, that was the wrong example: condor_history is extremely slow, so it is
not easy to find the right job with the tools. This is a better example:
condor_history 66469.0 -autoformat ClusterId Owner RequestMemory ResidentSetSize_RAW ExitCode
66469 atlpil017 2000 34723028 0
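For what it's worth, plugging those numbers into the RAL expression by hand (a sketch; RequestMemory is in MB and ResidentSetSize_RAW in KB, so the factor 2000 amounts to a 2x limit expressed in KB) shows this job should indeed have been removed:

```shell
#!/bin/sh
# Sketch: evaluate RemoveMemoryUsage by hand for job 66469.
# RequestMemory is in MB, ResidentSetSize_RAW in KB, so
# 2000 * RequestMemory is 2x the request expressed in KB.
REQUEST_MB=2000                         # RequestMemory
RSS_KB=34723028                         # ResidentSetSize_RAW (~33 GiB)
THRESHOLD_KB=$((2000 * REQUEST_MB))     # 4000000 KB, i.e. ~3.8 GiB
if [ "$RSS_KB" -gt "$THRESHOLD_KB" ]; then
  VERDICT="over limit"
else
  VERDICT="within limit"
fi
echo "job 66469: $VERDICT (RSS ${RSS_KB} KB vs threshold ${THRESHOLD_KB} KB)"
```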
cheers
alessandra
On 17/10/2017 16:37, Andrew Lahiff wrote:
> Hi Alessandra,
>
> Note that RequestMemory is in MB and ResidentSetSize_RAW is in Kbytes, so that job shouldn't have been killed :-)
>
> Regards,
> Andrew.
>
> ________________________________
> From: Testbed Support for GridPP member institutes [[log in to unmask]] on behalf of Alessandra Forti [[log in to unmask]]
> Sent: Tuesday, October 17, 2017 4:28 PM
> To: [log in to unmask]
> Subject: Re: arc, htcondor, cgroups limit setup
>
> PS: For example, looking at one of these jobs
>
> condor_history 64000.0 -autoformat ClusterId Owner RequestMemory ResidentSetSize_RAW ExitCode
> 64000 atlpil017 2000 21852 0
>
> it's clear that the SYSTEM_PERIODIC_REMOVE removal doesn't work. 😒
>
>
> On 17/10/2017 16:12, Alessandra Forti wrote:
> Hi Andrew,
>
> On 12/10/2017 17:01, Andrew Lahiff wrote:
>
> Hi Alessandra,
>
> We're currently using 2x rather than 3x.
>
> By default on *7 HTCondor has:
>
> BASE_CGROUP=htcondor
>
> So for memory, for example, the cgroups for jobs appear in the usual place:
>
> /sys/fs/cgroup/memory/htcondor/...
>
> Why are you using BASE_CGROUP=/system.slice/condor.service?
>
> Because, when discussing CentOS7 worker nodes, that was the setup Brian suggested to DESY:
>
> https://www-auth.cs.wisc.edu/lists/htcondor-users/2017-March/msg00100.shtml
>
> and in a follow-up to me. I can see the jobs in the cgroup:
>
> [root@wn1904300 ~]# systemd-cgtop -n 1
>
> Path Tasks %CPU Memory Input/s Output/s
>
> / 238 - 43.8G - -
> /system.slice 65 - 43.8G - -
> /system.slice/atd.service 1 - - - -
> /system.slice/auditd.service 1 - - - -
> /system.slice/autofs.service 13 - - - -
> /system.slice/chronyd.service 1 - - - -
> /system.slice/condor.service 295 - 43.2G - -
> [log in to unmask] 11 - 970.2M - -
> [log in to unmask] 5 - 1.3G - -
> [log in to unmask] 13 - 4.3G - -
> [....]
>
> The amount of memory reported is also the same as what I read with ps_mem from /proc/<PID>/smaps, so that works OK. What doesn't work is that jobs don't get killed when they exceed the limit set by RemoveMemoryUsage = ( ResidentSetSize_RAW > 2000*RequestMemory ) in SYSTEM_PERIODIC_REMOVE, as I expected.
>
> I've now found this more recent presentation, which suggests also setting the limits in cgroups:
>
> https://research.cs.wisc.edu/htcondor/HTCondorWeek2017/presentations/WedDownes_cgroups.pdf
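A minimal sketch of what that would look like (assuming the standard knob name; "hard" makes the kernel enforce the cap directly rather than only reclaiming memory under pressure, as "soft" does):

```
# Sketch, not a verified config: have the kernel enforce the cgroup
# memory cap instead of only reclaiming under memory pressure.
CGROUP_MEMORY_LIMIT = hard
```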
>
> cheers
> alessandra
>
>
>
>
> When jobs are running can you see memory cgroups being successfully created?
>
> Regards,
> Andrew.
>
> ________________________________
> From: Testbed Support for GridPP member institutes [[log in to unmask]<mailto:[log in to unmask]>] on behalf of Alessandra Forti [[log in to unmask]<mailto:[log in to unmask]>]
> Sent: Thursday, October 12, 2017 4:30 PM
> To: [log in to unmask]<mailto:[log in to unmask]>
> Subject: arc, htcondor, cgroups limit setup
>
> Hi,
>
> our ARC/HTCondor setup follows the recommendations on the GridPP wiki. In particular, we use the RAL recipe [1] with slightly more restrictive values, 2x rather than 3x (if Andrew L hasn't changed it since then):
>
> RemoveMemoryUsage = ( ResidentSetSize_RAW > 2000*RequestMemory )
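For completeness, the way this macro usually gets wired in (a sketch following the RAL recipe; the exact SYSTEM_PERIODIC_REMOVE wiring on our nodes is an assumption here):

```
# Sketch: the macro only takes effect once it is referenced from
# SYSTEM_PERIODIC_REMOVE, which condor evaluates periodically.
RemoveMemoryUsage = ( ResidentSetSize_RAW > 2000*RequestMemory )
SYSTEM_PERIODIC_REMOVE = $(RemoveMemoryUsage)
```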
>
> we also have cgroups enabled
>
> # Enable CGROUP
> BASE_CGROUP = /system.slice/condor.service
> CGROUP_MEMORY_LIMIT = soft
>
> However, today a user managed to run jobs that were using 13-20 times the memory they requested, and the system didn't do anything.
>
> Am I doing something wrong? Should I also set specific limits in cgroups? At the moment I have no memory limit set for htcondor:
>
> systemctl show htcondor | grep -i mem
> MemoryCurrent=18446744073709551615
> MemoryAccounting=no
> MemoryLimit=18446744073709551615
> LimitMEMLOCK=65536
>
> Also, has anyone tried the cgroup accounting? That might be interesting.
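As a sketch of what reading that accounting could look like (the cgroup v1 path for our BASE_CGROUP is an assumption; a mock file stands in here so the snippet is self-contained):

```shell
#!/bin/sh
# Sketch: read the memory accounting the kernel keeps for condor's cgroup.
# On a real CentOS7 node the file would be (cgroup v1, assumed layout):
#   /sys/fs/cgroup/memory/system.slice/condor.service/memory.usage_in_bytes
# A mock file stands in here so the example runs anywhere.
USAGE_FILE=$(mktemp)
echo $((45 * 1024 * 1024 * 1024)) > "$USAGE_FILE"   # pretend 45 GiB in use
GIB=$(awk '{printf "%.1f", $1 / (1024 * 1024 * 1024)}' "$USAGE_FILE")
echo "condor.service memory: $GIB GiB"
rm -f "$USAGE_FILE"
```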
>
> thanks
>
> cheers
> alessandra
>
>
> [1] https://www.gridpp.ac.uk/wiki/Enable_Cgroups_in_HTCondor#RAL_Modifications
>
> --
> Respect is a rational process. \\//
> Fatti non foste a viver come bruti, ma per seguir virtute e canoscenza(Dante)
> For Ur-Fascism, disagreement is treason. (U. Eco)
> But but but her emails... covfefe!