Hi Andrew,
That's right: in my initial tests there were no obvious signs of problems. I've tried to find a way to reproduce the problem on demand but haven't been able to so far.
We're possibly still one of the few sites using pid namespaces, which is perhaps why others haven't noticed any problems. BTW, in our case we don't need to worry about cleaning up /tmp, as each job has its own /tmp (due to the magic of namespaces & bind mounts).
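In case it's useful to others, the per-job /tmp trick looks roughly like this (a minimal sketch, not our actual batch setup; the job directory path and payload name are illustrative, and it needs root or an unprivileged user namespace):

```shell
#!/bin/sh
# Sketch: give a job a private /tmp via a mount namespace + bind mount.
# Anything the job writes to /tmp actually lands in its own scratch dir,
# so removing that dir after the job cleans up /tmp automatically.
JOBDIR=/pool/condor/dir_12345/tmp   # illustrative per-job scratch dir
mkdir -p "$JOBDIR"

unshare --mount sh -c "
  mount --make-rprivate / &&        # stop mounts propagating back out
  mount --bind '$JOBDIR' /tmp &&    # this namespace now sees JOBDIR as /tmp
  exec /path/to/job-payload         # illustrative job executable
"
# Once the last process in the namespace exits, the bind mount vanishes
# and 'rm -rf $JOBDIR' clears everything the job left in its /tmp.
```

Other processes on the node still see the real /tmp; only the job (and its children, which inherit the namespace) see the bind-mounted one.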
Regards,
Andrew.
________________________________________
From: Testbed Support for GridPP member institutes [[log in to unmask]] on behalf of Andrew McNab [[log in to unmask]]
Sent: Wednesday, May 25, 2016 11:19 AM
To: [log in to unmask]
Subject: Re: HTCondor Machine/Job Features testing?
> On 25 May 2016, at 10:32, Andrew Lahiff <[log in to unmask]> wrote:
>
> Hi,
>
> Have any UK sites which have installed this experienced problems? We had deployed this across all worker nodes yesterday but had to disable it last night due to it causing significant problems to ATLAS jobs. We were getting lots of errors like this in the ShadowLog:
>
> 05/24/16 23:11:56 (12080141.0) (2494821): ERROR "Error from [log in to unmask]: Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly" at line 562 in file /slots/08/dir_7758/userdir/.tmpJ7h9JP/BUILD/condor-8.4.6/src/condor_shadow.V6.1/pseudo_ops.cpp
>
> And in the starter logs errors like this:
>
> 05/24/16 23:11:56 (pid:2560243) JobReaper: condor_pid_ns_init didn't drop filename /pool/condor/dir_2560243/.condor_pid_ns_status (2)
> 05/24/16 23:11:56 (pid:2560243) ERROR "Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly" at line 765 in file /slots/01/dir_4917/userdir/.tmpKEQAWF/BUILD/condor-8.4.4/src/condor_starter.V6.1/vanilla_proc.cpp
>
> The same jobs were then being run over and over again, with SYSTEM_PERIODIC_REMOVE unable to kill them for some reason.
>
> I think the problem (or at least part of it) is that /usr/sbin/mjf-job-wrapper doesn't satisfy the requirements for a USER_JOB_WRAPPER: "This wrapper script must ultimately replace its image with the user job; thus, it must exec() the user job, not fork() it."
Sorry about this. I based the script on a section of the admin guide about USER_JOB_WRAPPER that doesn't mention that requirement. I'll issue updated scripts/RPMs in which the wrapper script execs the real job, and we can rely on the OS to clean up the small jobfeatures directories created in /tmp rather than removing them explicitly in the wrapper after the job finishes.
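For anyone following along, the fixed wrapper will be along these lines (a minimal sketch only; the environment variables shown are the machine/job features ones, but the exact paths are illustrative rather than what the RPM will ship):

```shell
#!/bin/sh
# Sketch of an exec-based USER_JOB_WRAPPER for HTCondor.
# HTCondor invokes the wrapper as:  wrapper <job-executable> <job-args...>

# Point the job at the machine/job features directories
# (illustrative locations, not necessarily the packaged ones).
export MACHINEFEATURES=/tmp/machinefeatures
export JOBFEATURES=/tmp/jobfeatures

# Crucially: replace this process image with the user job via exec,
# rather than forking it and doing cleanup afterwards. With exec, the
# process tree is what the starter expects, so the PID-namespace
# bookkeeping done by condor_pid_ns_init still works.
exec "$@"
```

The key difference from the broken version is that nothing runs after the job: any post-job cleanup has to happen elsewhere, which is why the /tmp tidy-up moves out of the wrapper.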
> I'm not sure why only ATLAS jobs seemed to be affected (ATLAS SAM tests were fine, however).
I’m guessing the ones that ran on the machines you did the initial tests on were ok too? Just luck about which jobs maybe?
Thanks,
Andrew
> I think the safest thing for us to do is to add (most of) the contents of /usr/sbin/mjf-job-wrapper to our ARC ENV/GLITE runtime environment, rather than try to use a job wrapper. But for the moment at least we have mjf disabled completely.
>
> Thanks,
> Andrew.
>
> ________________________________________
> From: Testbed Support for GridPP member institutes [[log in to unmask]] on behalf of Andrew McNab [[log in to unmask]]
> Sent: Wednesday, May 04, 2016 2:46 PM
> To: [log in to unmask]
> Subject: HTCondor Machine/Job Features testing?
>
> Hi,
>
> It would be really helpful to have a couple of UK sites who are running HTCondor and would be prepared to test the new Machine/Job Features scripts. It just involves installing an RPM and making a one-line change to your HTCondor config. Any volunteers?
>
> Thanks,
>
> Andrew
>
> --
> Dr Andrew McNab
> University of Manchester High Energy Physics,
> LHCb@CERN (Distributed Computing Coordinator),
> and GridPP (LHCb + Tier-2 Evolution)
> www.hep.manchester.ac.uk/u/mcnab
> Skype: andrew.mcnab.uk
Cheers
Andrew
--
Dr Andrew McNab
University of Manchester High Energy Physics,
LHCb@CERN (Distributed Computing Coordinator),
and GridPP (LHCb + Tier-2 Evolution)
www.hep.manchester.ac.uk/u/mcnab
Skype: andrew.mcnab.uk