Hi Andrew,
I've not seen this problem here. But I did replace our job wrapper with the following, to get around the exec() issues :-
#!/bin/bash
# Ensure GLITE environment variables are set
if [[ -z "$GLITE_ENV_SET" && -n "$GRID_GLOBAL_JOBID" ]]
then
. /etc/arc/runtime/ENV/GLITE
fi
# mjf-job-wrapper script for Machine/Job Features on HTCondor
#
export JOBFEATURES=`/usr/sbin/make-jobfeatures`
# This variable should be coming from /etc/profile.d/mjf.[c]sh too!
if [ -d /etc/machinefeatures ] ; then
export MACHINEFEATURES=/etc/machinefeatures
fi
"$@"
# We tidy these up. They are in /tmp by default so we could just
# leave them to be removed by the system instead.
if [ -d "$JOBFEATURES" ] ; then
rm -Rf "$JOBFEATURES"
fi
Regards,
Ian
-----Original Message-----
From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of Andrew Lahiff
Sent: 25 May 2016 10:33
To: [log in to unmask]
Subject: Re: HTCondor Machine/Job Features testing?
Hi,
Have any UK sites which have installed this experienced problems? We deployed this across all worker nodes yesterday but had to disable it last night because it was causing significant problems for ATLAS jobs. We were getting lots of errors like this in the ShadowLog:
05/24/16 23:11:56 (12080141.0) (2494821): ERROR "Error from [log in to unmask]: Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly" at line 562 in file /slots/08/dir_7758/userdir/.tmpJ7h9JP/BUILD/condor-8.4.6/src/condor_shadow.V6.1/pseudo_ops.cpp
And in the starter logs errors like this:
05/24/16 23:11:56 (pid:2560243) JobReaper: condor_pid_ns_init didn't drop filename /pool/condor/dir_2560243/.condor_pid_ns_status (2)
05/24/16 23:11:56 (pid:2560243) ERROR "Starter configured to use PID NAMESPACES, but libexec/condor_pid_ns_init did not run properly" at line 765 in file /slots/01/dir_4917/userdir/.tmpKEQAWF/BUILD/condor-8.4.4/src/condor_starter.V6.1/vanilla_proc.cpp
The same jobs were then being run over and over again, with SYSTEM_PERIODIC_REMOVE unable to kill them for some reason.
I think the problem (or at least part of it) is that /usr/sbin/mjf-job-wrapper doesn't satisfy the requirements for a USER_JOB_WRAPPER: "This wrapper script must ultimately replace its image with the user job; thus, it must exec() the user job, not fork() it."
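For comparison, a minimal wrapper satisfying that exec() requirement might look like the sketch below. This is a hypothetical illustration, not the shipped mjf-job-wrapper: the variable setup mirrors the script quoted above, but the job is started with exec so the wrapper's process image is replaced. Note that this means no cleanup code can run after the job, so the $JOBFEATURES directory would have to be removed by the system (it lives in /tmp by default) or by a separate epilogue:

```shell
cat > /tmp/mjf-exec-wrapper.sh <<'EOF'
#!/bin/bash
# Hypothetical exec()-style MJF wrapper sketch.
# Export JOBFEATURES only if make-jobfeatures exists and succeeds.
JOBFEATURES=$(/usr/sbin/make-jobfeatures 2>/dev/null) && export JOBFEATURES
if [ -d /etc/machinefeatures ]; then
    export MACHINEFEATURES=/etc/machinefeatures
fi
# exec replaces this shell with the user job, as USER_JOB_WRAPPER
# requires; nothing after this line would ever run.
exec "$@"
EOF
chmod +x /tmp/mjf-exec-wrapper.sh
# Demonstrate that arguments are passed through to the exec'd job:
/tmp/mjf-exec-wrapper.sh /bin/echo "job ran"
```

The trade-off is exactly the cleanup step in the script above: with exec there is no "after the job", so tidying $JOBFEATURES has to happen outside the wrapper.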
I'm not sure why only ATLAS jobs seemed to be affected (ATLAS SAM tests were fine, however). I think the safest thing for us to do is to add (most of) the contents of /usr/sbin/mjf-job-wrapper to our ARC ENV/GLITE runtime environment, rather than try to use a job wrapper. But for the moment at least we have mjf disabled completely.
Thanks,
Andrew.
________________________________________
From: Testbed Support for GridPP member institutes [[log in to unmask]] on behalf of Andrew McNab [[log in to unmask]]
Sent: Wednesday, May 04, 2016 2:46 PM
To: [log in to unmask]
Subject: HTCondor Machine/Job Features testing?
Hi,
It would be really helpful to have a couple of UK sites running HTCondor who would be prepared to test the new Machine/Job Features scripts. It just involves installing an RPM and making a one-line change to your HTCondor config. Any volunteers?
Thanks,
Andrew
--
Dr Andrew McNab
University of Manchester High Energy Physics, LHCb@CERN (Distributed Computing Coordinator), and GridPP (LHCb + Tier-2 Evolution) www.hep.manchester.ac.uk/u/mcnab
Skype: andrew.mcnab.uk