On 2 Sep 2009, at 16:10, Maarten Litmaath wrote:
> Hi all,
> I have opened a bug about the WMS 3.2 job wrapper customization point
> issue reported by Emmanuel Medernach and Douglas McNab:
>
> https://savannah.cern.ch/bugs/index.php?55237
>
> Text for the known issues page:
>
> -----------------------------------------------------------------------
> Starting with glite-WMS version 3.1.20-0 the job wrapper has its first
> customization point moved _after_ the creation of the uniquely named
> working directory for the job. This means that the first
> customization
> script ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh (if present) must
> _not_
> move the job into a directory with a static name (e.g. "/scratch"),
> but ensure the new working directory (if any) has a unique name.
> It also means such a new directory may not be cleaned up
> automatically,
> but should be cleaned up in ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_3.sh
> (note the 3).
> -----------------------------------------------------------------------
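For what it's worth, a cp_1.sh along the lines the known-issues text
describes might look roughly like this (a sketch only; SCRATCH_BASE and
JW_SCRATCH_DIR are names I've invented, not anything gLite defines):

```shell
# Sketch of a cp_1.sh honouring the new rules: create a uniquely named
# scratch directory (mktemp -d guarantees uniqueness) instead of a
# static one like /scratch, export its name so a matching cp_3.sh can
# clean it up later, and move the job there.
# SCRATCH_BASE and JW_SCRATCH_DIR are invented names, not gLite ones.
SCRATCH_BASE="${SCRATCH_BASE:-/tmp}"
JW_SCRATCH_DIR=`mktemp -d "${SCRATCH_BASE}/jw_scratch.XXXXXXXXXX"`
export JW_SCRATCH_DIR
cd "${JW_SCRATCH_DIR}"
```

Since the script is sourced by the wrapper, it deliberately avoids
"exit"; any failure handling would have to fit the wrapper's own
conventions.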
After an SL5 upgrade of some worker nodes, I think I've spotted a
flaw in this plan.
The 3.1 job wrapper executes cp_3.sh _before_ attempting to copy the
Maradona file out of the working directory. This means that with the
above workaround in place, cp_3.sh deletes the Maradona file, and the
job wrapper then attempts to copy it back (which, of course, fails).
There's an exponential backoff and retry before the CE reports the
job as failed. If WMS resubmission is allowed, the job gets
resubmitted. Note that all of this occurs _after_ the job itself has
completed.
Following the workaround might work for the 3.2 job wrapper, but with
3.1 it will cause at minimum an hour's delay and prevent the Maradona
file from being retrieved. The slightest problem with the Condor
output, and the fallback is already gone - this might explain some of
the Maradona errors seen at some sites recently?
I'm not sure why we didn't spot this before - we've just upgraded
most of the cluster to SL5 worker nodes, so that might have some
subtle impact. (It's possible we've simply not run any short test jobs
in the interim - the extra hour's delay is the most noticeable part to
me...) It is also possible that there's some other explanation for
the problems.
I note that, assuming I'm correct in attributing the cause,
Maarten's suggestion on Savannah of moving the cp_3 customization
point into doExit (_after_ the Maradona copy) would resolve this.
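Until the wrapper itself is fixed, a defensive cp_3.sh could at least
avoid touching the wrapper's own work directory (and hence the
Maradona file), only removing a scratch directory it knows its own
cp_1.sh created. A sketch, assuming that cp_1.sh recorded the
directory in JW_SCRATCH_DIR (an invented name, not a gLite one); the
first two lines stand in for what cp_1.sh would have done:

```shell
# Stand-in for what our hypothetical cp_1.sh would have done: a
# uniquely named scratch directory, recorded in JW_SCRATCH_DIR.
JW_SCRATCH_DIR=`mktemp -d /tmp/jw_scratch.XXXXXXXXXX`
touch "${JW_SCRATCH_DIR}/leftover_file"

# The cp_3.sh part: remove only that scratch directory, and only if it
# matches the pattern cp_1.sh creates - never the wrapper's ${workdir},
# so the Maradona file survives until doExit copies it out.
case "${JW_SCRATCH_DIR}" in
    /tmp/jw_scratch.*) rm -rf "${JW_SCRATCH_DIR}" ;;
esac
```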
Has anyone else spotted this, or should I look elsewhere for the root
cause?
Details of the job wrapper are below, in case I've misread the
consequences.
The gLite 3.1 WMS job wrapper ends with:
-- Start extract --
# customization point
if [ -n "${GLITE_LOCAL_CUSTOMIZATION_DIR}" ]; then
    if [ -f "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_3.sh" ]; then
        . "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_3.sh"
    fi
fi
doExit 0
-- End extract --
and the function doExit is:
-- Start extract --
doExit() # 1 - status, # 2 - mode
{
    jw_status=$1
    jw_echo "jw exit status = ${jw_status}"
    retry_copy "globus-url-copy" "file://${workdir}/${maradona}" "${__maradonaprotocol}"
    globus_copy_status=$?
    rm -rf "../${newdir}"
    if [ ${jw_status} -eq 0 ]; then
        exit ${globus_copy_status}
    else
        exit ${jw_status}
    fi
}
-- End extract --
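For clarity, Maarten's suggestion would amount to something like the
following (my reconstruction, not an actual patch - the real fix may
read differently):

```shell
# Reconstruction of the suggested fix: source cp_3.sh inside doExit,
# *after* the Maradona file has been copied out, so a site cleanup
# script can no longer delete the file before retry_copy runs.
doExit() # 1 - status
{
    jw_status=$1
    jw_echo "jw exit status = ${jw_status}"
    retry_copy "globus-url-copy" "file://${workdir}/${maradona}" "${__maradonaprotocol}"
    globus_copy_status=$?
    # customization point, moved here from just before the doExit call
    if [ -n "${GLITE_LOCAL_CUSTOMIZATION_DIR}" ] &&
       [ -f "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_3.sh" ]; then
        . "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_3.sh"
    fi
    rm -rf "../${newdir}"
    if [ ${jw_status} -eq 0 ]; then
        exit ${globus_copy_status}
    else
        exit ${jw_status}
    fi
}
```

With this ordering, a cp_3.sh that removes the whole work area would
still be wrong, but an ordinary cleanup script can no longer race the
Maradona copy.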
To fill in the details of ${workdir}:
-- Start extract --
# customization point
if [ -n "${GLITE_LOCAL_CUSTOMIZATION_DIR}" ]; then
    if [ -f "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh" ]; then
        . "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh"
    fi
fi
#if [ ${__job_type} -eq 0 -o ${__job_type} -eq 3 ]; then # normal or interactive
newdir="${__jobid_to_filename}"
mkdir ${newdir}
cd ${newdir}
#elif [ ${__job_type} -eq 1 -o ${__job_type} -eq 2 ]; then # MPI (LSF or PBS)
#fi
tmpfile=`mktemp -q tmp.XXXXXXXXXX`
if [ ! -f "$tmpfile" ]; then
    fatal_error "Working directory not writable"
else
    rm "$tmpfile"
fi
unset tmpfile
workdir="`pwd`"
if [ -n "${__brokerinfo}" ]; then
    export GLITE_WMS_RB_BROKERINFO="`pwd`/${__brokerinfo}"
fi
maradona="${__jobid_to_filename}.output"
touch "${maradona}"
-- End extract --