We should push for the use of TMPDIR instead of this contorted mechanism.
http://savannah.cern.ch/bugs/?19138
They don't want to know about it, but perhaps if enough people ask for it
they will think about it in more depth.
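
To make the idea concrete, here is a rough sketch of what I mean, not what
the job wrapper does today. It assumes the batch system exports TMPDIR
pointing at a per-job scratch area that it cleans up itself;
make_job_workdir and the job.XXXXXXXXXX template are names made up for
illustration.

```shell
# Hypothetical helper (not part of the gLite job wrapper): pick the job's
# working directory. Uses TMPDIR when the batch system provides it and
# falls back to the current directory otherwise.
make_job_workdir() {
    base="${TMPDIR:-$(pwd)}"
    # mktemp -d creates a uniquely named directory and prints its path;
    # the template name is illustrative only.
    mktemp -d "${base}/job.XXXXXXXXXX"
}
```

The wrapper would then do something like cd "$(make_job_workdir)". If TMPDIR
really is per-job and cleaned by the batch system, a site would probably need
neither cp_1.sh to relocate the job nor cp_3.sh to clean up after it.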
cheers
alessandra
Stuart Purdie wrote:
>
> On 2 Sep 2009, at 16:10, Maarten Litmaath wrote:
>
>> Hi all,
>> I have opened a bug about the WMS 3.2 job wrapper customization point
>> issue reported by Emmanuel Medernach and Douglas McNab:
>>
>> https://savannah.cern.ch/bugs/index.php?55237
>>
>> Text for the known issues page:
>>
>> -----------------------------------------------------------------------
>> Starting with glite-WMS version 3.1.20-0 the job wrapper has its first
>> customization point moved _after_ the creation of the uniquely named
>> working directory for the job. This means that the first customization
>> script ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh (if present) must _not_
>> move the job into a directory with a static name (e.g. "/scratch"),
>> but ensure the new working directory (if any) has a unique name.
>> It also means such a new directory may not be cleaned up automatically,
>> but should be cleaned up in ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_3.sh
>> (note the 3).
>> -----------------------------------------------------------------------
>
> After an SL5 upgrade of some worker nodes, I think I've noticed a
> flaw with the plan.
>
> The 3.1 Job wrapper executes cp_3.sh _before_ attempting to copy the
> Maradona file out of the work directory. This means that with the
> above in place, cp_3.sh deletes the Maradona file, then the job
> wrapper attempts to copy it back (which, of course, fails). There's
> an exponential backoff and retry, before the CE reports the job as
> failing. If WMS resubmission is allowed, the job gets resubmitted.
> Note that this occurs _after_ the job has completed.
>
> Following the workaround might work for the 3.2 job wrapper, but will
> cause at minimum an hour delay with 3.1, and not let it get the
> Maradona file back. The slightest problem with the Condor output, and
> the fall back is already gone - this might explain some of the
> Maradona errors at some sites recently?
>
> I'm not sure why we didn't spot this before - we've just upgraded most
> of the cluster to SL5 worker nodes, so that might have some subtle
> impact. (It's possible we've just not run any short test jobs in the
> interim - the extra hour delay is the most noticeable part to me...).
> It is possible that there's some other explanation causing the problems.
>
> I note that, assuming I'm correct in attributing the source, Maarten's
> suggestion on savannah of moving the cp_3 into doExit (_after_ the
> Maradona copy) would resolve this.
>
> Has anyone else spotted this, or should I look elsewhere for the root
> cause?
>
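
One possible stopgap until the wrapper is fixed (untested on a real worker
node, so treat it as a sketch): since cp_3.sh is sourced by the job wrapper,
it could register the cleanup as an EXIT trap instead of deleting the
directory straight away, so the removal only happens after doExit has had
its chance to copy the Maradona file. This assumes the wrapper runs under
bash and does not install its own EXIT trap, which I have not checked.
SITE_SCRATCH_DIR, cp_3_deferred_cleanup and simulate_wrapper_end are
hypothetical names, and the plain cp stands in for globus-url-copy.

```shell
# Hypothetical cp_3.sh body: defer the cleanup to shell exit.
# SITE_SCRATCH_DIR is an assumed variable that cp_1.sh would have set to
# the uniquely named directory it created.
cp_3_deferred_cleanup() {
    # Single quotes: the variable is expanded when the trap fires.
    trap '[ -n "${SITE_SCRATCH_DIR}" ] && cd / && rm -rf "${SITE_SCRATCH_DIR}"' EXIT
}

# Stand-in for the tail of the 3.1 wrapper: cp_3 first, Maradona copy second.
# The body runs in a subshell so the EXIT trap fires when it finishes.
simulate_wrapper_end() (
    SITE_SCRATCH_DIR=$1
    dest=$2
    cd "${SITE_SCRATCH_DIR}" || exit 1
    touch maradona.output            # placeholder for the Maradona file
    cp_3_deferred_cleanup            # what a sourced cp_3.sh would do
    cp maradona.output "${dest}/"    # placeholder for the globus-url-copy step
    exit $?
)
```

Calling simulate_wrapper_end with a scratch directory and a destination
directory should leave the copied file in the destination and remove the
scratch directory afterwards.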
>
> Details of the Job Wrapper below, in case I've made an error in
> reading the consequences of this.
>
> The gLite 3.1 WMS job wrapper ends with:
>
> -- Start extract --
>
> # customization point
> if [ -n "${GLITE_LOCAL_CUSTOMIZATION_DIR}" ]; then
> if [ -f "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_3.sh" ]; then
> . "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_3.sh"
> fi
> fi
>
> doExit 0
>
> -- End extract --
>
>
> And the function doExit is:
>
> -- Start extract --
>
> doExit() # 1 - status, # 2 - mode
> {
> jw_status=$1
>
> jw_echo "jw exit status = ${jw_status}"
>
> retry_copy "globus-url-copy" "file://${workdir}/${maradona}" "${__maradonaprotocol}"
> globus_copy_status=$?
>
> rm -rf "../${newdir}"
>
> if [ ${jw_status} -eq 0 ]; then
> exit ${globus_copy_status}
> else
> exit ${jw_status}
> fi
> }
>
> -- End extract --
>
>
> To fill in the details of ${workdir}:
>
> -- Start extract --
> # customization point
> if [ -n "${GLITE_LOCAL_CUSTOMIZATION_DIR}" ]; then
> if [ -f "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh" ]; then
> . "${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh"
> fi
> fi
>
> #if [ ${__job_type} -eq 0 -o ${__job_type} -eq 3 ]; then # normal or interactive
> newdir="${__jobid_to_filename}"
> mkdir ${newdir}
> cd ${newdir}
> #elif [ ${__job_type} -eq 1 -o ${__job_type} -eq 2 ]; then # MPI (LSF or PBS)
> #fi
>
> tmpfile=`mktemp -q tmp.XXXXXXXXXX`
> if [ ! -f "$tmpfile" ]; then
> fatal_error "Working directory not writable"
> else
> rm "$tmpfile"
> fi
> unset tmpfile
>
> workdir="`pwd`"
>
> if [ -n "${__brokerinfo}" ]; then
> export GLITE_WMS_RB_BROKERINFO="`pwd`/${__brokerinfo}"
> fi
>
> maradona="${__jobid_to_filename}.output"
> touch "${maradona}"
>
> -- End extract --
>
--
Mindmelds. The last time I heard the words "my mind to your mind", I had a headache for two weeks. (Janeway, ST Voyager)
Northgrid Tier2 Technical Coordinator
http://www.hep.manchester.ac.uk/computing/tier2