On Fri, 23 May 2014, Mischa Salle wrote:
> Hi Valery,
>
> could you explain a little more precisely which OPS SAM test you
> mean?
It seems to me that the sensor runs two glexec calls in a row, the second
one to fetch the result of the first. The glexec epilogue removes the
result before the second glexec runs, so a glexec epilogue is useless in
this scenario.
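
To make the sequence concrete, here is a rough simulation of it. It does not call glexec at all: plain shell functions stand in for the two glexec calls and for the epilogue, and the function names, the file name and the temporary directory are made up for illustration.

```shell
#!/bin/sh
# Simulation only: shell functions stand in for the two glexec calls and
# for the glexec epilogue. All names and paths here are hypothetical.
jobdir=$(mktemp -d) || exit 1

first_call() {      # stands in for the first glexec: writes a result
    echo "probe OK" > "$jobdir/result"
}
epilogue() {        # stands in for the glexec epilogue: empties the job dir
    rm -rf "$jobdir"/*
}
second_call() {     # stands in for the second glexec: fetches the result
    cat "$jobdir/result" 2>/dev/null
}

first_call
epilogue            # runs as soon as the first call has finished
if out=$(second_call); then
    status="result found: $out"
else
    status="result missing"
fi
echo "$status"
rmdir "$jobdir"
```

The second call finds nothing because the cleanup is tied to the end of the first call rather than to the end of the whole job.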
> The more I think about it the less I understand why a Torque epilogue
> script, which is running as root, is not able to cleanup the directory?!
> I think I still don't really understand the original problem.
My goal is not to create a Torque epilogue, but to use the internal
Torque feature that creates a temporary directory at job start and
removes it at the end. This worked for years on our farm
before glexec was introduced.
>
> Cheers,
> Mischa
>
> On Fri, May 23, 2014 at 02:31:56PM +0400, Valery Mitsyn wrote:
>> Hi Mischa,
>>
>> unfortunately, it does not work, or rather it leads to errors in the
>> OPS SAM test. The problem is that the sensor tries
>> to send data to Nagios after the payload and the epilogue have
>> finished, i.e. at a moment when the result has already been removed.
>> My epilogue looks as follows:
>> {{{
>> #!/bin/sh
>> TMPBASEDIR=/scr/u
>> logfile=/tmp/glexec/glexec_epilogue.log
>> # Recreate the log file on every run, private to root
>> test -e "$logfile" || /bin/mkdir -m 700 -p "$(dirname "$logfile")"
>> rm -f "$logfile"
>> /bin/touch "$logfile" || exit 1
>> /bin/chmod 700 "$logfile" || exit 1
>> /bin/chown root:root "$logfile" || exit 1
>> # Do nothing unless all three variables are set
>> if test X"$GLEXEC_EPILOG_TARGET_USER" = "X" ; then
>>     echo "Warning: empty GLEXEC_EPILOG_TARGET_USER variable" >> "$logfile"
>>     exit 0
>> fi
>> if test X"$GLEXEC_EPILOG_GLEXEC_USER" = "X" ; then
>>     echo "Warning: empty GLEXEC_EPILOG_GLEXEC_USER variable" >> "$logfile"
>>     exit 0
>> fi
>> if test X"$PBS_JOBID" = "X" ; then
>>     echo "Warning: empty PBS_JOBID variable" >> "$logfile"
>>     exit 0
>> fi
>> # Clean the job directory as the target user; log stdout and stderr
>> /bin/su "$GLEXEC_EPILOG_TARGET_USER" -c \
>>     "/usr/sbin/tmpwatch -afq -U root 0m $TMPBASEDIR/$PBS_JOBID" \
>>     >> "$logfile" 2>&1
>> exit 0
>> }}}
>> It works, that is, it removes exactly what is required.
>> It seems to me that one way to avoid problems in SAM would be to run
>> "chown -Rrh --preserve-root ..." on the whole directory tree,
>> but that is less safe than tmpwatch.
>>
>> Does anyone have a better idea?
>>
>> On Tue, 20 May 2014, Mischa Salle wrote:
>>
>>> On Tue, May 20, 2014 at 05:43:14PM +0400, Valery Mitsyn wrote:
>>>> Yes, two questions here:
>>>>
>>>> 1) is there a sample script to remove the working directory?
>>>> I'm afraid to experiment with a fully loaded farm.
>>>>
>>>> 2) is it safe to rerun YAIM for glexec, or could one set some
>>>> variables in YAIM's config to set up the epilogue parameters in glexec.conf?
>>>
>>> No in both cases. Concerning YAIM, it's basically too site-specific
>>> to give general guidelines. And YAIM was written (long ago) with the
>>> idea of always fully replacing the existing files, with no merging.
>>>
>>> Normally it should be sufficient to only add one line to the
>>> glexec.conf:
>>> epilogue = <path-of-epilogue script>
>>> The file must be 'trusted', i.e. only writable for the epilogue user
>>> (root).
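
A rough way to check the 'trusted' condition before pointing glexec.conf at a script could look like the sketch below. It only approximates the rule quoted above (owned by root, not writable by group or others). The helper name is made up, the checks gLExec itself performs may well differ, and `stat -c` and `find -perm /022` assume GNU tools.

```shell
#!/bin/sh
# Hypothetical helper, not part of gLExec: succeeds only if the file exists,
# is owned by uid 0, and has no group or other write bit set.
is_trusted() {
    f=$1
    [ -f "$f" ] || return 1
    owner=$(stat -c %u "$f" 2>/dev/null) || return 1   # GNU stat assumed
    [ "$owner" = "0" ] || return 1
    # find prints the path only if a group/other write bit is set
    [ -z "$(find "$f" -maxdepth 0 -perm /022 2>/dev/null)" ]
}
```

It could be called as `is_trusted /path/to/epilogue || echo "not trusted"`, where the path is whatever is configured on the `epilogue =` line.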
>>>
>>> Writing such a script should not be difficult. What you could do first
>>> is write a test script that just echoes to a file what it would
>>> eventually do, check that that is indeed the correct command, and only
>>> then run a real version.
>>> So something like:
>>> #!/bin/sh
>>> # EXAMPLE ONLY, PLEASE ADAPT BEFORE USING
>>> logfile=/var/log/glexec/glexec_epilogue.log
>>> # Create log file and directory when needed
>>> if [ ! -e "$logfile" ]; then
>>>     mkdir -m 700 -p "$(dirname "$logfile")" && \
>>>     touch "$logfile" || \
>>>     exit 1
>>> fi
>>> # Check we have the target user
>>> if [ -z "$GLEXEC_EPILOG_TARGET_USER" ]; then
>>>     echo "Warning: empty GLEXEC_EPILOG_TARGET_USER variable" >> "$logfile"
>>>     exit 0
>>> fi
>>> # Remove the custom user directory (dry run: the command is only logged)
>>> userdir=/tmp/userdir/$GLEXEC_EPILOG_TARGET_USER
>>> if [ -d "$userdir" ]; then
>>>     echo "Removing user directory \"$userdir\"" >> "$logfile"
>>>     echo rm -rf "$userdir" >> "$logfile"
>>> else
>>>     echo "User dir \"$userdir\" does not exist" >> "$logfile"
>>> fi
>>>
>>> On Tue, May 20, 2014 at 03:57:13PM +0200, Maarten Litmaath wrote:
>>>> As Mischa wrote, the gLExec epilogue script should help a lot with that,
>>>> but you anyway need something like a cron job that runs often to clean up
>>>> junk left behind by jobs that crashed or were killed by the batch system.
>>> That's partially true, but as long as gLExec runs in linger mode and is
>>> not directly sent a SIGKILL by the batch system, the epilogue should
>>> run, so also when a job is killed by the batch system, or when its
>>> payload crashes.
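
For the remaining cases, a periodic cleanup of the kind Maarten mentions might be sketched as follows. This is only an illustration under stated assumptions: every job is assumed to have its own directory directly under one base directory, a directory untouched for more than the given number of days is assumed to be stale, and the function name is made up. `-mindepth` and `-maxdepth` are not POSIX but are available in GNU and BSD find.

```shell
#!/bin/sh
# Sketch of a periodic cleanup for leftover per-job directories.
# Assumption: anything directly under $1 whose mtime is more than $2 days
# old belongs to a job that is long gone.
clean_stale() {
    base=$1
    days=$2
    [ -d "$base" ] || return 1
    # Only look at direct children of $base, never at $base itself.
    find "$base" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" \
        -exec rm -rf -- {} +
}
```

Run from cron as root it could be called as `clean_stale /scr/u 7`. The threshold should be longer than the maximum walltime, otherwise the directory of a long-running job that has not been written to recently could be removed underneath it.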
>>>
>>> Cheers,
>>> Mischa
>>>
>>>
>>
>> --
>> Best regards,
>> Valery Mitsyn
>
>
--
Best regards,
Valery Mitsyn