Maarten Litmaath wrote:
> The WMS job wrapper always tries a mkdir and then cd into the directory,
> but will continue when either operation fails. Does /var/log/messages
> show any problems for the "/dlocal" file system?
>
> Are there any errors under ~sgmali020/.globus/job/*/*?
Yes, on the same worker node there are messages in stderr:
pwd
/users/lcg/sgmali020/.globus/job/nanlcg01.in2p3.fr
find . -name stderr -size 0 | wc -l
260
find . -name stderr -not -size 0 | wc -l
24
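To repeat this check on another worker node, the two find commands above can be wrapped in a small helper. This is only a sketch: the function name count_stderr is made up for illustration, and it assumes it is pointed at a per-CE job directory like the one shown by pwd above.

```sh
#!/bin/sh
# Hypothetical helper: count empty and non-empty stderr files under a
# job directory (defaults to the current directory).
count_stderr() {
    dir=${1:-.}
    # tr strips the padding that some wc implementations add
    empty=$(find "$dir" -type f -name stderr -size 0 | wc -l | tr -d ' ')
    nonempty=$(find "$dir" -type f -name stderr ! -size 0 | wc -l | tr -d ' ')
    echo "empty=$empty nonempty=$nonempty"
}
```

Calling count_stderr with no argument inspects the current directory, which matches how the commands were run here.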
And the stdout and stderr look like this:
more ./13969.1239792844/stdout
lcg-jobwrapper-hook.sh not readable
Take token:
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000003:LM=000000:LRMS=000004:APP=000000:LBS=000000
Job has been terminated by the batch system (SIGTERM)
jw exit status = 1
more ./13969.1239792844/stderr
/users/lcg/sgmali020/.globus/.gass_cache/local/md5/53/6b5744d385fc42f37ff06770a4c4d9/md5/15/326d224da16bedf7ab303c42c54ba8/data:
line 66: : No such file or directory
chmod: changing permissions of `/bin/sh': Operation not permitted
Terminated
/users/lcg/sgmali020/.globus/.gass_cache/local/md5/53/6b5744d385fc42f37ff06770a4c4d9/md5/15/326d224da16bedf7ab303c42c54ba8/data:
line 66:
https_3a_2f_2fgrid02.lal.in2p3.fr_3a9000_2fASwkbTuA6gHv8oTVI9HgfA.output:
No such file or directory
/usr/remote/public/GLITE-3_1_26-0/WN/glite/bin/glite-lb-logevent:
edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error
(edg_wll_LogEvent():
LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:
LB server (bkserver,lbproxy) store protocol error;;
edg_wll_DoLogEvent(): edg_wll_log_connect error
GSSAPI Error;; edg_wll_gss_connect();; GSS Error: GSS failure occured:
GSS Major Status: General failure
(GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with gss context
globus_gsi_gssapi: Error with gss credential handle
globus_credential: Valid credentials could not be found in any of the
possible locations specified by the credential search order.
Valid credentials could not be found in any of the possible locations
specified
by the credential search order.
Attempt 1
globus_credential: Error reading host credential
globus_sysconfig: Could not find a valid certificate file: The host cert
could not be found in:
1) env. var. X509_USER_CERT
2) /etc/grid-security/hostcert.pem
3) $GLOBUS_LOCATION/etc/hostcert.pem
4) $HOME/.globus/hostcert.pem
The host key could not be found in:
1) env. var. X509_USER_KEY
2) /etc/grid-security/hostkey.pem
3) $GLOBUS_LOCATION/etc/hostkey.pem
4) $HOME/.globus/hostkey.pem
Attempt 2
globus_credential: Error reading proxy credential
globus_sysconfig: Could not find a valid proxy certificate file location
globus_sysconfig: Error with key filename
globus_sysconfig: File does not exist:
/users/lcg/sgmali020/.globus/job/nanlcg01.in2p3.fr/13969.1239792844/x509_up
is not a valid file
Attempt 3
globus_credential: Error reading user credential
globus_sysconfig: Error with certificate filename: The user cert could
not be found in:
1) env. var. X509_USER_CERT
2) $HOME/.globus/usercert.pem
3) $HOME/.globus/usercred.p12
/users/lcg/sgmali020/.globus/.gass_cache/local/md5/53/6b5744d385fc42f37ff06770a4c4d9/md5/15/326d224da16bedf7ab303c42c54ba8/data:
line 66:
https_3a_2f_2fgrid02.lal.in2p3.fr_3a9000_2fASwkbTuA6gHv8oTVI9HgfA.output:
No such file or directory
/usr/remote/public/GLITE-3_1_26-0/WN/glite/bin/glite-lb-logevent:
edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error
(edg_wll_LogEvent():
LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:
LB server (bkserver,lbproxy) store protocol error;;
edg_wll_DoLogEvent(): edg_wll_log_connect error
GSSAPI Error;; edg_wll_gss_connect();; GSS Error: GSS failure occured:
GSS Major Status: General failure
(GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with gss context
globus_gsi_gssapi: Error with gss credential handle
globus_credential: Valid credentials could not be found in any of the
possible locations specified by the credential search order.
Valid credentials could not be found in any of the possible locations
specified
by the credential search order.
[...]
rm: cannot remove `/dev/null': Permission denied
rm: cannot remove `/dev/null': Permission denied
> That looks like a user payload error. The WMS wrapper does the following
> when it ends:
>
> rm -rf "../${newdir}"
>
> Here ${newdir} looks like "https_3a_2f_2f.....".
>
> I hope $GLITE_LOCAL_CUSTOMIZATION_DIR/cp_1.sh does not redefine it?!
Here is my current cp_1.sh:
#!/bin/sh
#d=`mktemp -d /dlocal/job-XXXXXXXXXX` || exit
d=`mktemp -d /dlocal/job-XXXXXXXXXX` || exit
#d='/dlocal'
#export HOME=$d
cd $d || exit
The previous one was:
#!/bin/sh
#d=`mktemp -d /dlocal/job-XXXXXXXXXX` || exit
d='/dlocal'
#export HOME=$d
cd $d || exit
But I decided to use a "job-XXX" directory as protection in case the
removal of the "https..." directories starts one level up.
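
A slightly more defensive variant of the current cp_1.sh could look like the sketch below. The /dlocal base and the job-XXXXXXXXXX template come from the script above. The JOB_SCRATCH_BASE variable and the make_job_dir function name are hypothetical and exist only so the logic can be tried outside a worker node. The sketch also assumes, as the cd in the original suggests, that the job wrapper sources this file rather than running it in a subshell.

```sh
#!/bin/sh
# Sketch only: JOB_SCRATCH_BASE and make_job_dir are illustrative names,
# not part of the WMS job wrapper.
make_job_dir() {
    base=${JOB_SCRATCH_BASE:-/dlocal}
    # Refuse to continue if the scratch area is missing or not writable
    [ -d "$base" ] && [ -w "$base" ] || return 1
    # Create a private per-job directory under the scratch area
    d=$(mktemp -d "$base/job-XXXXXXXXXX") || return 1
    # Quote the path so unusual characters cannot split it
    cd "$d" || return 1
}

# In cp_1.sh itself one would then call:
#   make_job_dir || exit
```

The explicit directory test means a missing or read-only /dlocal is caught before mktemp runs, which may make failures easier to spot in the job's stderr.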
Please note that the extract from stderr above indicates that an
attempt was made to remove /dev/null...
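
Since the wrapper's final step is quoted as rm -rf "../${newdir}", one way to picture a safer cleanup is a guard that refuses to delete anything unless the name is non-empty and points at a real directory directly under the parent. This is a hypothetical sketch, not the actual WMS wrapper code: the function name safe_cleanup is made up, and nothing in the stderr above proves that ${newdir} was the variable involved in the /dev/null attempt.

```sh
#!/bin/sh
# Hypothetical guard around the wrapper's "rm -rf ../${newdir}" step.
safe_cleanup() {
    name=$1
    # An empty name would turn the target into "../", the parent itself
    [ -n "$name" ] || { echo "cleanup: empty directory name" >&2; return 1; }
    # Reject names containing a slash, and the special names "." and ".."
    case "$name" in
        */*|.|..) echo "cleanup: suspicious name '$name'" >&2; return 1 ;;
    esac
    # Only remove real directories, never symlinks or device files
    if [ -d "../$name" ] && [ ! -L "../$name" ]; then
        rm -rf -- "../$name"
    else
        echo "cleanup: ../$name is not a directory, skipped" >&2
        return 1
    fi
}
```

With such a guard, a target like /dev/null would be skipped with a message instead of being handed to rm.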
Thanks very much for your help.
JM
--
------------------------------------------------------------------------
Jean-michel BARBET | Tel: +33 (0)2 51 85 84 86
Laboratoire SUBATECH Nantes France | Fax: +33 (0)2 51 85 84 79
CNRS-IN2P3/Ecole des Mines/Universite | E-Mail: [log in to unmask]
------------------------------------------------------------------------