Hi Zvi,
> I have basically played with the advice in
> https://wiki.egi.eu/wiki/Tools/Manuals/SiteProblemsFollowUp
> item TS81 (Workload management) at :
> https://wiki.egi.eu/wiki/Tools/Manuals/TS81 (Error = Cannot take token)
See further down in this message.
> according to TS81 - ( which I do not understand much and request some explanation) ,
> I add to my simple jdl program the statement : ShallowRetryCount = -1
Note: that workaround should not be needed if the worker node has the
"uberftp" command installed and a correct Globus setup; see below.
> With this statement added , I run again :
> glite-wms-job-submit -e https://wms-ce.haifa.il.ibm.com:7443/glite_wms_wmproxy_server -r eladby-temp.haifa.il.ibm.com:8443/cream-pbs-kzvo -a ex.jdl
Can you show us the JDL file? I bet it does not contain an input sandbox,
which would make the job fail at a later stage than a typical job does.
Can you try with this JDL file instead:
-----------------------------------------------------------------------------
Type = "Job";
JobType = "Normal";
Executable = "/bin/hostname";
StdOutput = "hello.out";
StdError = "hello.err";
InputSandbox = {"/etc/group"};
OutputSandbox = {"hello.out","hello.err"};
RetryCount = 0;
ShallowRetryCount = 0;
-----------------------------------------------------------------------------
If the submission fails, send us the output of the submission command.
If it works, wait again for the job to reach a "final" state and send us
the output of: glite-wms-job-logging-info -v 2 .....
Let's look further at your test job:
> [...]
>
> ======================= glite-wms-job-status Success =====================
> BOOKKEEPING INFORMATION:
>
> Current Status: Running
> Status Reason: unavailable
> Destination: eladby-temp.haifa.il.ibm.com:8443/cream-pbs-kzvo
> Submitted: Thu Jun 20 20:26:20 2013 IDT
> ==========================================================================
As far as the WMS is concerned, the job is still running. Since you show
further below that the job has actually finished as far as CREAM is
concerned, the ICE daemon on the WMS (glite-wms-ice) apparently could not
get status updates from the CREAM service: please check
/var/log/wms/ice.log* for errors.
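If you want to scan those logs in one go, here is a minimal Python sketch.
It is my own helper, not part of the WMS, and it assumes that the ICE logs
are plain text and that problem lines contain the word ERROR or FATAL; I
have not checked the exact log format, so adjust the keywords as needed.

```python
import glob

def find_error_lines(pattern="/var/log/wms/ice.log*", keywords=("ERROR", "FATAL")):
    """Return (path, line number, line) for every line containing a keyword.

    Assumption: the logs are plain text and flag problems with ERROR/FATAL.
    """
    hits = []
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if any(k in line for k in keywords):
                    hits.append((path, lineno, line.rstrip("\n")))
    return hits

if __name__ == "__main__":
    for path, lineno, line in find_error_lines():
        print(f"{path}:{lineno}: {line}")
```

Run it on the WMS host as a user who can read /var/log/wms.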
> Then:
>
> [dubi@ui ~]$ glite-wms-job-logging-info -v 2 https://wms-ce.haifa.il.ibm.com:9000/m6beChpPi9-8-n0UrKzDgA
>
> [...]
> ---
> Event: Done
> - Arrived = Thu Jun 20 20:26:33 2013 IDT
> - Exit code = 499467184
> - Host = matlab.haifa.il.ibm.com
> - Reason = job completed
> - Source = LRMS
> - Status code = OK
> - Timestamp = Thu Jun 20 20:26:32 2013 IDT
> - User = /DC=org/DC=terena/DC=tcs/C=IL/O=IUCC/CN=Zvi Dubitzki
Note: the source of that Done event was the "LRMS" (a misleading name),
which actually means the job wrapper script. Unfortunately the WMS
_cannot_ rely on that event and therefore ignores it (we can discuss
the reason in another thread). Instead, the WMS waits for the
_LogMonitor_ to log the Done event, but the last event logged by that
daemon was the transfer to CREAM:
> ---
> Event: Transfer
> - Arrived = Thu Jun 20 20:26:26 2013 IDT
> - Dest host = https://eladby-temp.haifa.il.ibm.com:8443/ce-cream/services/CREAM2
> - Dest instance = unavailable
> - Dest jobid = https://eladby-temp.haifa.il.ibm.com:8443/CREAM267019922
> - Destination = LRMS
> - Host = wms-ce.haifa.il.ibm.com
> - Reason = unavailable
> - Result = OK
> - Source = LogMonitor
> - Timestamp = Thu Jun 20 20:26:26 2013 IDT
> - User = /DC=org/DC=terena/DC=tcs/C=IL/O=IUCC/CN=Zvi Dubitzki
> ===================================
>
> seems OK
As I explained above, that logging info is not sufficient.
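To make that check less error-prone, here is a minimal Python sketch that
reads the text printed by glite-wms-job-logging-info and reports which
sources logged a Done event. The function names are mine, and the parsing
only assumes the "Event:" / "- key = value" layout visible in the output
you quoted; the sample is an abridged copy of that output.

```python
import re

def parse_events(text):
    """Split logging-info output into one {field: value} dict per event."""
    events = []
    current = None
    for raw in text.splitlines():
        # Drop mail quoting ("> ") and surrounding whitespace.
        line = re.sub(r"^\s*(>\s?)*", "", raw).strip()
        if line.startswith("Event:"):
            current = {"Event": line.split(":", 1)[1].strip()}
            events.append(current)
        elif current is not None and line.startswith("- ") and "=" in line:
            # Split on the first "=" only, so DNs and URLs stay intact.
            key, value = line[2:].split("=", 1)
            current[key.strip()] = value.strip()
    return events

def done_sources(events):
    """Return the Source field of every Done event, in order."""
    return [e.get("Source") for e in events if e["Event"] == "Done"]

sample = """\
Event: Done
- Exit code = 499467184
- Reason = job completed
- Source = LRMS
---
Event: Transfer
- Destination = LRMS
- Result = OK
- Source = LogMonitor
"""

sources = done_sources(parse_events(sample))
print("Done logged by:", sources)
print("LogMonitor logged Done:", "LogMonitor" in sources)
```

As long as "LogMonitor" is missing from that list, the WMS will not
consider the job finished.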
> [...]
>
> Then I try( WMS query):
> -------------------------------
>
> [dubi@ui ~]$ glite-wms-job-output --dir /home/dubi/result https://wms-ce.haifa.il.ibm.com:9000/m6beChpPi9-8-n0UrKzDgA
>
> Connecting to the service https://wms-ce.haifa.il.ibm.com:7443/glite_wms_wmproxy_server
>
>
> ================================================================================
>
> JOB GET OUTPUT OUTCOME
>
> No output files to be retrieved for the job:
> https://wms-ce.haifa.il.ibm.com:9000/m6beChpPi9-8-n0UrKzDgA
>
> ================================================================================
That result suggests the JDL file had no OutputSandbox specified!
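A quick way to check a JDL file for that before submitting is sketched
below in Python. It is a line-based heuristic of my own, not a real JDL
parser: it assumes each attribute starts on its own line as "Name = ...",
and it compares names case-insensitively.

```python
import re

def missing_attributes(jdl_text, required=("InputSandbox", "OutputSandbox")):
    """Return the required attribute names that the JDL text does not set.

    Heuristic: an attribute counts as set if some line starts with "Name =".
    """
    present = {
        m.group(1).lower()
        for m in re.finditer(r"^\s*([A-Za-z_]\w*)\s*=", jdl_text, re.MULTILINE)
    }
    return [name for name in required if name.lower() not in present]

# Hypothetical minimal JDL without any sandbox attributes.
bare_jdl = 'Executable = "/bin/hostname";\nStdOutput = "hello.out";\n'
print("Missing:", missing_attributes(bare_jdl))
```

An empty list means both sandboxes are declared; it says nothing about
whether the files listed in them exist.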
> So no output (yet) . although a trivial /bin/hostname command was submitted
>
> and I try a CE level query for output and get :
> ---------------------------------------------
>
> [dubi@ui ~]$ glite-ce-job-output https://eladby-temp.haifa.il.ibm.com:8443/CREAM267019922
>
> 2013-06-20 20:28:50,407 INFO - For JobID [https://eladby-temp.haifa.il.ibm.com:8443/CREAM267019922] output will be stored in the dir ./eladby-temp.haifa.il.ibm.com_8443_CREAM267019922
> No match for *
> 2013-06-20 20:28:50,770 ERROR - UBERFTP ERROR OUTPUT: 220 eladby-temp.haifa.il.ibm.com GridFTP Server 6.19 (gcc64, 1359994843-83) [Globus Toolkit 5.2.3] ready.
> 230 User kzvo001 logged in.
> Using 1 parallel data chanels for extended block transfers
>
When a WMS is used, the job stores its output directly on the WMS,
usually via globus-url-copy (GridFTP), so there is no output to retrieve
from the CE.
> Any idea why there is no output for the simple /bin/hostname ?
See the explanations above.
> Note that running as originally the glite-wms-job-submit without the 'ShallowRetryCount = -1 '
> statement ( i.e with default retrycount=10) - the wms job status returns - after Running for a while :
>
> ======================= glite-wms-job-status Success =====================
> BOOKKEEPING INFORMATION:
>
> Status info for the Job : https://wms-ce.haifa.il.ibm.com:9000/Lg8vRE9FoXiF0SrOAIB-BA
> Current Status: Aborted
> Logged Reason(s):
> - Cannot take token
> - Cannot take token
> - Cannot take token
> - Cannot take token; reason=1; Failed to init security context
> GSS Major Status: Authentication Failed GSS Minor Status Error Chain:
> globus_gsi_gssapi: SSLv3 handshake problems OpenSSL Error: s3_clnt.c:915:
> in library: SSL routines, function SSL3_GET_SERVER_CERTIFICATE:
> certificate verify failed globus_gsi_callback_module:
> Could not verify credential globus_gsi_callback_module:
> Can't get the local trusted CA certificate:
> Untrusted self-signed certificate in chain with hash ce630362 [...]
There you have it: the job wrapper tried to set up a GridFTP session with
the WMS, but the WMS presented a certificate issued by a CA that the
worker node does not trust! You need to ensure the worker node has that
CA certificate installed.
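To check that on the worker node, here is a minimal Python sketch. It
assumes the usual Globus convention: trusted CAs live in the directory
named by X509_CERT_DIR, or /etc/grid-security/certificates by default, as
files called <hash>.0 plus <hash>.signing_policy. The function name is
mine; the hash is the one from your error message.

```python
import os
import re

def ca_files(ca_hash, cert_dir=None):
    """List files for the given CA hash in the trusted CA directory.

    Assumption: files are named <hash>.<digit(s)> or <hash>.signing_policy.
    """
    cert_dir = cert_dir or os.environ.get(
        "X509_CERT_DIR", "/etc/grid-security/certificates"
    )
    if not os.path.isdir(cert_dir):
        return []
    wanted = re.compile(re.escape(ca_hash) + r"\.(\d+|signing_policy)$")
    return sorted(f for f in os.listdir(cert_dir) if wanted.match(f))

if __name__ == "__main__":
    found = ca_files("ce630362")
    print(found if found else "CA ce630362 not found in the trusted CA directory")
```

If nothing is found, install that CA on the worker node and resubmit the
test job.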