Hi all
First of all: there is nothing different here with respect to the LCG-CE.
For the LCG-CE as well, the exit code you see in the PBS log file is that
of the job wrapper (jw), not that of the user job, because it is the jw
that is executed in the batch system.
As I said, the job wrapper is a script. Oversimplifying, it is
something like:
#!/bin/sh
<prepare execution env in WN>
<get ISB>
<run user job>
<put OSB>
If this script runs properly, it returns 0 as its exit code, not the
exit code of the user job. Again, the very same scenario applies to the
jw used for the LCG-CE.
A value other than 0 means there was a problem in the execution of the
job wrapper (e.g. a problem with the sandbox transfers).
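To make the exit-code semantics concrete, here is a minimal, self-contained sketch (not the actual CREAM jw code; the file names and the fake user job are invented for illustration). The wrapper deliberately exits 0 even though the "user job" fails with exit code 3:

```shell
#!/bin/sh
# Write a toy job wrapper to a file, then run it. The wrapper captures
# the user job's exit code but exits 0 itself, which is the value that
# ends up in the PBS log. All names here are illustrative.
cat > /tmp/jw_sketch.sh <<'EOF'
#!/bin/sh
# <prepare execution env in WN>
cd "$(mktemp -d)" || exit 1
# <get ISB>   (input sandbox transfer would happen here)
# <run user job> -- its exit code is captured, not propagated
sh -c 'exit 3'              # stand-in for the user executable
user_exit=$?
# <put OSB>   (output sandbox transfer would happen here)
echo "user job exit code: $user_exit"
exit 0                      # wrapper succeeded: this is what PBS sees
EOF

sh /tmp/jw_sketch.sh > /tmp/jw_out.txt
wrapper_rc=$?
cat /tmp/jw_out.txt
echo "wrapper exit code: $wrapper_rc"
```

Running this prints the captured user exit code (3) while the wrapper itself, and hence the PBS log entry, reports 0.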
The user job exit code is not hidden: it is returned in the
glite-ce-job-status output, in wms-job-status, and in wms-logging-info.
It was supposed to be reported in glite-ce-cream.log as well; I am
investigating why this is not the case.
The handling of jobs that finish with an exit code <> 0 is something that
was discussed several years ago, back in the DataGrid days. It was decided
that they should be considered successfully done (so, e.g., the WMS
shouldn't trigger a resubmission), but the exit code <> 0 should be
returned to the user so she can investigate.
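In other words, the resubmission decision looks only at the wrapper's exit code; a non-zero user exit code is merely reported back. A sketch of that policy (hypothetical function and variable names, not actual WMS code):

```shell
#!/bin/sh
# Hypothetical sketch of the policy described above: only a non-zero
# job wrapper exit code triggers a resubmission; a non-zero user exit
# code is reported to the user. Names are illustrative.

decide() {
    jw_exit=$1      # exit code of the job wrapper (what PBS sees)
    user_exit=$2    # exit code of the user job (reported by CREAM)

    if [ "$jw_exit" -ne 0 ]; then
        echo "resubmit"                    # grid-side failure, e.g. sandbox transfer
    elif [ "$user_exit" -ne 0 ]; then
        echo "done, user exit $user_exit"  # job ran; user investigates
    else
        echo "done"
    fi
}

decide 0 0    # -> done
decide 0 3    # -> done, user exit 3
decide 1 0    # -> resubmit
```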
I don't fully understand what the RFE here would be. To have the jw
return with the user job's exit code (so that this value is reported in
the PBS log file)?
Cheers, Massimo
On Fri, 9 Sep 2011, Maarten Litmaath wrote:
> Hi Pablo,
>
>> Well, I don't think making the sysadmin less aware of the things that happen
>> inside the cluster is good at all. If a sysadmin doesn't want to see if a user
>> fails his jobs, he/she can just avoid looking at the PBS exit status... but
>> what if I want to know? What if I make statistics of how good my cluster is?
>> There could be a user problem, but it could also be your SE giving trouble,
>> and that affects the sysadmin. Or a timeout, or a full disk somewhere...
>>
>> Some months ago, I developed a ganglia metric to measure errors in our PBS
>> system:
>> http://ganglia.lcg.cscs.ch/ganglia//pbserrors.html
>> Then I also plot it, and show it in our main monitoring page. It turned out to
>> be VERY useful. If you see one failure here or there, you can just assume it's
>> the normal grid stuff. But if suddenly the number of errors/time rises, that
>> means something bad is going on. If it's all from the same user, it's probably
>> a user problem. If it's all from a worker node, or from a CE... well, you can
>> see patterns.
>>
>> There are many more possibilities around this. It's information, usable,
>> filterable, and IMHO shouldn't be masked.
>
> Those are good arguments indeed. Feel free to open an RFE in GGUS.
>
> Maybe it could even be made configurable in CREAM...
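For reference, a counter in the spirit of what Pablo describes can be built from the PBS accounting log. The sample records below are invented; the parsing assumes the usual Torque/PBS accounting layout (semicolon-separated fields, with space-separated key=value pairs in an "E" job-end record):

```shell
#!/bin/sh
# Count non-zero Exit_status values per user from PBS accounting
# "E" (job end) records. Sample data is made up for illustration.
cat > /tmp/pbs_acct_sample <<'EOF'
09/09/2011 10:00:01;E;101.ce.example.org;user=alice Exit_status=0
09/09/2011 10:00:05;E;102.ce.example.org;user=bob Exit_status=271
09/09/2011 10:00:09;E;103.ce.example.org;user=bob Exit_status=1
09/09/2011 10:00:12;E;104.ce.example.org;user=alice Exit_status=0
EOF

awk -F';' '$2 == "E" {
    user = ""; status = 0
    n = split($4, kv, " ")
    for (i = 1; i <= n; i++) {
        if (kv[i] ~ /^user=/)        { sub(/^user=/, "", kv[i]); user = kv[i] }
        if (kv[i] ~ /^Exit_status=/) { sub(/^Exit_status=/, "", kv[i]); status = kv[i] }
    }
    if (status + 0 != 0) errors[user]++
}
END { for (u in errors) print u, errors[u] }' /tmp/pbs_acct_sample > /tmp/pbs_errors.txt

cat /tmp/pbs_errors.txt
```

With the sample data, only bob's two failed jobs are counted; per-user (or per-node) counts like these are what feed a metric such as Pablo's.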
>
Massimo Sgaravatto
INFN Sezione di Padova
Via Marzolo, 8
35131 Padova - Italy
E-mail: massimo.sgaravatto [at] pd.infn.it
Tel: +39 0499677360    Skype: massimo.sgaravatto
Fax: +39 0498275952