Ciao Massimo,
> First of all: there isn't anything different wrt the LCG-CE. Also for the
> LCG-CE the exit code that you see in the pbs log file is the one of the job
> wrapper (jw), and not the one of the user job, because it is the jw that is
> executed in the batch system.
> As I said, the jobwrapper is a script. Oversimplifying it, it is something
> like:
>
> #!/bin/sh
> <prepare execution env in WN>
> <get ISB>
> <run user job>
> <put OSB>
>
> If this script runs properly, it returns 0 as exit code, and not the
> exit code of the user job. Again there is the very same scenario in the jw
> used for the LCG-CE.
> A value different than 0 means that there was a problem in the execution of
> the job wrapper (e.g. a problem with sandbox transfers)
That is the traditional view indeed.
> User job exit code is not hidden: it is returned in glite-ce-job-status
> output, in wms-job-status, in wms-logging-info.
> It was supposed to be reported also in the glite-ce-cream.log: investigating
> why this is not the case.
>
> The management of jobs finished with an exit code <> 0 is something that was
> discussed several years ago, in the days of Datagrid. It was decided that they
> should be considered successfully done (so e.g. the WMS shouldn't trigger a
> resubmission) but the exit code <> 0 should be returned to the user so she can
> investigate.
Even that could be discussed again: since the payload may have failed due
to a problem with the site (e.g. full file system), a resubmission could be
desirable if the JDL allows it. We may want to be careful there and make
that behavior depend on a new JDL attribute.
> I don't fully understand what the RFE is here. Is it to have the jw return the
> user job exit code (so that this value is reported in the PBS log file)?
Right. It would seem nice if:
- the site admin could configure that behavior;
- the WMS could still distinguish between job wrapper and payload problems.
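
To make those two points concrete, here is a rough sketch based on the simplified wrapper outline quoted above. It is not the actual CREAM or LCG-CE job wrapper: the switch PROPAGATE_PAYLOAD_EXIT, the reserved value WRAPPER_FAILURE_CODE and the helper functions are all hypothetical names made up for illustration.

```sh
#!/bin/sh
# Sketch only. PROPAGATE_PAYLOAD_EXIT and WRAPPER_FAILURE_CODE are
# hypothetical names, not existing job wrapper settings.

# Hypothetical site-admin switch: "yes" makes the wrapper exit with the
# payload's exit code, anything else keeps the traditional behavior.
PROPAGATE_PAYLOAD_EXIT="${PROPAGATE_PAYLOAD_EXIT:-no}"

# Hypothetical value reserved for problems in the wrapper itself
# (e.g. a failed sandbox transfer).
WRAPPER_FAILURE_CODE=200

# Placeholders for the real steps of the wrapper.
prepare_env() { :; }   # prepare execution env in WN
get_isb()     { :; }   # fetch the input sandbox
put_osb()     { :; }   # upload the output sandbox

run_wrapper() {
    prepare_env || return "$WRAPPER_FAILURE_CODE"
    get_isb     || return "$WRAPPER_FAILURE_CODE"

    # Run the user job and remember its exit code.
    "$@"
    payload_rc=$?

    put_osb     || return "$WRAPPER_FAILURE_CODE"

    if [ "$PROPAGATE_PAYLOAD_EXIT" = "yes" ]; then
        # The batch system log would now show the payload exit code.
        return "$payload_rc"
    fi
    # Traditional behavior: the wrapper ran fine, so report 0.
    return 0
}
```

A payload could itself exit with the reserved value, so the exit code seen by the batch system would not be enough on its own. The WMS would presumably still have to rely on the payload exit code reported through the existing channels to tell the two cases apart.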