Well, I would say it's a matter of doing what Unix does: pass the exit status back to the caller. The purpose probably depends on the person or system looking at it. In my case:
- It would be very helpful for our monitoring system, allowing us to react better to unusual failure rates.
- I could prepare better statistics on site success rates (that includes user failures, but also SE problems or software installation problems).
- I would like Grid to be "less alien" and behave like other systems. And I just think that's the way it should be.
I can only say these are good reasons for me. They are probably not that important to everyone, otherwise this would already be a feature in CreamCE.
Besides, if this creates a problem with the WMS, maybe the discussion is over.
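
To make the request concrete, here is a minimal sketch of a job wrapper that can pass the payload's exit status on to the batch system. It is not the real CREAM or LCG-CE job wrapper: the function names, the PROPAGATE_PAYLOAD_EXIT switch and the 201-203 wrapper error codes are all assumptions made up for illustration.

```shell
#!/bin/sh
# Sketch of a job wrapper that can hand the payload's exit status to the
# batch system. All names and codes below are hypothetical placeholders,
# not the real CREAM or LCG-CE job wrapper.

prepare_env() { :; }   # placeholder: prepare execution env on the WN
get_isb()     { :; }   # placeholder: fetch the input sandbox
put_osb()     { :; }   # placeholder: upload the output sandbox

run_wrapper() {
    # Wrapper-level failures use assumed reserved codes 201-203.
    prepare_env || return 201
    get_isb     || return 202

    # Run the user job (passed as arguments) and remember its status.
    payload_status=0
    "$@" || payload_status=$?
    echo "payload exit status: $payload_status" >&2

    put_osb || return 203

    # Assumed site-level switch; "no" keeps the traditional behavior
    # of returning 0 whenever the wrapper itself ran properly.
    if [ "${PROPAGATE_PAYLOAD_EXIT:-no}" = yes ]; then
        return "$payload_status"
    fi
    return 0
}
```

A wrapper script would end with something like `run_wrapper ./user_job.sh; exit $?`. Note that a payload could itself exit with 201-203, so reserved codes alone would not be enough for the WMS to tell wrapper problems from payload problems; that part would need a separate channel.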
BR/Pablo
On Monday 12 September 2011 07:16:57 you wrote:
> I still don't understand:
>
> - if it is "just" a matter of easily exposing that information to the site
> admin, so that she can investigate possible problems at site level
>
> - if jobs that exited with an error code <> 0 (while there were no
> other problems) should be considered failed jobs (like jobs that failed because
> e.g. the submission to the batch system failed, because e.g. the
> transfer of the sandbox files failed, etc.). And in this case e.g. the
> resubmission through the WMS, if enabled, should be done. I don't
> think this can be a configurable behavior at site level ...
>
> Cheers, Massimo
>
> On Sun, 11 Sep 2011, Pablo Fernandez wrote:
> > Hi,
> >
> >> these days (at our site) most of the user-level (payload) errors have
> >> nothing to do with the worker node or cluster itself. common problems:
> >>
> >> a storage element somewhere is not responding
> >> something is wrong with the VO-installed software
> >> user error (job just crashes due to programming errors)
> >
> > Actually, from the list you've given, the first two items may be local
> > sysadmin business... on the third there is little we can do.
> >
> > I still don't see the reason for masking... is it WMS resubmission? If
> > so, the only case I see for not resubmitting is the last one; the other
> > two may have been temporary problems, timeouts...
> >
> > I am also of the opinion that Grid should work as close to Unix as
> > possible, and this seems to be an effort in the opposite direction.
> >
> > BR/Pablo
> >
> >> if it were true that most payload errors were due to site problems, i'd
> >> agree with the approach. making it configurable is always okay as long
> >> as the configuration does not lead to lots of complexity. which in
> >> itself is another source of error.
> >>
> >> JT
> >>
> >> On Sep 10, 2011, at 23:49 , Maarten Litmaath wrote:
> >>> Ciao Massimo,
> >>>
> >>>> First of all: there isn't anything different wrt the LCG-CE. Also for
> >>>> the LCG-CE the exit code that you see in the pbs log file is the one
> >>>> of the job wrapper (jw), and not the one of the user job, because it
> >>>> is the jw that is executed in the batch system.
> >>>> As I said, the jobwrapper is a script. Oversimplifying it, it is
> >>>> something like:
> >>>>
> >>>> #!/bin/sh
> >>>> <prepare execution env in WN>
> >>>> <get ISB>
> >>>> <run user job>
> >>>> <put OSB>
> >>>>
> >>>> If this script runs properly, it returns 0 as exit code, and not the
> >>>> exit code of the user job. Again there is the very same scenario in
> >>>> the jw used for the LCG-CE.
> >>>> A value different than 0 means that there was a problem in the
> >>>> execution of the job wrapper (e.g. a problem with sandbox transfers)
> >>>
> >>> That is the traditional view indeed.
> >>>
> >>>> User job exit code is not hidden: it is returned in
> >>>> glite-ce-job-status output, in wms-job-status, in wms-logging-info.
> >>>> It was supposed to be reported also in the glite-ce-cream.log:
> >>>> investigating why this is not the case.
> >>>>
> >>>> The management of jobs finished with an exit code <> 0 is something
> >>>> that was discussed several years ago, in the days of Datagrid. It was
> >>>> decided that they should be considered as successfully done (so e.g. the
> >>>> WMS shouldn't trigger a resubmission) but the exit code <> 0 should
> >>>> be returned to the user so she can investigate.
> >>>
> >>> Even that could be discussed again: since the payload may have failed
> >>> due to a problem with the site (e.g. full file system), a resubmission
> >>> could be desirable if the JDL allows it. We may want to be careful
> >>> there and make that behavior depend on a new JDL attribute.
> >>>
> >>>> I don't fully understand what the RFE is here. To have the jw return
> >>>> the user job exit code (so that this value is reported in the PBS
> >>>> log file)?
> >>>
> >>> Right. It would seem nice if:
> >>>
> >>> - the site admin could configure that behavior;
> >>> - the WMS could still distinguish between job wrapper and payload
> >>> problems.
>
> \|||/
> -----------0oo----( o o )----oo0-------------------
> (_)
> INFN Sezione di Padova
> Via Marzolo, 8
> 35131 Padova - Italy E-mail: massimo.sgaravatto [at] pd.infn.it
> Tel: ++39 0499677360 Skype: massimo.sgaravatto
> Fax: ++39 0498275952