Well, I would say it's a matter of doing what Unix does: pass the exit status 
down to the caller. The purpose, I mean, the goal, probably depends on the 
person/system that takes a look at it. In my case:

- It will be very helpful for our monitoring system, allowing us to react 
better to unusual failure rates.
- I can prepare better statistics on site success rate (and that includes user 
failures, but also SE problems or Software installation problems)
- I would like Grid to be "less alien" and behave like other systems. And I 
just think that's the way it should be.

I can only say they are good reasons to me. They're probably not too important 
to everyone, otherwise this would already be a feature in CreamCE.

Besides, if this creates a problem with the WMS, maybe the discussion is over.
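
For what it's worth, here is a minimal sketch of the behaviour being asked for: a wrapper that passes the payload's exit status down to the caller while keeping wrapper-level problems distinguishable. It is not the actual CREAM job wrapper. The function names, the `PROPAGATE_PAYLOAD_EXIT` switch and the reserved code 200 are all assumptions made up for illustration.

```shell
#!/bin/sh
# Hypothetical job wrapper sketch. Nothing here is taken from the real
# CREAM or LCG-CE job wrapper; every name below is illustrative.

# Assumed reserved status for wrapper-level failures (e.g. sandbox transfer).
# A payload that itself exits with 200 would be indistinguishable from this.
WRAPPER_FAILURE=200

# Assumed site-level switch: "yes" returns the payload status to the batch
# system, anything else keeps the traditional "0 if the wrapper ran fine".
PROPAGATE_PAYLOAD_EXIT=${PROPAGATE_PAYLOAD_EXIT:-yes}

# Placeholders for the real wrapper steps; each one just succeeds here.
prepare_env() { :; }   # prepare execution environment on the WN
get_isb()     { :; }   # fetch the input sandbox
put_osb()     { :; }   # upload the output sandbox

run_wrapper() {
    prepare_env || return "$WRAPPER_FAILURE"
    get_isb     || return "$WRAPPER_FAILURE"

    # Run the user job given as arguments and remember its exit status.
    "$@"
    payload_status=$?

    put_osb || return "$WRAPPER_FAILURE"

    if [ "$PROPAGATE_PAYLOAD_EXIT" = yes ]; then
        return "$payload_status"
    fi
    return 0
}
```

Calling `run_wrapper` with the user job as the last command of the script would make the batch system log the payload's status, and setting the switch to anything other than `yes` restores the current behaviour.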

BR/Pablo


On Monday 12 September 2011 07:16:57 you wrote:
> I still don't understand:
> 
> - if it is "just" a matter of easily exposing that information to the site
>    admin, so that she can investigate possible problems at site level
> 
> - if jobs exited with an error code <> 0 (while there were no
>    other problems) should be considered failed jobs (as jobs failed because
>    e.g. the submission to the batch system failed, because e.g. the
>    transfer of the sandbox files failed, etc). And in this case e.g. the
>    resubmission through the WMS, if enabled, should be done. I don't
>    think this can be a configurable behavior at site level ...
> 
>  				Cheers, Massimo
> 
> On Sun, 11 Sep 2011, Pablo Fernandez wrote:
> > Hi,
> > 
> >> these days (at our site) most of the user-level (payload) errors have
> >> nothing to do with the worker node or cluster itself.  common problems:
> >> 
> >> a storage element somewhere is not responding
> >> something is wrong with the VO-installed software
> >> user error (job just crashes due to programming errors)
> > 
> > Actually, from the list you've given, the first two items may be local
> > sysadmin business... on the third there is little we can do.
> > 
> > I still don't see the reason for masking... is it WMS resubmission? If
> > so, the only reason I see for not resubmitting is the last; the other
> > two may have been transient problems, timeouts...
> > 
> > I am also of the opinion that Grid should work as close to Unix as
> > possible, and this seems to be an effort in the opposite direction.
> > 
> > BR/Pablo
> > 
> >> if it were true that most payload errors were due to site problems, i'd
> >> agree with the approach. making it configurable is always okay as long
> >> as the configuration does not lead to lots of complexity. which in
> >> itself is another source of error.
> >> 
> >> 										JT
> >> 
> >> On Sep 10, 2011, at 23:49 , Maarten Litmaath wrote:
> >>> Ciao Massimo,
> >>> 
> >>>> First of all: there isn't anything different wrt the LCG-CE. Also for
> >>>> the LCG-CE the exit code that you see in the pbs log file is the one
> >>>> of the job wrapper (jw), and not the one of the user job, because it
> >>>> is the jw that is executed in the batch system.
> >>>> As I said, the jobwrapper is a script. Oversimplifying it, it is
> >>>> something like:
> >>>> 
> >>>> #!/bin/sh
> >>>> <prepare execution env in WN>
> >>>> <get ISB>
> >>>> <run user job>
> >>>> <put OSB>
> >>>> 
> >>>> If this script runs properly, it returns 0 as exit code, and not the
> >>>> exit code of the user job. Again there is the very same scenario in
> >>>> the jw used for the LCG-CE.
> >>>> A value different than 0 means that there was a problem in the
> >>>> execution of the job wrapper (e.g. a problem with sandbox transfers)
> >>> 
> >>> That is the traditional view indeed.
> >>> 
> >>>> User job exit code is not hidden: it is returned in
> >>>> glite-ce-job-status output, in wms-job-status, in wms-logging-info.
> >>>> It was supposed to be reported also in the glite-ce-cream.log:
> >>>> investigating why this is not the case.
> >>>> 
> >>>> The management of jobs finished with an exit code <> 0 is something
> >>>> that was discussed several years ago, in the days of Datagrid. It was
> >>>> decided that they should be considered successfully done (so e.g. the
> >>>> WMS shouldn't trigger a resubmission) but the exit code <> 0 should
> >>>> be returned to the user so she can investigate.
> >>> 
> >>> Even that could be discussed again: since the payload may have failed
> >>> due to a problem with the site (e.g. full file system), a resubmission
> >>> could be desirable if the JDL allows it.  We may want to be careful
> >>> there and make that behavior depend on a new JDL attribute.
> >>> 
> >>>> I don't fully understand what the RFE is here. To have the jw return
> >>>> with the user job exit code (so that this value is reported in the PBS
> >>>> log file)?
> >>> 
> >>> Right.  It would seem nice if:
> >>> 
> >>> - the site admin could configure that behavior;
> >>> - the WMS could still distinguish between job wrapper and payload
> >>> problems.
> 
>                     \|||/
> -----------0oo----( o o )----oo0-------------------
>                      (_)
> INFN Sezione di Padova
> Via Marzolo, 8
> 35131 Padova - Italy    E-mail: massimo.sgaravatto [at] pd.infn.it
> Tel: ++39 0499677360    Skype: massimo.sgaravatto
> Fax: ++39 0498275952