JISCMail - LCG-ROLLOUT Archives

Well, I would say it's a matter of doing what Unix does: pass the exit status down to the caller. The goal probably depends on the person or system that looks at it. In my case:

- It will be very helpful for our monitoring system, allowing us to react better to unusual failure rates.

- I can prepare better statistics on the site success rate (and that includes user failures, but also SE problems or software installation problems).

- I would like Grid to be "less alien" and behave like other systems. And I just think that's the way it should be.

I can only say they are good reasons for me. They're probably not too important to everyone, otherwise this would already be a feature in CreamCE.

Besides, if this creates a problem with the WMS, maybe the discussion is over.
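
To make the request concrete, here is a minimal sketch of a wrapper that hands the payload's exit status to its caller. Everything in it is an assumption for illustration and not taken from the CREAM or LCG-CE job wrapper: the stage functions are empty placeholders, PROPAGATE_PAYLOAD_EXIT is a hypothetical site-level switch, and 200 is an arbitrarily chosen code reserved for wrapper-level problems (a payload that itself exits with 200 would be ambiguous, so a real design would need a separate channel for that).

```shell
#!/bin/sh
# Hypothetical job wrapper sketch; names and values are illustrative only.

WRAPPER_FAILURE=200                  # assumed code reserved for wrapper problems
: "${PROPAGATE_PAYLOAD_EXIT:=yes}"   # assumed site-level switch (yes/no)

prepare_env() { :; }   # placeholder: prepare the execution env on the WN
get_isb()     { :; }   # placeholder: fetch the input sandbox
put_osb()     { :; }   # placeholder: upload the output sandbox

run_wrapped() {
    # Wrapper-level stages: any failure here is reported with the reserved code.
    prepare_env || return "$WRAPPER_FAILURE"
    get_isb     || return "$WRAPPER_FAILURE"

    # Run the user job (payload) and remember its exit status.
    "$@"
    payload_status=$?

    put_osb     || return "$WRAPPER_FAILURE"

    if [ "$PROPAGATE_PAYLOAD_EXIT" = yes ]; then
        # Unix-style: pass the payload's status down to the caller.
        return "$payload_status"
    fi
    # Traditional behavior: the wrapper itself ran fine, so report 0.
    return 0
}
```

With the switch left at its default, `run_wrapped sh -c 'exit 3'` returns 3, which is what the batch system would then log. With PROPAGATE_PAYLOAD_EXIT=no it returns 0, as the wrapper does today.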

BR/Pablo

On Monday 12 September 2011 07:16:57 you wrote:

> I still don't understand:

> - if it is "just" a matter of easily exposing that information to site

> admin, so that she can investigate possible problems at site level

> - if jobs that exited with an exit code <> 0 (while there were no

> other problems) should be considered failed jobs (as jobs failed because

> e.g. the submission to the batch system failed, because e.g. the

> transfer of the sandbox files failed, etc). And in this case e.g. the

> resubmission through the WMS, if enabled, should be done. I don't

> think this can be a configurable behavior at site level ...

> Cheers, Massimo

> On Sun, 11 Sep 2011, Pablo Fernandez wrote:

> > Hi,

> >

> >> these days (at our site) most of the user-level (payload) errors have

> >> nothing to do with the worker node or cluster itself. common problems:

> >>

> >> a storage element somewhere is not responding

> >> something is wrong with the VO-installed software

> >> user error (job just crashes due to programming errors)

> >

> > Actually, from the list you've given, the first two items may be local

> > sysadmin business... on the third there is little we can do.

> >

> > I still don't see the reason for masking... is it WMS resubmission? If

> > so, the only reason I see for not resubmitting is the last, the other

> > two may have been temporary stuff, timeouts...

> >

> > I am also of the opinion that Grid should work as close to Unix as

> > possible, and this seems to be an effort in the opposite direction.

> >

> > BR/Pablo

> >

> >> if it were true that most payload errors were due to site problems, i'd

> >> agree with the approach. making it configurable is always okay as long

> >> as the configuration does not lead to lots of complexity. which in

> >> itself is another source of error.

> >>

> >> JT

> >>

> >> On Sep 10, 2011, at 23:49 , Maarten Litmaath wrote:

> >>> Ciao Massimo,

> >>>

> >>>> First of all: there isn't anything different wrt the LCG-CE. Also for

> >>>> the LCG-CE the exit code that you see in the pbs log file is the one

> >>>> of the job wrapper (jw), and not the one of the user job, because it

> >>>> is the jw that is executed in the batch system.

> >>>> As I said, the jobwrapper is a script. Oversimplifying it, it is

> >>>> something like:

> >>>>

> >>>> #!/bin/sh

> >>>> <prepare execution env on WN>

> >>>> <get ISB>

> >>>> <run user job>

> >>>> <put OSB>

> >>>>

> >>>> If this script runs properly, it returns 0 as exit code, and not the

> >>>> exit code of the user job. Again there is the very same scenario in

> >>>> the jw used for the LCG-CE.

> >>>> A value different than 0 means that there was a problem in the

> >>>> execution of the job wrapper (e.g. a problem with sandbox transfers)

> >>>

> >>> That is the traditional view indeed.

> >>>

> >>>> User job exit code is not hidden: it is returned in

> >>>> glite-ce-job-status output, in wms-job-status, in wms-logging-info.

> >>>> It was supposed to be reported also in the glite-ce-cream.log:

> >>>> investigating why this is not the case.

> >>>>

> >>>> The management of jobs finished with an exit code <> 0 is something

> >>>> that was discussed several years ago, in the days of Datagrid. It was

> >>>> decided that they should be considered as successfully done (so e.g. the

> >>>> WMS shouldn't trigger a resubmission) but the exit code <> 0 should

> >>>> be returned to the user so she can investigate.

> >>>

> >>> Even that could be discussed again: since the payload may have failed

> >>> due to a problem with the site (e.g. full file system), a resubmission

> >>> could be desirable if the JDL allows it. We may want to be careful

> >>> there and make that behavior depend on a new JDL attribute.

> >>>

> >>>> I don't fully understand what the RFE is here. To have the jw return

> >>>> with the user job exit code (so that this value is reported in the PBS

> >>>> log file) ?

> >>>

> >>> Right. It would seem nice if:

> >>>

> >>> - the site admin could configure that behavior;

> >>> - the WMS could still distinguish between job wrapper and payload

> >>> problems.


> INFN Sezione di Padova

> Via Marzolo, 8

> 35131 Padova - Italy E-mail: massimo.sgaravatto [at] pd.infn.it

> Tel: ++39 0499677360 Skype: massimo.sgaravatto

> Fax: ++39 0498275952