Well, I would say it's a matter of doing what Unix does: pass the exit status
down to the caller.

The purpose, I mean, the goal, probably depends on the person/system that
takes a look at it. In my case:

- It will be very helpful for our monitoring system, allowing us to react
  better to strange failure rates.
- I can prepare better statistics on site success rate (and that includes
  user failures, but also SE problems or software installation problems).
- I would like Grid to be "less alien" and behave like other systems.

And I just think that's the way it should be. I can only say they are good
reasons to me. They're probably not too important to everyone, otherwise this
would already be a feature in CreamCE. Besides, if this creates a problem
with the WMS, maybe the discussion is over.

BR/Pablo

On Monday 12 September 2011 07:16:57 you wrote:
> I still don't understand:
>
> - if it is "just" a matter of easily exposing that information to site
> admin, so that she can investigate possible problems at site level
>
> - if jobs exited with an error code <> 0 (while there were no
> other problems) should be considered failed jobs (as jobs failed because
> e.g. the submission to the batch system failed, because e.g. the
> transfer of the sandbox files failed, etc). And in this case e.g. the
> resubmission through the WMS, if enabled, should be done. I don't
> think this can't be a configurable behavior at site level ...
>
> Cheers, Massimo
>
> On Sun, 11 Sep 2011, Pablo Fernandez wrote:
> > Hi,
> >
> >> these days (at our site) most of the user-level (payload) errors have
> >> nothing to do with the worker node or cluster itself. common problems:
> >>
> >> a storage element somewhere is not responding
> >> something is wrong with the VO-installed software
> >> user error (job just crashes due to programming errors)
> >
> > Actually, from the list you've given, the first two items may be local
> > sysadmin business... on the third there is little we can do.
> >
> > I still don't see the reason for masking... is it WMS resubmission? If
> > so, the only reason I see for not resubmitting is the last, the other
> > two may have been temporary stuff, timeouts...
> >
> > I am also of the opinion that Grid should work as close to Unix as
> > possible, and this seems to be an effort in the opposite direction.
> >
> > BR/Pablo
> >
> >> if it were true that most payload errors were due to site problems, i'd
> >> agree with the approach. making it configurable is always okay as long
> >> as the configuration does not lead to lots of complexity. which in
> >> itself is another source of error.
> >>
> >> JT
> >>
> >> On Sep 10, 2011, at 23:49 , Maarten Litmaath wrote:
> >>> Ciao Massimo,
> >>>
> >>>> First of all: there isn't anything different wrt the LCG-CE. Also for
> >>>> the LCG-CE the exit code that you see in the pbs log file is the one
> >>>> of the job wrapper (jw), and not the one of the user job, because it
> >>>> is the jw that is executed in the batch system.
> >>>> As I said, the jobwrapper is a script. Oversimplifying it, it is
> >>>> something like:
> >>>>
> >>>> #!/bin/sh
> >>>> <prepare execution env in WN>
> >>>> <get ISB>
> >>>> <run user job>
> >>>> <put OSB>
> >>>>
> >>>> If this script runs properly, it returns 0 as exit code, and not the
> >>>> exit code of the user job. Again there is the very same scenario in
> >>>> the jw used for the LCG-CE.
> >>>> A value different than 0 means that there was a problem in the
> >>>> execution of the job wrapper (e.g. a problem with sandbox transfers)
> >>>
> >>> That is the traditional view indeed.
> >>>
> >>>> User job exit code is not hidden: it is returned in
> >>>> glite-ce-job-status output, in wms-job-status, in wms-logging-info.
> >>>> It was supposed to be reported also in the glite-ce-cream.log:
> >>>> investigating why this is not the case.
> >>>>
> >>>> The management of jobs finished with an exit code <> 0 is something
> >>>> that was discussed several years ago, in the days of Datagrid. It was
> >>>> decided that they should be considered as successfully done (so e.g.
> >>>> the WMS shouldn't trigger a resubmission) but the exit code <> 0
> >>>> should be returned to the user so she can investigate.
> >>>
> >>> Even that could be discussed again: since the payload may have failed
> >>> due to a problem with the site (e.g. full file system), a resubmission
> >>> could be desirable if the JDL allows it. We may want to be careful
> >>> there and make that behavior depend on a new JDL attribute.
> >>>
> >>>> I don't fully understand what is the RFE here. To have the jw return
> >>>> with the user job exit code (so that this value is reported in the PBS
> >>>> log file) ?
> >>>
> >>> Right. It would seem nice if:
> >>>
> >>> - the site admin could configure that behavior;
> >>> - the WMS could still distinguish between job wrapper and payload
> >>> problems.
>
>               \|||/
> -----------0oo----( o o )----oo0-------------------
>                    (_)
> INFN Sezione di Padova
> Via Marzolo, 8
> 35131 Padova - Italy       E-mail: massimo.sgaravatto [at] pd.infn.it
> Tel: ++39 0499677360       Skype: massimo.sgaravatto
> Fax: ++39 0498275952
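
For illustration, the RFE discussed in this thread (the job wrapper returning the user job's exit code, switchable by the site admin, while wrapper and payload problems stay distinguishable) could be sketched roughly as below. This is only a minimal sketch, not the actual CREAM or LCG-CE job wrapper: the names PROPAGATE_PAYLOAD_EXIT, JW_FAILURE, JW_STATUS_FILE, run_job_wrapper, stage_in and stage_out are all hypothetical, and the reserved code 200 is an arbitrary assumption.

```shell
#!/bin/sh
# Hypothetical sketch of a job wrapper that can optionally hand the payload's
# exit code to the batch system. None of these names exist in CREAM or LCG-CE.

: "${PROPAGATE_PAYLOAD_EXIT:=no}"        # assumed site-level switch (yes/no)
: "${JW_FAILURE:=200}"                   # assumed code reserved for wrapper problems
: "${JW_STATUS_FILE:=./jw_status.txt}"   # assumed side channel read by the middleware

stage_in()  { :; }   # placeholder for <prepare execution env in WN> and <get ISB>
stage_out() { :; }   # placeholder for <put OSB>

run_job_wrapper() {
    # A failure before the payload starts is a wrapper problem.
    if ! stage_in; then
        echo "wrapper_failed=stage_in" > "$JW_STATUS_FILE"
        return "$JW_FAILURE"
    fi

    # <run user job>: the arguments are the payload command line.
    payload_rc=0
    "$@" || payload_rc=$?

    # A failure while returning the output sandbox is also a wrapper problem.
    if ! stage_out; then
        echo "wrapper_failed=stage_out payload_exit=$payload_rc" > "$JW_STATUS_FILE"
        return "$JW_FAILURE"
    fi

    # Always record the payload result separately from the wrapper's own status.
    echo "wrapper_ok payload_exit=$payload_rc" > "$JW_STATUS_FILE"

    if [ "$PROPAGATE_PAYLOAD_EXIT" = "yes" ]; then
        return "$payload_rc"   # requested behaviour: the batch log shows the payload code
    fi
    return 0                   # traditional behaviour: 0 whenever the wrapper itself ran fine
}
```

With PROPAGATE_PAYLOAD_EXIT=no, `run_job_wrapper sh -c 'exit 3'` returns 0, as in the traditional view; with PROPAGATE_PAYLOAD_EXIT=yes it returns 3, which is the value the batch system log would then record. A payload could itself exit with the reserved code, which is why the sketch also writes the result to a separate status file instead of relying on the exit code alone.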