Hi Pablo, all
Once we add the application error code to the CREAM log file, I would
suggest that your monitoring system parse these log files.
Besides the information that you can find in the pbs log files, they
contain a lot of other information, such as the failure reason, the user DN,
the worker node, etc.
A different issue discussed in this thread is the
management of jobs for which the application error code is different from 0.
Someone claimed that they should be considered failed jobs to all intents and
purposes ...
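To make the two options concrete, here is a minimal sketch of a job wrapper
that could support both behaviours. It is only an illustration and not the
actual CREAM job wrapper: the function names (prepare_env, get_isb, put_osb,
run_wrapper), the PROPAGATE_PAYLOAD_RC switch and the payload_exit_code=
output line are all hypothetical. By default it returns 0 whenever the wrapper
steps succeed; with the switch set to "yes" it returns the exit code of the
user job instead.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real CREAM job wrapper.
# All names below are placeholders chosen for illustration.

prepare_env() { :; }   # placeholder: prepare the execution env on the WN
get_isb()     { :; }   # placeholder: fetch the input sandbox (ISB)
put_osb()     { :; }   # placeholder: upload the output sandbox (OSB)

run_wrapper() {
    # "$@" is the user job command line.
    prepare_env || return 1      # wrapper-level failure
    get_isb     || return 1      # wrapper-level failure

    payload_rc=0
    "$@" || payload_rc=$?        # run the user job, remember its exit code

    put_osb     || return 1      # wrapper-level failure

    # Always report the payload exit code separately from the exit status.
    echo "payload_exit_code=$payload_rc"

    # Assumed site-level switch: pass the payload exit code to the caller.
    if [ "${PROPAGATE_PAYLOAD_RC:-no}" = "yes" ]; then
        return "$payload_rc"
    fi
    return 0                     # traditional behaviour
}

# Run the wrapper only when a user job command line is given.
if [ "$#" -gt 0 ]; then
    run_wrapper "$@"
fi
```

Note that with the switch on, a payload exit code of 1 cannot be told apart
from a wrapper failure by the exit status alone, which is why the sketch also
prints the payload exit code on a separate line.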
Cheers, Massimo
On Mon, 12 Sep 2011, Pablo Fernandez wrote:
>
> Well, I would say it's a matter of doing what Unix does: pass the exit status down to the
> caller. The purpose, I mean, the goal, probably depends on the person/system that takes a look
> at it. In my case:
>
>
> - It will be very helpful for our monitoring system, allowing us to react better to
> strange failure rates.
>
> - I can prepare better statistics on site success rate (and that includes user failures, but
> also SE problems or Software installation problems)
>
> - I would like Grid to be "less alien" and behave like other systems. And I just think that's
> the way it should be.
>
>
> I can only say they are good reasons for me. They're probably not too important to everyone,
> otherwise this would already be a feature in CreamCE.
>
>
> Besides, if this creates a problem with the WMS, maybe the discussion is over.
>
>
> BR/Pablo
>
>
>
> On Monday 12 September 2011 07:16:57 you wrote:
> > I still don't understand:
> >
> > - if it is "just" a matter of easily exposing that information to the site
> > admin, so that she can investigate possible problems at site level
> >
> > - if jobs that exited with an error code <> 0 (while there were no
> > other problems) should be considered failed jobs (like jobs that failed because
> > e.g. the submission to the batch system failed, or because e.g. the
> > transfer of the sandbox files failed, etc). And in this case e.g. the
> > resubmission through the WMS, if enabled, should be done. I don't
> > think this can be a configurable behavior at site level ...
> >
> > Cheers, Massimo
> >
> > On Sun, 11 Sep 2011, Pablo Fernandez wrote:
> > > Hi,
> > >
> > > >> these days (at our site) most of the user-level (payload) errors have
> > > >> nothing to do with the worker node or cluster itself. common problems:
> > > >>
> > > >> a storage element somewhere is not responding
> > > >> something is wrong with the VO-installed software
> > > >> user error (job just crashes due to programming errors)
> > >
> > > Actually, from the list you've given, the first two items may be local
> > > sysadmin business... on the third there is little we can do.
> > >
> > > I still don't see the reason for masking... is it WMS resubmission? If
> > > so, the only case I see for not resubmitting is the last; the other
> > > two may have been temporary stuff, timeouts...
> > >
> > > I am also of the opinion that Grid should work as close to Unix as
> > > possible, and this seems to be an effort in the opposite direction.
> > >
> > > BR/Pablo
> > >
> > > >> if it were true that most payload errors were due to site problems, i'd
> > > >> agree with the approach. making it configurable is always okay as long
> > > >> as the configuration does not lead to lots of complexity. which in
> > > >> itself is another source of error.
> > > >>
> > > >> JT
> > > >>
> > > >> On Sep 10, 2011, at 23:49, Maarten Litmaath wrote:
> > > >>> Ciao Massimo,
> > > >>>
> > > >>>> First of all: there isn't anything different wrt the LCG-CE. Also for
> > > >>>> the LCG-CE the exit code that you see in the pbs log file is the one
> > > >>>> of the job wrapper (jw), and not the one of the user job, because it
> > > >>>> is the jw that is executed in the batch system.
> > > >>>> As I said, the jobwrapper is a script. Oversimplifying it, it is
> > > >>>> something like:
> > > >>>>
> > > >>>> #!/bin/sh
> > > >>>> <prepare execution env in WN>
> > > >>>> <get ISB>
> > > >>>> <run user job>
> > > >>>> <put OSB>
> > > >>>>
> > > >>>> If this script runs properly, it returns 0 as exit code, and not the
> > > >>>> exit code of the user job. Again there is the very same scenario in
> > > >>>> the jw used for the LCG-CE.
> > > >>>> A value different from 0 means that there was a problem in the
> > > >>>> execution of the job wrapper (e.g. a problem with sandbox transfers)
> > >>>
>
> > >>> That is the traditional view indeed.
>
> > >>>
>
> > >>>> User job exit code is not hidden: it is returned in
>
> > >>>> glite-ce-job-status output, in wms-job-status, in wms-logging-info.
>
> > >>>> It was supposed to be reported also in the glite-ce-cream.log:
>
> > >>>> investigating why this is not the case.
>
> > >>>>
>
> > >>>> The management of jobs finished with an exit code <> 0 is something
>
> > >>>> that was discussed several years ago, in the days of Datagrid. It was
>
> > >>>> decided that they should consider as successfully done (so e.g. the
>
> > >>>> WMS shouldn't trigger a resubmission) but the exit code <> 0 should
>
> > >>>> be returned to the user so she can investigate.
>
> > >>>
>
> > >>> Even that could be discussed again: since the payload may have failed
>
> > >>> due to a problem with the site (e.g. full file system), a resubmission
>
> > >>> could be desirable if the JDL allows it. We may want to be careful
>
> > >>> there and make that behavior depend on a new JDL attribute.
>
> > >>>
>
> > >>>> I don't fully understand what is the RFE here. To have the jw returns
>
> > >>>> with the user job exit code (so that this value is reported in the PBS
>
> > >>>> log file) ?
>
> > >>>
>
> > >>> Right. It would seem nice if:
>
> > >>>
>
> > >>> - the site admin could configure that behavior;
>
> > >>> - the WMS could still distinguish between job wrapper and payload
>
> > >>> problems.
>
> >
> > \|||/
> > -----------0oo----( o o )----oo0-------------------
> > (_)
> > INFN Sezione di Padova
> > Via Marzolo, 8
> > 35131 Padova - Italy E-mail: massimo.sgaravatto [at] pd.infn.it
> > Tel: ++39 0499677360 Skype: massimo.sgaravatto
> > Fax: ++39 0498275952
>