I looked into what PBS was doing and found that the convention is: if the exit code is higher than 127, one should look at the low-order bits to find out which Unix signal terminated the job. For the famous exit code 271,
271 & 127 = 15 (SIGTERM)
What I could not find out is why various higher-order bits are sometimes set. One indication is that the "128" bit is set if a core was dumped. But why the "256" bit would be set, I have no idea.
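For reference, the low-order-bit decoding can be sketched in a few lines of Python. This is only an illustration of the masking described here, not what PBS itself does internally; the function name decode_status is made up, and the meaning of the leftover high bits is deliberately left uninterpreted.

```python
import signal


def decode_status(status):
    """Decode a batch-system exit status using the 'low 7 bits' convention.

    Returns None for statuses up to 127 (treated as a plain exit code),
    otherwise a tuple (signal_number, signal_name, extra_bits).
    decode_status is a hypothetical helper, not part of any PBS tooling.
    """
    if status <= 127:
        return None
    signum = status & 127          # low-order 7 bits: the signal number
    extra_bits = status & ~127     # whatever higher bits were set (128, 256, ...)
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = None                # not a signal known to this platform
    return signum, name, extra_bits


if __name__ == "__main__":
    for status in (0, 139, 271):
        print(status, decode_status(status))
```

With this, 271 splits into signal 15 plus the "256" bit, and the shell-style 139 splits into signal 11 plus the "128" bit.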
Takeaway: please document well whatever convention and choice is made.
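As a sketch of what a documented convention could look like in a job wrapper: the codes 120 and 121 are the ones proposed further down in this thread, while the clash code 122 and all the names are assumptions made up for illustration, following the gLExec idea of reserving one code for clashes.

```python
# Hypothetical reserved wrapper exit codes, for illustration only.
SANDBOX_FAILED = 120   # could not get the sandbox (proposed in the thread)
ENV_FAILED = 121       # could not set the environment on the WN (proposed in the thread)
CLASH = 122            # assumed: user code collided with a reserved code

RESERVED = {SANDBOX_FAILED, ENV_FAILED, CLASH}


def wrapper_exit_code(user_code, wrapper_error=None):
    """Pick the exit code the wrapper hands back to the batch system.

    A wrapper-level failure takes precedence over the user's code.
    Otherwise the user's code passes through unchanged, unless it
    collides with a reserved code, in which case CLASH is reported.
    """
    if wrapper_error is not None:
        return wrapper_error
    if user_code in RESERVED:
        return CLASH
    return user_code


if __name__ == "__main__":
    print(wrapper_exit_code(0))                    # user job succeeded
    print(wrapper_exit_code(121))                  # user code collides with a reserved one
    print(wrapper_exit_code(3, SANDBOX_FAILED))    # wrapper failure wins
```

The point of the clash code is that a user job returning 121 can no longer be mistaken for a WN environment problem, at the price of losing the user's original value.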
JT
On Sep 14, 2011, at 14:19 , Pablo Fernandez wrote:
> Hehe, this is quite strange (and a bit off-topic already), but I found 34:
> http://www.cs.pitt.edu/~alanjawi/cs449/code/shell/UnixSignals.htm
> But looking a bit further, Wikipedia says 31, and Wikipedia is never wrong :)
>
>
> Returning to the original topic, the only way to avoid clashes is, I guess, to define a standard. And I think "Grid" is big enough to define its own standards. gLExec has already done so (applause), and I, from my mortal condition, beg other components to do the same.
>
> BR/Pablo
>
>
> On Wednesday 14 September 2011 14:08:45 you wrote:
> > Hi,
> >
> > you mean 31, right? The highest standard signal in POSIX and Linux is 31
> > (the actual meaning differs), but there are also realtime signals in a
> > basically unknown range, and I don't think there is a guarantee that an OS
> > cannot implement higher signals, only that it needs the ones up to 31.
> > See the signal(7) man page for details.
> > Furthermore, even apart from the signals, staying outside the known exit
> > code ranges is no guarantee against clashes, since you don't know what
> > others use: it's perfectly acceptable for a user job to return e.g. 196 or
> > something like that. So basically the best you can do is use something
> > large to minimize the risk (e.g. gLExec uses 201-204) and include one
> > code for clashes (gLExec uses 204).
> >
> > Cheers,
> >
> > Mischa
> >
> > On Wed, Sep 14, 2011 at 01:24:29PM +0200, Pablo Fernandez wrote:
> > > So, there are 34 Unix signals; this means Cream and other middleware
> > > pieces could use 163 to 254.
> > >
> > > On Wednesday 14 September 2011 11:24:17 you wrote:
> > > > Hi,
> > > >
> > > > for gLExec we implemented a similar scenario, but we included one
> > > > special exit code indicating a clash between that of the child and any
> > > > of the gLExec exit codes.
> > > >
> > > > Furthermore, it's also nice to keep in mind that most shells return
> > > > 128+n for a child exited via a signal, where n is the signal number (so
> > > > 139 for a SEGV). See http://tldp.org/LDP/abs/html/exitcodes.html for
> > > > bash documentation on exit codes.
> > > >
> > > > Cheers,
> > > >
> > > > Mischa
> > > >
> > > > On Tue, Sep 13, 2011 at 04:32:04PM +0100, Alessandra Forti wrote:
> > > > > For C (i.e. bash), 271 = 15, the remainder of 271/256. Even if you use big
> > > > > numbers the mechanism is the same.
> > > > >
> > > > > cheers
> > > > >
> > > > > alessandra
> > > > >
> > > > > On 13/09/2011 16:25, Pablo Fernandez wrote:
> > > > >
> > > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > > > Instead, there are things that can be done. Cream introduces
> > >
> > > the
> > >
> > > > > grid
> > > > >
> > > > > > > authentication layer, and a couple of services, and then the
> > >
> > > job
> > >
> > > > > > > wrapper... so you can create new exit codes for those: - Can't
> > >
> > > get
> > >
> > > > > Cream
> > > > >
> > > > > > > sandbox: exit code 120
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > - Can't set environment in WN: exit code 121
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > And then leave the job return to pbs what it wants to return,
> > >
> > > and
> > >
> > > > > we, as
> > > > >
> > > > > > > sysadmins, can tell: - If the job was successful: exit code 0
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > - If there was an unknown failure: exit code between 1 and 119
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > - If there was a problem with the sandbox: exit code 120
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > - If there was a problem with the WN env: exit code 121
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > - If there was a problem with the batch system: exit code
> > > > >
> > > > > 127/271/etc.
> > > > >
> > > > > > > depending on your batch system. This we already have.
> > > > > >
> > > > > > > not clear what you mean here, you talk about exit codes, leave job return
> > > > > > > to pbs, we sysadmins can tell, ???? what are you suggesting? are you
> > > > > > > making use of two error codes as David suggested, or are you only
> > > > > > > referring to the normal unix job exit status as reported by PBS? does
> > > > > > > this mean that all users will be forced to constrain their error codes
> > > > > > > between 1 and 119? How will you enforce this? what will happen if a user
> > > > > > > code decides to throw exit status 121?
> > > > >
> > > > > > Just one exit code. What I mean is that the Job Wrapper could let the
> > > > > > user exit code pass through (to pbs), and that the Job Wrapper should
> > > > > > have pre-defined exit codes itself, that apply with bigger preference.
> > > > > > And yes, there would be a need to define those exit codes clearly, so
> > > > > > users don't use them (even though they could potentially do). Same
> > > > > > applies regularly in PBS. If a job exits with a 271 (if possible,
> > > > > > isn't the maximum 255?) then there's no way to distinguish that from
> > > > > > PBS issues.
> > > > > >
> > > > > BR/Pablo
>
>