Stijn De Weirdt wrote:

> hi all,
> 
> i have a problem with some jobs and i can't find the cause.
> 
> the jobs run and do their work (ie write some output file to the storage, indicating correct execution) BUT the job status looks strange
> (full output below).
> at the end no job output can be retrieved and if i look on the RB (LCG part of glite3), no input or output sandbox seem to exist (anymore).
> 
> basically the logging gives statuses in the wrong time order, but the job actually seems to do whatever is listed:
> 
> short overview of status:
> accepted/enqueued/../transfer (and probably waiting in queue on CE from then on, for approx 5 hours)
> running (starts running for 6 hours, as expected)
> done (exit code 0!)
> accepted/transfer (with timestamp of first accepted/../transfer sequence, ie of 10 hours earlier)
> running (9 hours later, probably waiting in queue)
> done (2 minutes later, exit status 1!)
> resubmission
> abort (because Job RetryCount (0) hit).
> 
> i assumed that the status "Done/exitcode 0" meant a successful job?
> (i already checked the clock on the RB/CE/UI/WN, they are all ok (as expected from ntpd)).
> 
> any hints to look for this? (unless i'm missing something here)

Did you see this error:

> - [...]
>         ---
>  Event: Done
> - exit_code               =    1
> - host                    =    laranja.iihe.ac.be
> - level                   =    SYSTEM
> - priority                =    asynchronous
> - reason                  =    Got a job held event, reason: Globus error 158: the job manager could not lock the state lock file

That typically happens if the user's DN got mapped differently between
the time the job was submitted and the time it was cleaned up:

1. The DN may have got added to or removed from the list of "sgm" or "prd" users
    (pool account <--> sgm/prd account)

2. The user used _different_ proxies for jobs sent to the same RB:
    grid vs. VOMS proxies, or VOMS proxies with vs. without a role, etc.
    The RB cannot handle that.  Always use the same proxy, otherwise
    any unfinished jobs may be lost (allow at least 1 hour for any
    active grid_monitor processes on the CEs to die out).
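
To check cause 1, one option is to compare a saved copy of the CE's
grid-mapfile from submission time with the current one.  The Python below
is only a rough sketch under assumptions: it expects the usual
`"DN" account` line format, takes the first entry for a DN as the
effective one, and the function names (parse_grid_mapfile,
changed_mappings) are made up for this illustration.  How you obtain the
two snapshots depends on your site.

```python
import re

# Assumed line format: "<quoted DN>" <local account>, e.g.
#   "/C=BE/O=Example/CN=Some User" .dteam
ENTRY = re.compile(r'^\s*"([^"]+)"\s+(\S+)')


def parse_grid_mapfile(text):
    """Return {DN: account}; the first entry for a DN wins (assumption)."""
    mapping = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = ENTRY.match(line)
        if match:
            mapping.setdefault(match.group(1), match.group(2))
    return mapping


def changed_mappings(before_text, after_text):
    """Return {DN: (old, new)} for DNs whose account differs.

    A DN present in only one snapshot is reported with None on the
    other side.
    """
    before = parse_grid_mapfile(before_text)
    after = parse_grid_mapfile(after_text)
    changes = {}
    for dn in sorted(set(before) | set(after)):
        old, new = before.get(dn), after.get(dn)
        if old != new:
            changes[dn] = (old, new)
    return changes
```

Any DN this reports for the affected user (for example a switch between
a pool account and an sgm account) would fit the lock file error above.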