hi maarten,

i saw that error, but i thought the fact that it submitted the job a second time was the cause of this. (i still don't understand why it tried to do that.)

thanks anyway, (i'll check the proxies with the user)

stijn

Maarten Litmaath wrote:
> Stijn De Weirdt wrote:
> 
>> hi all,
>>
>> i have a problem with some jobs and i can't find the cause.
>>
>> the jobs run and do their work (i.e. they write some output file to the 
>> storage, indicating correct execution) BUT the job status looks strange
>> (full output below).
>> at the end no job output can be retrieved, and if i look on the RB (LCG 
>> part of glite3), no input or output sandbox seems to exist (anymore).
>>
>> basically the logging gives the statuses in the wrong time order, but 
>> the job actually seems to do whatever is listed:
>>
>> short overview of status:
>> accepted/enqueued/../transfer (and probably waiting in the queue on the 
>> CE from then on, for approx 5 hours)
>> running (runs for 6 hours, as expected)
>> done (exit code 0!)
>> accepted/transfer (with the timestamp of the first accepted/../transfer 
>> sequence, i.e. 10 hours earlier)
>> running (9 hours later, probably waiting in queue)
>> done (2 minutes later, exit status 1!)
>> resubmission
>> abort (because Job RetryCount (0) hit).
>>
>> i assumed that the status "Done/exitcode 0" meant a successful job?
>> (i already checked the clock on the RB/CE/UI/WN, they are all ok (as 
>> expected from ntpd)).
>>
>> any hints on where to look for this? (unless i'm missing something here)
> 
> Did you see this error:
> 
>> - [...]
>>         ---
>>  Event: Done
>> - exit_code               =    1
>> - host                    =    laranja.iihe.ac.be
>> - level                   =    SYSTEM
>> - priority                =    asynchronous
>> - reason                  =    Got a job held event, reason: Globus 
>> error 158: the job manager could not lock the state lock file
> 
> That typically happens if the user's DN got mapped differently between
> the time the job was submitted and the time it was cleaned up:
> 
> 1. The DN may have got added to or removed from the list of "sgm" or 
> "prd" users
>    (pool account <--> sgm/prd account)
> 
> 2. The user used _different_ proxies for jobs sent to the same RB:
>    grid vs. VOMS proxies, or VOMS proxies with vs. without a role, etc.
>    The RB cannot handle that.  Always use the same proxy, otherwise
>    any unfinished jobs may be lost (allow at least 1 hour for any
>    active grid_monitor processes on the CEs to die out).
>
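[Editor's note: to check point 2 above from a UI, the proxy identity and VOMS attributes (FQANs) used for different submissions can be compared; on an LCG/gLite UI, `voms-proxy-info -identity` and `voms-proxy-info -fqan` print them. A minimal sketch of the comparison, where the DNs and FQANs are made-up placeholders rather than values from this thread:]

```shell
# Sketch: spot a proxy mismatch between two submissions to the same RB.
# On a real UI the four values below would come from:
#   voms-proxy-info -identity   (the certificate DN)
#   voms-proxy-info -fqan       (the VOMS attributes, e.g. the role)
# All DNs/FQANs here are hypothetical placeholders.

id_submit="/DC=org/DC=example/CN=Some User"      # proxy at submission
id_retrieve="/DC=org/DC=example/CN=Some User"    # proxy at output retrieval
fqan_submit="/myvo/Role=NULL/Capability=NULL"
fqan_retrieve="/myvo/Role=lcgadmin/Capability=NULL"

if [ "$id_submit" = "$id_retrieve" ] && [ "$fqan_submit" = "$fqan_retrieve" ]
then
    echo "proxies consistent"
else
    # a differing role can map the DN to a different (sgm/prd) account
    echo "proxy mismatch: jobs may be mapped to different accounts"
fi
```

[With the placeholder values above, the FQANs differ because of the `Role=lcgadmin` attribute, which is exactly the grid vs. VOMS-role case Maarten describes.]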