Stijn De Weirdt wrote:
> hi maarten,
>
> i saw that error, but i thought the fact that it submitted the job a second time was the cause of this. (i still don't understand why it tried to do that.)
When a job fails (from the RB/WMS perspective), it is always sent back
to the Workload Manager daemon for resubmission. In this case the WM
found the max. number of resubmissions to be zero, so the job aborted.
> thanks anyway, (i'll check the proxies with the user)
You can also look on gridce.iihe.ac.be in the "gram_*.log" files in the
home directory of the account the DN got mapped to: they may show why
the state lock file could not be locked (e.g. "Permission denied").
To find the pool account(s) in this case:
--------------------------------------------------------------------------------
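# Get the inode of each gridmapdir entry matching the user's DN, then print
# every entry sharing that inode (the DN file and its hard-linked pool account).
for i in $(ls -li /etc/grid-security/gridmapdir/ | awk '/heyninck/ { print $1 }')
do
    ls -li /etc/grid-security/gridmapdir/ | awk -v inode="$i" '$1 == inode { print $NF }'
done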
--------------------------------------------------------------------------------
Note: every different set of VOMS attributes gets its own pool account!
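
Once the pool account is known, the log check suggested above could be sketched roughly as follows. This is only an illustration: the function name scan_gram_logs and the search patterns are assumptions, and the home directory has to be filled in by hand for the account found in the gridmapdir.

```shell
# Hypothetical helper: scan the GRAM job manager logs in a pool account's
# home directory for lines hinting at why the state lock file could not be
# locked. The patterns are an assumption, not an exhaustive list.
scan_gram_logs() {
    home_dir=$1
    for f in "$home_dir"/gram_*.log
    do
        [ -f "$f" ] || continue    # the glob matched nothing: skip
        # -H prints the file name, -i ignores case, -E enables alternation
        grep -H -i -E 'permission denied|lock' "$f"
    done
}
```

It would be run on the CE as root (or as the pool account itself), passing that account's home directory as the only argument.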
>
> stijn
>
> Maarten Litmaath wrote:
>
>>Stijn De Weirdt wrote:
>>
>>
>>>hi all,
>>>
>>>i have a problem with some jobs and i can't find the cause.
>>>
>>>the jobs run and do their work (ie write some output file to the
>>>storage, indicating correct execution) BUT the job status looks strange
>>>(full output below).
>>>at the end no job output can be retrieved and if i look on the RB (LCG
>>>part of glite3), no input or output sandbox seem to exist (anymore).
>>>
>>>basically the logging gives statuses with wrong time order but actually
>>>seems to do whatever is listed:
>>>
>>>short overview of status:
>>>accepted/enqueued/../transfer (and probably waiting in queue on CE from
>>>then on, for approx 5 hours)
>>>running (starts running for 6 hours, as expected)
>>>done (exit code 0!)
>>>accepted/transfer (with timestamp of first accepted/../transfer
>>>sequence, ie of 10 hours earlier)
>>>running (9 hours later, probably waiting in queue)
>>>done (2 minutes later, exit status 1!)
>>>resubmission
>>>abort (because Job RetryCount (0) hit).
>>>
>>>i assumed that the status "Done/exitcode 0" meant a successful job?
>>>(i already checked the clock on the RB/CE/UI/WN, they are all ok (as
>>>expected from ntpd)).
>>>
>>>any hints to look for this? (unless i'm missing something here)
>>
>>Did you see this error:
>>
>>
>>>- [...]
>>> ---
>>> Event: Done
>>>- exit_code = 1
>>>- host = laranja.iihe.ac.be
>>>- level = SYSTEM
>>>- priority = asynchronous
>>>- reason = Got a job held event, reason: Globus
>>>error 158: the job manager could not lock the state lock file
>>
>>That typically happens if the user's DN got mapped differently between
>>the time the job was submitted and the time it was cleaned up:
>>
>>1. The DN may have got added to or removed from the list of "sgm" or
>>"prd" users
>> (pool account <--> sgm/prd account)
>>
>>2. The user used _different_ proxies for jobs sent to the same RB:
>> grid vs. VOMS proxies, or VOMS proxies with vs. without a role, etc.
>> The RB cannot handle that. Always use the same proxy, otherwise
>> any unfinished jobs may be lost (allow at least 1 hour for any
>> active grid_monitor processes on the CEs to die out).
>>