hi maarten,

i saw that error, but i thought the fact that it submitted the job a
second time was the cause of this. (i still don't understand why it
tried to do that.)

thanks anyway (i'll check the proxies with the user),

stijn

Maarten Litmaath wrote:
> Stijn De Weirdt wrote:
>
>> hi all,
>>
>> i have a problem with some jobs and i can't find the cause.
>>
>> the jobs run and do their work (i.e. they write some output file to the
>> storage, indicating correct execution), BUT the job status looks strange
>> (full output below).
>> at the end no job output can be retrieved, and if i look on the RB (LCG
>> part of glite3), no input or output sandbox seems to exist (anymore).
>>
>> basically the logging gives statuses in the wrong time order but actually
>> seems to do whatever is listed:
>>
>> short overview of status:
>> accepted/enqueued/../transfer (and probably waiting in queue on CE from
>> then on, for approx 5 hours)
>> running (runs for 6 hours, as expected)
>> done (exit code 0!)
>> accepted/transfer (with timestamp of the first accepted/../transfer
>> sequence, i.e. of 10 hours earlier)
>> running (9 hours later, probably waiting in queue)
>> done (2 minutes later, exit status 1!)
>> resubmission
>> abort (because Job RetryCount (0) was hit).
>>
>> i assumed that the status "Done/exit code 0" meant a successful job?
>> (i already checked the clocks on the RB/CE/UI/WN; they are all ok (as
>> expected from ntpd)).
>>
>> any hints on where to look for this? (unless i'm missing something here)
>
> Did you see this error:
>
>> - [...]
>> ---
>> Event: Done
>> - exit_code = 1
>> - host = laranja.iihe.ac.be
>> - level = SYSTEM
>> - priority = asynchronous
>> - reason = Got a job held event, reason: Globus
>> error 158: the job manager could not lock the state lock file
>
> That typically happens if the user's DN got mapped differently between
> the time the job was submitted and the time it was cleaned up:
>
> 1. The DN may have been added to or removed from the list of "sgm" or
>    "prd" users (pool account <--> sgm/prd account).
>
> 2. The user used _different_ proxies for jobs sent to the same RB:
>    grid vs. VOMS proxies, or VOMS proxies with vs. without a role, etc.
>    The RB cannot handle that. Always use the same proxy, otherwise
>    any unfinished jobs may be lost (allow at least 1 hour for any
>    active grid_monitor processes on the CEs to die out).
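[editor's note: a minimal sketch of how the proxy check in point 2 could be done before each submission. With a real proxy one would typically run `voms-proxy-info -all` (or `grid-proxy-info -subject`) and compare the identity and VOMS attributes by eye; the helper below shows the same subject comparison using plain openssl on a PEM file. The `~/.last_proxy_subject` bookkeeping file is purely illustrative, not part of any gLite tool.]

```shell
# Print the certificate subject (DN) of a PEM file, e.g. the file that
# $X509_USER_PROXY points at.  The RB keys sandbox ownership to this
# identity, so it should not change between submissions to the same RB.
subject_of() {
    openssl x509 -noout -subject -in "$1"
}

# Hypothetical usage: warn if the identity differs from the one recorded
# at the previous submission (the state file name is made up here).
# subject_of "$X509_USER_PROXY" > /tmp/current_subject
# cmp -s /tmp/current_subject ~/.last_proxy_subject \
#     || echo "WARNING: proxy identity changed since last submission"
```

Note that a plain subject comparison does not catch VOMS proxies with vs. without a role; for that, the attribute lines of `voms-proxy-info -all` would also have to match.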