Greetings,
A couple of tiny test jobs submitted to our local grid yesterday afternoon
have finished according to the WN, CE & RB logs, but edg-job-status still says
RUNNING (for 12 hours now):
BOOKKEEPING INFORMATION:
Status info for the Job : https://lcgrb01.gridpp.rl.ac.uk:9000/G4in9lhRFQaM3ux94LFsIQ
Current Status: Running
Status Reason: unavailable
reached on: Wed Sep 19 15:46:24 2007
PBS on CE knows it's done:
Job: 123542.lcgce01.phy.bris.ac.uk
09/19/2007 16:42:59 S enqueuing into short, state 1 hop 1
09/19/2007 16:45:57 S Job Run at request of [log in to unmask]
09/19/2007 16:47:09 S dequeuing from short, state COMPLETE
Getting the logging info for the jobs shows DONE at the bottom:
Event: Done
- exit_code = 0
- timestamp = Wed Sep 19 15:46:52 2007
RB log agrees:
000 (495573.000.000) 09/19 16:42:03 Job submitted from host: <130.246.183.184:33410> (https://lcgrb01.gridpp.rl.ac.uk:9000/G4in9lhRFQaM3ux94LFsIQ)
017 (495573.000.000) 09/19 16:42:19 Job submitted to Globus
001 (495573.000.000) 09/19 16:48:20 Job executing on host: lcgce01.phy.bris.ac.uk
005 (495573.000.000) 09/19 16:49:39 Job terminated.
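For anyone who wants to run the same check over a whole RB log rather than by eye, here is a rough Python sketch. It assumes the log lines look exactly like the excerpt above (three-digit event code, then the job id in parentheses, then date and time) and that event 005 means "terminated", as it does in that excerpt. The function names are made up for illustration and are not part of any LCG or Condor tool.

```python
import re

# Event code that the excerpt above labels "Job terminated."
TERMINATED = "005"

# Matches lines such as:
#   005 (495573.000.000) 09/19 16:49:39 Job terminated.
LINE_RE = re.compile(
    r"^(\d{3}) \((\d+)\.\d+\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d) "
)

def last_event_per_job(lines):
    """Return {cluster_id: (event_code, timestamp)} for the last event seen."""
    latest = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            code, cluster, stamp = m.groups()
            latest[cluster] = (code, stamp)
    return latest

def terminated_jobs(lines):
    """Cluster ids whose most recent logged event is 005 (terminated)."""
    latest = last_event_per_job(lines)
    return sorted(c for c, (code, _) in latest.items() if code == TERMINATED)
```

Feeding it the lines of the RB log file and comparing the result with the jobs that edg-job-status still reports as Running would list the jobs stuck in this state.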
Of course, since the status is not DONE, trying to retrieve the output to
debug what is wrong with the job doesn't work:
**** Warning: NS_JOB_OUTPUT_NOT_READY ****
The OutputSandbox files for job "https://lcgrb01.gridpp.rl.ac.uk:9000/G4in9lhRFQaM3ux94LFsIQ"
are not yet ready for retrieval. Please wait that the job enters the "Done" status.
Is this a common problem?
Old ROLLOUT posts suggest unsynchronised clocks might be the cause, but i) all
other jobs are running fine; ii) ntp seems fine.
How does 'the system' get nudged so that the various parts that know the job
is DONE push whatever is clinging to status=RUNNING to update itself to DONE,
so the job output can be retrieved for debugging?
There's nothing obvious in the LCG troubleshooting guide, pointers welcome!