Hi Winnie
I can't offer much help in solving this, but I've seen it too. It
particularly seemed to cripple Steve Lloyd's jobs sent through our
resource broker.
What seems to happen is that the gatekeeper's job manager process for
this job dies on the CE. If it dies very early, the job stays
"submitted" forever. If it manages to get the job to the batch system,
the job runs but, as far as the RB is concerned, stays "running"
forever, so the output can never be retrieved and the job is useless.
As I said, I don't understand why it happens. The one time I managed
to hook an strace onto the jobmanager, the bloody thing ran perfectly.
It seems to be somehow related to the RB used, and does appear to
depend on the user too. In other words it's a bugger to pin down.
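
If it helps with spotting affected jobs, here is a minimal sketch of the
comparison described above. It assumes you have already collected, by
hand, the status string the RB reports and the PBS server log lines for
the job. The function names, the result labels and the 6-hour threshold
are all made up for illustration; nothing here queries the RB or PBS.

```python
import re


def pbs_job_completed(pbs_log: str) -> bool:
    """True if the PBS log excerpt shows the job dequeued in state COMPLETE."""
    return re.search(r"dequeuing from \S+, state COMPLETE", pbs_log) is not None


def classify(rb_status: str, pbs_log: str,
             age_hours: float = 0.0, threshold_hours: float = 6.0) -> str:
    """Compare the RB's view of a job with the batch system's view.

    rb_status       -- status string as reported by the RB (e.g. "Running")
    pbs_log         -- PBS server log lines for the job ("" if none exist)
    age_hours       -- how long the job has been in its current RB status
    threshold_hours -- arbitrary cut-off (an assumption, tune to taste)
    """
    status = rb_status.strip().lower()
    reached_batch = "enqueuing into" in pbs_log

    # Job manager may have died before handing the job to the batch system.
    if status == "submitted" and not reached_batch and age_hours >= threshold_hours:
        return "suspect-submitted"

    # Batch system says the job finished, but the RB never noticed.
    if status == "running" and pbs_job_completed(pbs_log):
        return "suspect-running"

    return "consistent"
```

Fed the status and PBS lines from the message quoted below, classify()
would report "suspect-running". It only flags the mismatch; it does not
explain why the job manager died.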
Cheers
Graeme
On 20 Sep 2007, at 08:54, Winnie Lacesso wrote:
> Greetings,
>
> A couple of tiny test jobs submitted to our local grid yesterday
> afternoon are finished according to the WN, CE & RB logs, but
> edg-job-status still says RUNNING (12 hours now):
>
> BOOKKEEPING INFORMATION:
> Status info for the Job : https://lcgrb01.gridpp.rl.ac.uk:9000/G4in9lhRFQaM3ux94LFsIQ
> Current Status: Running
> Status Reason: unavailable
> reached on: Wed Sep 19 15:46:24 2007
>
> PBS on CE knows it's done:
> Job: 123542.lcgce01.phy.bris.ac.uk
> 09/19/2007 16:42:59 S enqueuing into short, state 1 hop 1
> 09/19/2007 16:45:57 S Job Run at request of [log in to unmask]
> 09/19/2007 16:47:09 S dequeuing from short, state COMPLETE
>
> Getting the logging info for the jobs show DONE at the bottom:
>
> Event: Done
> - exit_code = 0
> - timestamp = Wed Sep 19 15:46:52 2007
>
> RB log agrees:
> 000 (495573.000.000) 09/19 16:42:03 Job submitted from host: <130.246.183.184:33410>
> (https://lcgrb01.gridpp.rl.ac.uk:9000/G4in9lhRFQaM3ux94LFsIQ)
> 017 (495573.000.000) 09/19 16:42:19 Job submitted to Globus
> 001 (495573.000.000) 09/19 16:48:20 Job executing on host: lcgce01.phy.bris.ac.uk
> 005 (495573.000.000) 09/19 16:49:39 Job terminated.
>
> Of course since it's not DONE, trying to get output to debug what is
> wrong with the job doesn't work:
>
> **** Warning: NS_JOB_OUTPUT_NOT_READY ****
> The OutputSandbox files for job
> "https://lcgrb01.gridpp.rl.ac.uk:9000/G4in9lhRFQaM3ux94LFsIQ"
> are not yet ready for retrieval. Please wait that the job enters
> the "Done" status.
>
> Is this a common problem?
> Old ROLLOUT suggests time unsynchronization might be the problem but
> i) all other jobs are running fine; ii) ntp seems fine.
>
> How does 'the system' get nudged so that the various parts that know
> the job is DONE push whatever is clinging to status=RUNNING to update
> itself to DONE so the job output can be got for debug?
>
> There's nothing obvious in the LCG troubleshooting guide, pointers
> welcome!
--
Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/