> We see this problem sometimes with jobs whose proxy has expired before
> they start running, so they start but can't do anything so get re-queued
> in torque into the W state.
I don't know how to check which proxy one of these odd-state jobs has
(advice welcome), but yes, they were queued a few days ago. For some reason
the WNs have been full of very long-running jobs, so they had to wait.
Job: 102403.lcgce03.phy.bris.ac.uk
06/14/2010 01:43:21 S enqueuing into medium, state 1 hop 1
06/14/2010 01:43:21 S Job Queued at request of [log in to unmask], owner =
[log in to unmask], job name = STDIN, queue = medium
06/16/2010 01:25:53 S Job Modified at request of [log in to unmask]
06/16/2010 01:25:53 S Job Run at request of [log in to unmask]
06/16/2010 01:25:53 S MOM rejected modify request, error: 15001
06/16/2010 04:26:40 S Job Run at request of [log in to unmask]
06/16/2010 04:26:40 S MOM rejected modify request, error: 15001
06/16/2010 04:56:46 S Job Run at request of [log in to unmask]
06/16/2010 04:56:46 S MOM rejected modify request, error: 15001
diagnose -j is saying things like:
WARNING: job '102403' has failed to start 11 times
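
In case it helps compare several of these jobs, here is a rough sketch (my own, not a Torque or Maui tool) that tallies run attempts and MOM rejection codes from pasted tracejob text. It assumes the line layout shown above (date, time, one-letter source, message). The function name summarize and the SAMPLE string are purely illustrative.

```python
import re
from collections import Counter

# A few lines copied from the tracejob output above
# (addresses are already masked by the list archive).
SAMPLE = """\
06/14/2010 01:43:21 S enqueuing into medium, state 1 hop 1
06/16/2010 01:25:53 S Job Modified at request of [log in to unmask]
06/16/2010 01:25:53 S Job Run at request of [log in to unmask]
06/16/2010 01:25:53 S MOM rejected modify request, error: 15001
06/16/2010 04:26:40 S Job Run at request of [log in to unmask]
06/16/2010 04:26:40 S MOM rejected modify request, error: 15001
06/16/2010 04:56:46 S Job Run at request of [log in to unmask]
06/16/2010 04:56:46 S MOM rejected modify request, error: 15001
"""

# Assumed layout: "MM/DD/YYYY HH:MM:SS <source letter> <message>"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(\S)\s+(.*)$")
ERROR_RE = re.compile(r"MOM rejected modify request, error: (\d+)")


def summarize(text):
    """Return (number of 'Job Run' lines, {error code: count})."""
    run_attempts = 0
    errors = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip headers and wrapped continuation lines
        message = match.group(3)
        if message.startswith("Job Run at request of"):
            run_attempts += 1
        rejected = ERROR_RE.search(message)
        if rejected:
            errors[int(rejected.group(1))] += 1
    return run_attempts, dict(errors)


if __name__ == "__main__":
    attempts, errors = summarize(SAMPLE)
    print("run attempts:", attempts)
    print("MOM rejections by error code:", errors)
```

This only counts what tracejob prints, so it shows how often the server tried to start the job and was rejected. It does not say anything about the proxy itself.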
> Are the exec_host for the W jobs all the same?
tracejob doesn't show any exec_host for the W jobs at all. Is there some
other way to check? The pbs_server logs just log it as :Q:
Very grateful for any advice!