Hi Kashif,
I think I have discovered why jobs stay in the running state:
your CE cannot query the state of individual jobs!
Example:
-------------------------------------------------------------------------------
[egop010@ngsce-test ~]$ echo sleep 123 | qsub
563442.ngs
-------------------------------------------------------------------------------
[egop010@ngsce-test ~]$ qstat -f 563442.ngs
Cannot connect to specified server host 'ngs'.
qstat: cannot connect to server ngs (errno=111)
-------------------------------------------------------------------------------
[egop010@ngsce-test ~]$ qstat -f 563442
qstat: Unknown Job Id 563442.master.beowulf.cluster
-------------------------------------------------------------------------------
[egop010@ngsce-test ~]$ qstat -a | grep 563442.ngs
563442.ngs egop010 workq STDIN 1673 1 1 -- 48:00 R 00:00
-------------------------------------------------------------------------------
In /opt/globus/lib/perl/Globus/GRAM/JobManager/pbs.pm there is this line:
$_ = (grep(/job_state/, $self->pipe_out_cmd($qstat, '-f', $job_id)))[0];
On your CE that command fails, so the job manager never gets a job_state
and the job is never seen to make any progress.
You should fix the setup to allow "qstat -f job_ID" to succeed somehow.
It fails because connections from the CE to port 15001 on "ngs" are refused.
In principle pbs.pm could be modified to resort to "qstat -a | grep" instead,
but that would seem a hack.
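
For illustration only, here is a rough shell sketch of what that "qstat -a | grep"
fallback would amount to. The function name job_state_from_qstat_a is made up, and
it assumes the column layout shown in the example above, where the state letter is
the tenth whitespace-separated field and the full job ID appears in the first column.
It is not what pbs.pm does today.

```shell
# Hypothetical helper (not part of Globus or PBS): read "qstat -a" output
# on stdin and print the state letter of the given job, if it is listed.
# Assumption: the state is field 10, as in the example output above.
job_state_from_qstat_a() {
    job_id="$1"
    # Match the job ID exactly against the first column, then stop.
    awk -v id="$job_id" '$1 == id { print $10; exit }'
}
```

On the CE it would be used as "qstat -a | job_state_from_qstat_a 563442.ngs".
It prints nothing if the job is not listed, and it would break if qstat shortens
the job ID in the first column, which is one more reason to fix the setup instead.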