[log in to unmask] wrote:
> Hi Tomas,
>
>> [...]
>> ---
>> Event: Done
>> - Arrived = Wed Jun 18 11:48:10 2008 CEST
>> - Exit code = 0
>> - Host = skurut10-2.egee.cesnet.cz
>> - Source = LRMS
>> - Status code = OK
>> - Timestamp = Wed Jun 18 11:48:10 2008 CEST
>
> So, the job is reported as Done by the job wrapper itself,
> but the LogMonitor daemon on the WMS does not see that state
> reported by the grid_monitor running on the CE.
>
> This could have various causes. For example, the batch system
> may keep reporting the job as running, even after it finished.
> I found your Torque configuration causes completed jobs to be
> reported in the 'C' state: for how long?
For about half a day. The job stays in Running even after it disappears
from the qstat output, so this shouldn't be the issue (see below).
>
> The "pbs" and "lcgpbs" job managers simply ignore that state
> and wait for the job to disappear from the "qstat" output:
> did you change the "lcgpbs" job manager in that respect?
Yes, both job managers are patched to understand the 'C' state:
--- /opt/globus/lib/perl/Globus/GRAM/JobManager/pbs.pm.orig 2008-06-16 16:04:46.000000000 +0200
+++ /opt/globus/lib/perl/Globus/GRAM/JobManager/pbs.pm 2008-06-16 16:04:58.000000000 +0200
@@ -680,6 +680,10 @@
         {
             $state = Globus::GRAM::JobState::ACTIVE;
         }
+        elsif(/C/)
+        {
+            $state = Globus::GRAM::JobState::DONE;
+        }
         else
         {
             # This else is reached by an unknown response from pbs.
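
For illustration, the effect of the patch can be sketched as a standalone
state mapping. This is only a minimal sketch, not the actual pbs.pm code:
the helper name map_pbs_state is hypothetical, plain strings stand in for
the Globus::GRAM::JobState constants, and the 'R' branch is an assumption
about the condition that precedes the patched lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of pbs.pm): maps one qstat state letter
# to a job state name. Strings replace the Globus::GRAM::JobState
# constants so the sketch runs without Globus installed.
sub map_pbs_state {
    my ($letter) = @_;
    return 'ACTIVE' if $letter =~ /^R$/;   # running (assumed existing branch)
    return 'DONE'   if $letter =~ /^C$/;   # completed, the case the patch adds
    return;                                # unknown response from pbs
}

for my $letter (qw(R C X)) {
    my $state = map_pbs_state($letter);
    printf "%s -> %s\n", $letter, defined $state ? $state : 'unknown';
}
```

Note that the sketch anchors each match to the whole letter, whereas the
patch uses an unanchored /C/, which also matches any longer string that
contains a 'C'.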
--
Tomas Kouba
Institute of Physics, Academy of sciences of the Czech Republic