[log in to unmask] wrote:
> Hi Tomas,
>
>>> So, the job is reported as Done by the job wrapper itself,
>>> but the LogMonitor daemon on the WMS does not see that state
>>> reported by the grid_monitor running on the CE.
>>>
>>> This could have various causes. For example, the batch system
>>> may keep reporting the job as running, even after it finished.
>>> I found your Torque configuration causes completed jobs to be
>>> reported in the 'C' state: for how long?
>> For about half a day. The job stays in Running even after it disappears
>> from qstat output (it shouldn't be this issue, see below).
>
> Can you increase the debug level in
> /opt/globus/etc/globus-job-manager-marshal.conf to 2 and send a SIGHUP
> to the globus-job-manager-marshal master process?
> Then look into /opt/globus/var/log/globus-job-manager-marshal.log
> for additional messages/warnings/errors from the job manager.
I have done so, checked the log, and straced globus-job-manager-marshal.
That helped me orient myself in the logs. I think the problem shows up in
gram_job_mgr_<ID>.log:
Thu Jun 19 09:29:17 2008 JM_SCRIPT: New Perl JobManager created.
Thu Jun 19 09:29:17 2008 JM_SCRIPT: Using jm supplied job dir: /home/dteam001/.globus/job/ce2.egee.cesnet.cz/1669.1213860305
Thu Jun 19 09:29:17 2008 JM_SCRIPT: polling job 1684
6/19 09:29:17 JMI: while return_buf = GRAM_SCRIPT_JOB_STATE = 2
6/19 09:29:17 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
6/19 09:29:27 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
6/19 09:29:27 JMI: testing job manager scripts for type fork exist and permissions are ok.
6/19 09:29:27 JMI: completed script validation: job manager type is fork.
6/19 09:29:27 JMI: in globus_gram_job_manager_poll()
6/19 09:29:27 JMI: local stdout filename = /home/dteam001/.globus/job/ce2.egee.cesnet.cz/1669.1213860305/stdout.
6/19 09:29:27 JMI: local stderr filename = /dev/null.
6/19 09:29:27 JMI: poll: seeking: https://ce2.egee.cesnet.cz:20002/1669/1213860305/
6/19 09:29:27 JMI: poll_fast: ******** Failed to find https://ce2.egee.cesnet.cz/1669/1213860305/
6/19 09:29:27 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
6/19 09:29:27 JMI: cmd = poll
6/19 09:29:27 JMI: returning with success
This snippet keeps repeating in the log every 10 seconds.
The grid_manager_monitor_agent_log indeed does not contain the job contact URL being sought; its content is:
1213863891 1213863891
https://ce2.egee.cesnet.cz:20005/30388/1213703208/ 1
https://ce2.egee.cesnet.cz:20007/20753/1213692787/ 1
https://ce2.egee.cesnet.cz:20008/25851/1213694320/ 1
GRIDMONEOF
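
To make this check repeatable, here is a minimal Python sketch that looks up a job contact URL in a dump like the one above. The file layout is only inferred from this excerpt (a first line with two timestamps, then "URL status" pairs, then a GRIDMONEOF terminator), so treat it as an assumption. The function name parse_agent_log and the SAMPLE and sought variables are hypothetical; the status column is kept as a plain integer and not interpreted.

```python
def parse_agent_log(text):
    """Parse an agent-log style dump into {job_contact_url: status}.

    Assumed layout (inferred from the excerpt, not from any specification):
    first non-empty line = two timestamps, then "URL status" lines,
    terminated by a GRIDMONEOF marker.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    jobs = {}
    for ln in lines[1:]:  # skip the timestamp line
        if ln == "GRIDMONEOF":
            break
        contact, status = ln.rsplit(None, 1)  # split off the last column
        jobs[contact] = int(status)
    return jobs


# Content copied from the log excerpt above.
SAMPLE = """1213863891 1213863891
https://ce2.egee.cesnet.cz:20005/30388/1213703208/ 1
https://ce2.egee.cesnet.cz:20007/20753/1213692787/ 1
https://ce2.egee.cesnet.cz:20008/25851/1213694320/ 1
GRIDMONEOF
"""

jobs = parse_agent_log(SAMPLE)

# The contact that the "poll: seeking:" line in gram_job_mgr_<ID>.log asks for.
sought = "https://ce2.egee.cesnet.cz:20002/1669/1213860305/"

print(sought in jobs)  # False: the sought job is not listed
print(sorted(jobs))    # the three contacts that are listed
```

Run against the real file instead of SAMPLE, this would show at a glance whether the contact from the "seeking" line ever appears in the agent log.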
> If that does not provide more clues, you can do the same with
> /opt/globus/etc/globus-gass-cache-marshal.conf and the
> globus-gass-cache-marshal master process, then look into
> /opt/globus/var/log/globus-gass-cache-marshal.log.
--
Tomas Kouba
Institute of Physics, Academy of Sciences of the Czech Republic