Hi Bence,
Can you check the poll subroutine in the job manager script
/opt/globus/lib/perl/Globus/GRAM/JobManager/lcgsge.pm?
I've seen cases where differing SGE versions lead to the poll returning
a running state when the job has actually finished.
From the gram_job_mgr_.log you attached, it looks to me like this may
be what is happening here.
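
If it helps, below is a very rough sketch of the kind of check I mean.
It is not the code that actually ships in lcgsge.pm, and the qstat
handling (relying on the exit code of "qstat -j" rather than on its
version-specific message text) is only an assumption about your SGE
setup, so please treat it purely as an illustration:

    # Illustrative sketch only -- not the real lcgsge.pm poll().
    # Assumption: qstat is on PATH and exits non-zero once the job has
    # left the queue; qacct/exit-status handling is left out.
    use Globus::GRAM::JobState;

    sub poll
    {
        my $self        = shift;
        my $description = $self->{JobDescription};
        my $job_id      = $description->jobid();

        # Ask SGE whether it still knows about the job.  system()
        # returns the wait status, so shift to get the real exit code.
        my $rc = system("qstat -j $job_id >/dev/null 2>&1") >> 8;

        if ($rc != 0)
        {
            # Job no longer known to SGE: report it as finished instead
            # of matching a version-specific "job does not exist" text.
            return { JOB_STATE => Globus::GRAM::JobState::DONE };
        }

        # Job is still queued or running; a complete version would also
        # map the qstat state letters (qw, r, Eqw, ...) onto
        # PENDING/ACTIVE/FAILED.
        return { JOB_STATE => Globus::GRAM::JobState::ACTIVE };
    }
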
cheers
johnk
Somhegyi Bence wrote:
> Hi Goncalo,
>
> Thank you very much for your suggestions. I looked into the points
> described in 1) and 2), but they don't seem to be the cause of the
> problem. I checked the WMS SandBoxDir: the job output was there, as
> was the Maradona.output, containing the job and its wrapper status:
> Take token:
> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000003:LM=000000:LRMS=000004:APP=000000:LBS=000000
>
> job exit status = 0
> jw exit status = 0
>
>
> One more thing: the gram_job_mgr_.log keeps growing, with this block
> appended every 10 seconds:
> 4/15 08:46:36 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
> 4/15 08:46:36 JMI: testing job manager scripts for type fork exist and
> permissions are ok.
> 4/15 08:46:36 JMI: completed script validation: job manager type is fork.
> 4/15 08:46:36 JMI: in globus_gram_job_manager_poll()
> 4/15 08:46:36 JMI: local stdout filename =
> /home/hungrid041/.globus/job/grid236.kfki.hu/18805.1239776664/stdout.
> 4/15 08:46:36 JMI: local stderr filename = /dev/null.
> 4/15 08:46:36 JMI: poll: seeking:
> https://grid236.kfki.hu:20004/18805/1239776664/
> 4/15 08:46:36 JMI: poll_fast: Monitoring file
> /opt/globus/tmp/grid_manager_monitor_agent_log.20240 looks out of date.
> 4/15 08:46:36 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl
> scripts)
> 4/15 08:46:36 JMI: cmd = poll
> 4/15 08:46:36 JMI: returning with success
> Wed Apr 15 08:46:36 2009 JM_SCRIPT: New Perl JobManager created.
> Wed Apr 15 08:46:36 2009 JM_SCRIPT: Using jm supplied job dir:
> /home/hungrid041/.globus/job/grid236.kfki.hu/18805.1239776664
> Wed Apr 15 08:46:36 2009 JM_SCRIPT: polling job 18820
> 4/15 08:46:36 JMI: while return_buf = GRAM_SCRIPT_JOB_STATE = 2
> 4/15 08:46:36 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
> 4/15 08:46:46 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
>
> I suppose the problem is at the WMS level.
>
> About the glite-lb problem:
> The output of rpm -qa | grep glite-lb on the CE is:
> glite-lb-logger-1.4.10-1.slc4
> glite-lb-common-6.1.2-1.slc4
> glite-lb-client-3.2.1-1.slc4
> glite-lb-client-interface-3.2.0-1.slc4
>
> and on the WN:
> glite-lb-common-6.1.2-1.slc4
> glite-lb-client-3.2.1-1.slc4
> glite-lb-client-interface-3.2.0-1.slc4
>
> Thanks in advance.
>
> Cheers,
> Bence
>
>
> On Tue, 14 Apr 2009, Gonçalo Borges wrote:
>
>> Hi Somhegyi...
>>
>> From the symptoms you describe, the fact that you are running SGE on
>> your CE is irrelevant. Either there is an inconsistency in the gLite
>> middleware at the CE level or at the WMS level.
>>
>> 1) First of all, check that ntpd is up and running in your cluster
>> and that your machines are in sync. I've seen inconsistencies in
>> several interacting grid services caused by small time shifts.
>>
>> 2) The notification to the WMS, by default, takes a long time, since
>> the CE is tuned for long production jobs. You may want to modify the
>> globus-gma configuration to make job state updates faster. Add the
>> following parameters to /opt/globus/etc/globus-gma.conf:
>>
>> tick 30
>> stateage 30
>>
>> This will tell globus-gma to refresh the job list and job states
>> every 30 seconds. But for a production CE with high load I would set
>> both parameters to higher values like 100/300 or 300/600 (the default).
>>
>> 3) Try to submit using a different WMS...
>>
>> Cheers
>> Goncalo
>>