Hi Goncalo,
Thank you very much for your suggestions. I looked into the things
described in 1) and 2), but they shouldn't be the cause of the problem. I
checked the WMS SandBoxDir: the job output was there, as was the
Maradona.output, containing the job and wrapper exit statuses:
Take token:
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000003:LM=000000:LRMS=000004:APP=000000:LBS=000000
job exit status = 0
jw exit status = 0
One more thing: the gram_job_mgr_.log is continuously appending the
following every 10 seconds:
4/15 08:46:36 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
4/15 08:46:36 JMI: testing job manager scripts for type fork exist and
permissions are ok.
4/15 08:46:36 JMI: completed script validation: job manager type is fork.
4/15 08:46:36 JMI: in globus_gram_job_manager_poll()
4/15 08:46:36 JMI: local stdout filename =
/home/hungrid041/.globus/job/grid236.kfki.hu/18805.1239776664/stdout.
4/15 08:46:36 JMI: local stderr filename = /dev/null.
4/15 08:46:36 JMI: poll: seeking:
https://grid236.kfki.hu:20004/18805/1239776664/
4/15 08:46:36 JMI: poll_fast: Monitoring file
/opt/globus/tmp/grid_manager_monitor_agent_log.20240 looks out of date.
4/15 08:46:36 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl
scripts)
4/15 08:46:36 JMI: cmd = poll
4/15 08:46:36 JMI: returning with success
Wed Apr 15 08:46:36 2009 JM_SCRIPT: New Perl JobManager created.
Wed Apr 15 08:46:36 2009 JM_SCRIPT: Using jm supplied job dir:
/home/hungrid041/.globus/job/grid236.kfki.hu/18805.1239776664
Wed Apr 15 08:46:36 2009 JM_SCRIPT: polling job 18820
4/15 08:46:36 JMI: while return_buf = GRAM_SCRIPT_JOB_STATE = 2
4/15 08:46:36 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
4/15 08:46:46 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
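For what it's worth, here is a minimal sketch of the staleness check that
poll_fast appears to be applying to the monitoring file. The mtime-based
check and the 60-second threshold are my assumptions, not taken from the
Globus source; it runs against a temporary file so it is self-contained:

```shell
# Assumption: poll_fast declares the grid monitor agent log "out of date"
# when its mtime is too old; the 60 s threshold here is illustrative only.
f=$(mktemp)                                  # stand-in for the monitor log
touch -d '5 minutes ago' "$f" 2>/dev/null || touch "$f"   # GNU touch
age=$(( $(date +%s) - $(stat -c %Y "$f") ))  # seconds since last write
echo "monitor log age: ${age}s"
if [ "$age" -gt 60 ]; then
    echo "looks out of date -> poll_fast would fall back to the Perl scripts"
fi
rm -f "$f"
```

The same age check could be run against
/opt/globus/tmp/grid_manager_monitor_agent_log.20240 on the CE to see how
stale that file actually is.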
I suppose the problem is at the WMS level.
About the glite-lb problem:
The output of rpm -qa | grep glite-lb on the CE is:
glite-lb-logger-1.4.10-1.slc4
glite-lb-common-6.1.2-1.slc4
glite-lb-client-3.2.1-1.slc4
glite-lb-client-interface-3.2.0-1.slc4
and on the WN:
glite-lb-common-6.1.2-1.slc4
glite-lb-client-3.2.1-1.slc4
glite-lb-client-interface-3.2.0-1.slc4
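For completeness, here is a self-contained way to diff the two package
lists above (this just replays the rpm output quoted in this mail):

```shell
# Package lists copied verbatim from the CE and WN output above.
ce_pkgs='glite-lb-logger-1.4.10-1.slc4
glite-lb-common-6.1.2-1.slc4
glite-lb-client-3.2.1-1.slc4
glite-lb-client-interface-3.2.0-1.slc4'
wn_pkgs='glite-lb-common-6.1.2-1.slc4
glite-lb-client-3.2.1-1.slc4
glite-lb-client-interface-3.2.0-1.slc4'
# comm -23 prints lines only in the first (sorted) input: CE-only packages.
ce_only=$(comm -23 <(sort <<<"$ce_pkgs") <(sort <<<"$wn_pkgs"))
echo "$ce_only"   # → glite-lb-logger-1.4.10-1.slc4
```

The only difference is glite-lb-logger on the CE; the packages the two
hosts share match versions exactly, so there is no obvious version skew
between CE and WN.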
Thanks in advance.
Cheers,
Bence
On Tue, 14 Apr 2009, Gonçalo Borges wrote:
> Hi Somhegyi...
>
> From the symptoms you describe, the fact that you are running SGE on your CE
> is irrelevant. Either there is an inconsistency in the gLite middleware at the
> CE level or at the WMS level.
>
> 1) First of all, check that ntpd is up and running in your cluster and that
> your machines are in sync. I've seen inconsistency problems in several
> interacting grid services caused by small time shifts.
>
> 2) The notification to the WMS, by default, takes a long time. The CE is
> tuned for long production jobs. You may want to modify the globus-gma
> configuration to make job state updates faster. Add the following parameters
> to /opt/globus/etc/globus-gma.conf:
>
> tick 30
> stateage 30
>
> This tells globus-gma to refresh the job list and job states every 30 seconds.
> But for a production CE with high load I would really set both parameters to
> higher values, like 100/300 or 300/600 (the default).
>
> 3) Try to submit using a different WMS...
>
> Cheers
> Goncalo
>