Yes, we see this delay (sometimes indefinite) for multiple users over
multiple WMS nodes (plus condor). We've disabled globus-gma to see if the
old grid_monitor does any better. It would be useful if there were some
docs describing the various scripts involved. Something isn't handling
the LRMS state correctly.
The globus-gma logs look fine, e.g.:
Thu Feb 5 14:17:46 2009:24440:Poll process 5496 terminated (0:1)
Thu Feb 5 14:17:46 2009:24440:Job https://fal-pygrid-44.lancs.ac.uk:20047/2947/1232446600/ updated, state 1
Cheers,
Peter
2009/2/5 Maarten Litmaath <[log in to unmask]>:
> Peter Love wrote:
>
>> We're having a problem with jobs submitted by both WMS and condor,
>> whereby the job runs OK but the WMS/condor state remains RUNNING
>> for a very long time. With the new stuff in lcg-CE 3.1, which
>> component should I be looking at? Any technical docs on lcg-CE?
>
> There is some documentation on the updates page:
>
> http://glite.web.cern.ch/glite/packages/R3.1/updates.asp
>
> Some of the options are documented in this ticket:
>
> https://gus.fzk.de/ws/ticket_info.php?ticket=35835
>
> Now, the problem you describe can happen due to various causes.
> Some ideas are suggested here (sic):
>
> http://goc.grid.sinica.edu.tw/gocwiki/Jobs_sent_to_some_CE_stay_in_Scheduled_state_forever
>
> Does the problem occur for every user? Multiple WMS nodes?
>
> Look for errors in the globus-gma logs in /opt/globus/var/log;
> you can increase the debug level in /opt/globus/etc/globus-gma.conf
> and restart the daemon.
>
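[The debugging steps suggested above can be sketched as a shell session. This is only a sketch: the exact debug option key in globus-gma.conf and the daemon's init script name are assumptions, not taken from the thread, so check the config file's own comments for the real names.]

```shell
# Find the actual debug/log-level option in the config file
# (the key name varies, so grep for it rather than guessing):
grep -i -E 'debug|log' /opt/globus/etc/globus-gma.conf

# Raise the debug level by editing that key, then restart the daemon.
# Both the key name ("debug") and the init script path are assumptions:
# sed -i 's/^debug=.*/debug=2/' /opt/globus/etc/globus-gma.conf
# /etc/init.d/globus-gma restart

# Watch the logs for LRMS polling errors around the stuck jobs:
tail -f /opt/globus/var/log/globus-gma*.log | grep -i -E 'error|fail'
```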