Hi Andrey....
Sorry for the late reply...
>> [dteam096@ce01-atlas ~]$ ps xuawww | grep globus-gma | wc -l
>> 11364
>>
>> This is completely new to me. Googling for it I see that there is an
>> open GGUS ticket but no solution provided yet
>> (https://gus.fzk.de/ws/ticket_info.php?ticket=42981) Reading the
>> history, I've checked that GLOBUS_GMA=true.
>>
>> I think this was preventing new jobs to arrive to the CE, and since
>> we are participating in an ATLAS test tomorrow, I decided to reboot
>> the machine (this is a dedicated ATLAS lcg-CE).
>> It fixed the issue for now but I'll like to understand why is it
>> happening...
>
> globus-gma zombies appear only if job polls time out (this is a bug
> and it is already fixed, but not released yet). It really means that
> your CE or batch system is overloaded with something. High load on NFS
> server with shared home directories may also cause such problems.
I do not see high load on either the SGE batch system or the lcg-CE.
Nothing is mounted via NFS on the lcg-CE. In fact, this only occurs on
one of our 3 CEs, the one dedicated to ATLAS. It may be related to a
problem we have seen with ATLAS pilot jobs: they start on our cluster
but exit almost immediately because they cannot find the necessary
ATLAS data sets. Do you think some kind of race condition could be
occurring?
>> Any hits?
>
> Please check /opt/globus/var/log/globus-gma.log file for timeout errors.
> If they are there, add 'tout 120' parameter to
> /opt/globus/etc/globus-gma.conf file and restart globus-gma.
These are the messages I see in /opt/globus/var/log/globus-gma.log. The
warnings only stop when I restart globus-gma:
Tue Dec 2 17:48:40 2008:16646:WARN: Killing hung poll process 9973
Tue Dec 2 17:49:10 2008:16646:WARN: Killing hung poll process 9978
Tue Dec 2 17:49:40 2008:16646:WARN: Killing hung poll process 10008
Tue Dec 2 17:50:10 2008:16646:WARN: Killing hung poll process 10094
Tue Dec 2 17:50:40 2008:16646:WARN: Killing hung poll process 10138
Tue Dec 2 17:51:10 2008:16646:WARN: Killing hung poll process 10175
Tue Dec 2 17:51:40 2008:16646:WARN: Killing hung poll process 10285
Tue Dec 2 17:52:11 2008:16646:WARN: Killing hung poll process 10309
Tue Dec 2 17:52:41 2008:16646:WARN: Killing hung poll process 10400
Tue Dec 2 17:53:11 2008:16646:WARN: Killing hung poll process 10430
Tue Dec 2 17:53:41 2008:16646:WARN: Killing hung poll process 10517
Tue Dec 2 17:54:11 2008:16646:WARN: Killing hung poll process 10541
Tue Dec 2 17:54:41 2008:16646:WARN: Killing hung poll process 10625
Tue Dec 2 17:55:11 2008:16646:WARN: Killing hung poll process 10649
Tue Dec 2 17:55:41 2008:16646:WARN: Killing hung poll process 10733
Tue Dec 2 17:56:11 2008:16646:WARN: Killing hung poll process 10759
Tue Dec 2 17:56:41 2008:16646:WARN: Killing hung poll process 10859
Tue Dec 2 17:57:11 2008:16646:WARN: Killing hung poll process 10895
Tue Dec 2 18:01:18 2008:16646:Terminating
Tue Dec 2 18:01:18 2008:11790:Initializing
Tue Dec 2 18:01:18 2008:11790:Loaded jobmanager lcgsge
Tue Dec 2 18:01:18 2008:11790:Loaded jobmanager fork
Tue Dec 2 18:01:18 2008:11790:Reading grid service jobmanager
Tue Dec 2 18:01:18 2008:11790:Reading config file /opt/globus/etc/globus-job-manager.conf
Tue Dec 2 18:01:18 2008:11790:Reading grid service jobmanager-fork
Tue Dec 2 18:01:18 2008:11790:Reading config file /opt/globus/etc/globus-job-manager.conf
Tue Dec 2 18:01:18 2008:11790:Reading grid service jobmanager-lcgsge
Tue Dec 2 18:01:18 2008:11790:Reading config file /opt/globus/etc/globus-job-manager.conf
Tue Dec 2 18:01:18 2008:11790:Job state directories: /opt/globus/tmp/gram_job_state
Tue Dec 2 18:01:18 2008:11790:Ready
Nevertheless, I have added the 'tout 120' parameter to
/opt/globus/etc/globus-gma.conf and restarted globus-gma to see whether
the situation improves.
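
In case it helps to compare before and after the change, here is a
minimal Python sketch for counting the "Killing hung poll process"
warnings and the gaps between them. The line format is only assumed
from the excerpt above, and the function names (parse_hung_polls,
intervals_seconds) are my own, not part of globus-gma:

```python
import re
from datetime import datetime

# Line format assumed from the log excerpt, e.g.
# "Tue Dec 2 17:48:40 2008:16646:WARN: Killing hung poll process 9973"
HUNG_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}):(?P<pid>\d+):"
    r"WARN: Killing hung poll process (?P<child>\d+)$"
)


def parse_hung_polls(lines):
    """Return (timestamp, gma_pid, killed_pid) for each hung-poll warning."""
    events = []
    for line in lines:
        m = HUNG_RE.match(line.strip())
        if not m:
            continue  # skip every other kind of log line
        # Collapse repeated spaces so single-digit days parse cleanly.
        # Day and month names are parsed in the current locale (English assumed).
        ts_text = " ".join(m.group("ts").split())
        ts = datetime.strptime(ts_text, "%a %b %d %H:%M:%S %Y")
        events.append((ts, int(m.group("pid")), int(m.group("child"))))
    return events


def intervals_seconds(events):
    """Seconds elapsed between consecutive hung-poll warnings."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(events, events[1:])
    ]
```

Passing the lines of /opt/globus/var/log/globus-gma.log to
parse_hung_polls() and then the result to intervals_seconds() should
show how often polls are being killed. In the excerpt above the
warnings arrive about every 30 seconds.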
Cheers
Gonçalo