Hello Christoph,
On 03.06.2009 at 23:41, Christoph Wissing wrote:
>>> root@grid-ce2: [~] ps auxw | egrep globus-gma | wc -l
>>> 76
>>> root@grid-ce2: [~] ps auxw | egrep globus-gma | grep defunct | wc -l
>>> 71
>>
>> That probably is OK. The defunct processes are cleaned up at the end
>> of each main cycle. The problem with earlier versions was that some
>> processes were never cleaned up, so the list could grow steadily.
>
> Our experience says that the CE is basically stuck if there is such a
> high number of defunct globus-gma processes. According to the logs,
> every minute (can be steered in config) some hanging processes are
> getting killed. Users observe a large number of jobs in the status
> "scheduled" in WMS, but they do not arrive in the batch system. The
> only thing that helps is restarting the globus-gma.
On grid-ce2 I see a lot of jobs (> 12000) registered for user
ilcprd003. Is that normal?
Most of the hung globus-gma processes belong to this user. No surprise
that the poll cycle takes ages: in the worst case, with a 1-minute
timeout, it will take globus-gma 12000 minutes (more than a week) to dig
through the list of jobs for this user :)
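
To spell out that estimate, here is a rough back-of-the-envelope sketch.
It assumes the jobs are polled strictly one after another and that every
single status query runs into the timeout, which is a simplification; the
function name is made up for illustration and is not part of globus-gma.

```python
def worst_case_poll_minutes(n_jobs, timeout_minutes=1.0):
    """Upper bound on one poll cycle, in minutes.

    Assumption: jobs are queried sequentially and each query
    blocks for the full timeout before giving up.
    """
    return n_jobs * timeout_minutes


minutes = worst_case_poll_minutes(12000)   # ~12000 jobs for ilcprd003
days = minutes / (60 * 24)                 # minutes -> days
print(f"{minutes:.0f} minutes = {days:.1f} days")
```

With 12000 jobs and a 1-minute timeout this gives about 8.3 days for a
single pass over the list.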
We have to understand the reason for these timeouts. As a first measure,
in globus-gma.conf please increase 'tout' to something like 120,
increase 'stateage' to 1200 and set 'statefact' to 4.
--
Cheers,
Andrey Kiryanov.