Hi,
sorry for reviving such an old thread, but I have this problem again...
On 04/22/2010 09:23 AM, Andrey Kiryanov wrote:
>
>> "Also the WMS has a problem with the same DN using different VOMS proxies in parallel: the jobs submitted with one of the proxies to some of the CEs may remain unmonitored for a very long time."
>>
>> I am indeed submitting with three (!) different roles ("pilot" and "production", plus VOMS proxies without role for tests) to the same WMSes.
>> Is there a way to tell whether I am incurring right in this problem? I am submitting to more than ten CEs, but only two of them seem to be affected by it...
>>
> This is a known problem with Condor on WMS. There's a workaround for it on LCG-CEs, but it's not activated by default.
>
> On LCG-CE you need to add `condorfix 1' (without quotes) line to /opt/globus/etc/globus-gma.conf and restart globus-gma daemon with `service globus-gma restart'.
>
The workaround did the trick: it has been working for months on all the
CEs I was submitting to.
Now, on one of those CEs, the jobs for one of the roles ('production') are
stuck in the 'Scheduled' status again, even though the 'condorfix 1' line is
still there. The jobs for the two other roles ('pilot' and 'it/pilot')
are working fine, as are the 'production' jobs submitted by other users.
The problem showed up after the CE queues were drained to
upgrade the WNs' kernel, but no modification was made to the CE itself,
so I can't figure out what has broken...
I attach an extract of the globus-gma.log file. The problematic account
is 'prdatlas003', and the contact string of a stuck job is e.g.
https://atlasce1.lnf.infn.it:20048/21463/1286188484/
As I understand it, the job status should be polled (by which process?)
and written to the /opt/globus/tmp/gram_job_state/ directory on the CE.
Indeed, a file "job.atlasce1.lnf.infn.it.21463.1286188484" is there,
but it's more than 1 day old.
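
For reference, this is roughly how I spot the stale state files. It is only a
minimal sketch: the directory is the one mentioned above, while the "job."
filename prefix filter, the one-day threshold and the function name
stale_state_files are my own assumptions, not anything defined by globus-gma.

```python
import os
import time

# Directory named in the message; the one-day threshold is an assumption.
STATE_DIR = "/opt/globus/tmp/gram_job_state"
MAX_AGE_SECONDS = 24 * 60 * 60


def stale_state_files(state_dir=STATE_DIR, max_age=MAX_AGE_SECONDS, now=None):
    """Return (filename, age_in_seconds) for job.* files older than max_age."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(state_dir)):
        path = os.path.join(state_dir, name)
        # Only regular files whose name starts with "job." are considered.
        if not name.startswith("job.") or not os.path.isfile(path):
            continue
        age = now - os.path.getmtime(path)
        if age > max_age:
            stale.append((name, age))
    return stale


if __name__ == "__main__":
    # Print nothing if the directory does not exist on this host.
    if os.path.isdir(STATE_DIR):
        for name, age in stale_state_files():
            print("%s  last updated %.1f hours ago" % (name, age / 3600.0))
```

This only looks at modification times; it says nothing about why a file
stopped being updated.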
Any hints on what to check?
Thanks,
David
--
David Rebatto
I.N.F.N. - Sezione di Milano
Via Celoria, 16 - 20133 Milano ITALY
tel: +39 02503.17623 e-mail: [log in to unmask]
URL: http://www.mi.infn.it/~rebatto
"There are 10 kinds of people in the world:
those who understand binary and those who don't..."