Hello David,
David Rebatto wrote:
> Now in one of those CE, the jobs for one of the roles ('production') are
> stuck in the 'Scheduled' status again, even if the 'condorfix 1' is
> still there. The jobs for the two other roles ('pilot' and 'it/pilot')
> are working fine, as well as the 'production' jobs submitted by other
> users.
> The problem showed up after the CE queues were drained to
> upgrade the WNs' kernel, but no modification was made to the CE itself,
> so I can't figure out what has been broken...
>
> I attach an extract of the globus-gma.log file. The problematic account
> is 'prdatlas003', and the contact string of a stuck job is e.g.
> https://atlasce1.lnf.infn.it:20048/21463/1286188484/
> In my understanding, the job status should be polled (by which process?)
> and written in the /opt/globus/tmp/gram_job_state/ directory on the CE.
> Actually, a file "job.atlasce1.lnf.infn.it.21463.1286188484" is there,
> but it's more than 1 day old.
>
> Any hint of what to check?
The first thing to check is whether all stuck jobs were mapped to the
same pool account. If so, please remove the .globus and .lcgjm
directories in that pool account's home directory and see whether that
solves the problem.
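
If it helps, here is a rough Python sketch for the first check. It groups
the stale job state files by owning account, so you can see at a glance
whether they all belong to one pool account. It assumes the state files
are owned by the mapped pool account and that their names start with
"job." as in your example; the function name and the 24-hour threshold
are only illustrative.

```python
import os
import pwd
import time
from collections import defaultdict

# Directory quoted in the original message.
STATE_DIR = "/opt/globus/tmp/gram_job_state"


def stale_jobs_by_owner(state_dir=STATE_DIR, max_age_hours=24, now=None):
    """Group job state files older than max_age_hours by owning account.

    Assumption: each state file is owned by the pool account the job
    was mapped to.
    """
    now = time.time() if now is None else now
    stale = defaultdict(list)
    for name in sorted(os.listdir(state_dir)):
        if not name.startswith("job."):
            continue
        st = os.stat(os.path.join(state_dir, name))
        if now - st.st_mtime > max_age_hours * 3600:
            try:
                owner = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                # No passwd entry for this uid: fall back to the number.
                owner = str(st.st_uid)
            stale[owner].append(name)
    return dict(stale)


if __name__ == "__main__" and os.path.isdir(STATE_DIR):
    for owner, names in sorted(stale_jobs_by_owner().items()):
        print(owner, len(names))
```

If only one account (e.g. prdatlas003) shows up in the output, that is
the home directory to clean.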
--
Cheers,
Andrey Kiryanov.