JISCMail - LCG-ROLLOUT Archives

Hi Maarten,

Could someone formally inform all users, via an EGEE broadcast, that there is a bug on the LCG-RB and that jobs shouldn't be cancelled multiple times?
It's not the first time we have seen this behaviour on our RBs, and we have already lost a lot of time to it!

Thanks in advance
Cheers
Christine


-----Original Message-----
From: LHC Computer Grid - Rollout [mailto:[log in to unmask]] On behalf of Maarten Litmaath
Sent: Friday 30 November 2007 14:20
To: [log in to unmask]
Subject: Re: [LCG-ROLLOUT] Jobs sent to my RB stay in Waiting State forever

Yannick Patois wrote:

> # rpm -qa | grep -i condor
> ncm-condorconfig-1.0.2-1
> vdt_globus_jobmanager_condor-VDT1.2.2rh9_LCG-3
> condor-lcg-1.1.0-1
> condor-lcgrb-1.0.0-3
> condor-6.7.10-1
> 
> 
> So I believe condor 6.7.10

No.  Ensure you have this in /opt:

-------------------------------------------------------------------------------
lrwxrwxrwx    1 root     root           13 Feb 12  2007 condor -> condor-20.0.7
-------------------------------------------------------------------------------
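A quick way to see which tree the link points at, as a minimal sketch: the helper name show_link_target is made up for illustration, and it only assumes that readlink (GNU coreutils) is installed.

```shell
# Hypothetical helper: print the target of a symlink, or fail if the
# path is not a symlink at all.
# Example: show_link_target /opt/condor
show_link_target() {
    if [ -L "$1" ]; then
        readlink "$1"
    else
        echo "$1 is not a symlink" >&2
        return 1
    fi
}
```

The rpm listing only shows which packages are installed; the link in /opt shows which installation is actually used.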

> Something I did that seems to have "solved" the problem (for now, let's
> hope), that I got from elsewhere:
> 
> - Stopping the proxy-renewal daemon
> - cd /opt/edg/var/spool/edg-wl-renewd
>   rm -f `ls | grep -E '*\.[0-9]+'`

Beware that such an "rm" may screw up many jobs!
At CERN we have not needed to do that for a long time.

> - Starting the daemon again.
> 
> Don't know why, but it seems to help.
> 
> 
> I also went through all daemons to see if some were stopped (some were)
> and I restarted them. But unfortunately I didn't keep track of exactly
> what I did...

It seems the WM had crashed due to a double cancellation of the same job:
as a side effect the proxy-renewal daemon can get into an infinite loop.

In that case you need to keep stopping the PR and restarting both the PR
and the WM until the WM has proceeded beyond the multiple cancellations:
at each restart it advances by one.
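
The cycle described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name cycle_pr_wm is made up, the three commands are placeholders that the caller must supply (no real init-script names are implied), and "the WM has proceeded beyond the cancellations" is approximated by "the WM restart command reports success", which each site has to define for itself.

```shell
# Sketch: cycle the proxy-renewal daemon (PR) and the WM until the WM
# survives a restart. All commands are caller-supplied placeholders.
#
#   $1  command that stops the PR
#   $2  command that starts the PR
#   $3  command that restarts the WM and returns 0 only if the WM is
#       still running afterwards (the check itself is site-specific)
#   $4  maximum number of cycles (default 20), so the loop always ends
cycle_pr_wm() {
    stop_pr=$1
    start_pr=$2
    restart_wm=$3
    max=${4:-20}
    n=0
    while [ "$n" -lt "$max" ]; do
        n=$((n + 1))
        # Stop and restart the PR; give up if either step fails.
        $stop_pr || return 2
        $start_pr || return 2
        # Each WM restart should advance past one pending cancellation.
        if $restart_wm; then
            echo "WM stayed up after $n cycle(s)"
            return 0
        fi
    done
    echo "WM still failing after $max cycles" >&2
    return 1
}
```

The cap on the number of cycles is a design choice of the sketch, not part of the procedure: it just prevents the script from looping indefinitely if the WM is failing for some other reason.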