Hi,

> -----Original Message-----
> From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
> On Behalf Of Alvise Dorigo
> Sent: 30 September 2011 13:38


> > Restarting tomcat did not seem to fix it, though restarting the node did
> 
> do you mean the CE node ?

Yes, full reboot of the host.

> > (although since the error is apparently the WMS refusing to submit to the
> > CreamCE it is possible that blacklisting expired after my test job after
> > restarting tomcat and before my test job after restarting the node).
> > The glite-cream-ce.log shows connections from the WMS in question after I
> > submit the job (only for delegation) with no apparent failures but no
> > attempt to submit a job. (See attached fragments of the log file)
> 
> Please remember that a CE remain in the Blacklist for 30 minutes (only
> EventQuery is allowed to that CE during this period).
> 

Yes, that's one of the things that makes getting to the bottom of this so hard.

For this current incident I submitted a job to heplnx206.pp.rl.ac.uk via
wms208.cern.ch at 10:19 and had it fail. I then rebooted the node and
submitted another job at 10:31, which succeeded. However, we still saw SAM
test failures via wms208.cern.ch at 11:15 and 12:15 before it succeeded at
13:15 (http://bit.ly/ppz5tW). After the reboot I made no other interventions
on the CreamCE.

Due to other changes at the site, I don't really have the luxury at the moment
of putting the node into unscheduled downtime, restarting services one at a
time, and waiting a couple of hours to see which one fixes the problem.

Yours,
Chris.