Hi,
> -----Original Message-----
> From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
> On Behalf Of Alvise Dorigo
> Sent: 30 September 2011 13:38
> > Restarting tomcat did not seem to fix it, though restarting the node
> did
>
> do you mean the CE node ?
Yes, full reboot of the host.
> > (although since the error is apparently the WMS refusing to submit to
> the
> > CreamCE it is possible that blacklisting expired after my test job
> after
> > restarting tomcat and before my test job after restarting the node).
> > The glite-cream-ce.log shows connections from the WMS in question
> after I
> > submit the job (only for delegation) with no apparent failures but no
> > attempt to submit a job. (See attached fragments of the log file)
>
> Please remember that a CE remain in the Blacklist for 30 minutes (only
> EventQuery is allowed to that CE during this period).
>
Yes, that's one of the things that makes getting to the bottom of this so hard.
For this current incident I submitted a job to heplnx206.pp.rl.ac.uk via
wms208.cern.ch at 10:19 and had it fail. I then rebooted the node and
submitted another job at 10:31, which succeeded. However, we still see SAM
test failures via wms208.cern.ch at 11:15 and 12:15 before it succeeds at
13:15 (http://bit.ly/ppz5tW). After the reboot I made no other interventions
on the CreamCE.
Due to other changes at the site, I don't really have the luxury at the
moment of putting the node into unscheduled downtime, restarting services one
at a time, and waiting a couple of hours each time to see which one fixes the
problem.
Yours,
Chris.