Hello,
Our cream CE, since having a crash the other week, has stopped reporting
job statuses to WMSes. This is causing us to fail ops tests: although
jobs are successfully submitted and a jobid is returned to the WMS, the
WMS never receives word that the job has completed, and so the "JobSubmit"
test eventually times out. Checking the logs shows that the jobs do reach
a "Done-OK" status. I've restarted services, made sure everything is
running (tomcat, BNotifier, BUpdater). Following some anecdotal advice
from this list I even tried rebooting the node, but nothing helps.
The cream CE is an older glite 3.2 version (3.2.10-0) and is due for a
reinstall soon, so it's not worth tearing apart to try to fix it. On
the other hand, we're not in a position to do the upgrade just yet, so we
would like to keep the CE going for a few more weeks. The cream
is in front of an LSF batch system (just to be awkward).
No changes to firewalls or anything like that have been made; the cream
simply crashed (stopped accepting connections, writing to logs or doing
anything) and required a restart of all services.
Any help would be appreciated. It seems like it should be a little thing
that we're missing, but I can't for the life of me think what it is, and
I can't find any answers in the usual places (i.e. google and the
lcg-rollout archives).
Thanks in advance,
Matt