Heya all, once again I come to my peers in search of aid.
One of our CEs at Lancaster (the one in front of an LSF cluster), after
crashing at the weekend, hasn't been right and is consistently failing
the "JobSubmit" tests (although passing all the other tests). The
failures are happening on tests from both nagios servers, and other
sites aren't seeing this problem, so it's definitely us that's bad. The
machine in question is a crusty glite 3.2 cream CE due for a reinstall,
but I wasn't planning on upgrading it for a month (partly due to needing
to understand the risks to the cluster posed by reinstalling a
licence-holding node).
The server in question is running atlas jobs fine, so it's not
inherently broken, and I can't see anything exciting in the logs. The
tests seem to get to the point where a jobid is returned, then the tests
time out after a few hours. Checking the progress of one of these jobs I
see that it lasted a few minutes and completed with a "DONE-OK", and I
see nothing exciting leftover in the sandbox.
I thought that perhaps the lb daemons weren't running, but the bnotifier
and bupdater daemons appear to be doing their job - they're running
and the relevant logs are updating.
Links to the failed tests:
https://gridppnagios.lancs.ac.uk/nagios/cgi-bin/extinfo.cgi?type=2&host=abaddon.hec.lancs.ac.uk&service=org.sam.CREAMCE-JobSubmit-%2Fops%2FRole%3Dlcgadmin
https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/extinfo.cgi?type=2&host=abaddon.hec.lancs.ac.uk&service=org.sam.CREAMCE-JobSubmit-%2Fops%2FRole%3Dlcgadmin
Has anyone had this issue with this test before? I'm fairly stumped and
would greatly appreciate some help or insight!
Thanks in advance,
Matt