Heya all, once again I come to my peers in search of aid.

One of our CEs at Lancaster (the one in front of an LSF cluster), after 
crashing at the weekend, hasn't been right and is consistently failing 
the "JobSubmit" tests (although passing all the other tests). The 
failures are happening on tests from both nagios servers, and other 
sites aren't seeing this problem, so it's definitely us that's bad. The 
machine in question is a crusty glite 3.2 cream CE due for a reinstall, 
but I wasn't planning on upgrading it for a month (partly due to needing to 
understand the risks to the cluster posed by reinstalling a licence 
holding node).

The server in question is running atlas jobs fine, so it's not 
inherently broken, and I can't see anything exciting in the logs. The 
tests seem to get to the point where a jobid is returned, then the tests 
time out after a few hours. Checking the progress of one of these jobs I 
see that it lasted a few minutes and completed with a "DONE-OK", and I 
see nothing exciting leftover in the sandbox.

I thought that perhaps the lb daemons weren't running, but the bnotifier 
and bupdater daemons appear to be doing their job - they're running 
and the relevant logs are updating.

Links to the failed tests:
https://gridppnagios.lancs.ac.uk/nagios/cgi-bin/extinfo.cgi?type=2&host=abaddon.hec.lancs.ac.uk&service=org.sam.CREAMCE-JobSubmit-%2Fops%2FRole%3Dlcgadmin
https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/extinfo.cgi?type=2&host=abaddon.hec.lancs.ac.uk&service=org.sam.CREAMCE-JobSubmit-%2Fops%2FRole%3Dlcgadmin


Has anyone had this issue with this test before? I'm fairly stumped and 
would greatly appreciate some help or insight!

Thanks in advance,
Matt