Hi Matt
Did you get anywhere with this problem?
Jeremy
On 23 Aug 2012, at 12:25, Matt Doidge wrote:
> Heya all, once again I come to my peers in search of aid.
>
> One of our CEs at Lancaster (the one in front of an LSF cluster), after crashing at the weekend, hasn't been right and is consistently failing the "JobSubmit" tests (although passing all the other tests). The failures are happening on tests from both nagios servers, and other sites aren't seeing this problem, so it's definitely us that's bad. The machine in question is a crusty glite 3.2 cream CE due for a reinstall, but I wasn't planning on upgrading it for a month (partly due to needing to understand the risks to the cluster posed by reinstalling a licence-holding node).
>
> The server in question is running atlas jobs fine, so it's not inherently broken, and I can't see anything exciting in the logs. The tests seem to get to the point where a jobid is returned, then the tests time out after a few hours. Checking the progress of one of these jobs, I see that it lasted a few minutes and completed with a "DONE-OK", and I see nothing exciting left over in the sandbox.
>
> I thought that perhaps the lb daemons weren't running, but the bnotifier and bupdater daemons appear to be doing their job - they're running and the relevant logs are updating.
>
> Links to the failed tests:
> https://gridppnagios.lancs.ac.uk/nagios/cgi-bin/extinfo.cgi?type=2&host=abaddon.hec.lancs.ac.uk&service=org.sam.CREAMCE-JobSubmit-%2Fops%2FRole%3Dlcgadmin
> https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/extinfo.cgi?type=2&host=abaddon.hec.lancs.ac.uk&service=org.sam.CREAMCE-JobSubmit-%2Fops%2FRole%3Dlcgadmin
>
>
> Has anyone had this issue with this test before? I'm fairly stumped and would greatly appreciate some help or insight!
>
> Thanks in advance,
> Matt