Hi All,
Just when I thought I had finally (again) got our CREAM CE and Torque/Maui
grid farm working, I have a different problem, or possibly two, this
morning.
(1) The ATLAS and ops Nagios pages both show aborted jobs. The reason is
apparently this:
- Reason = BLAH error: submission command failed (exit code = 1)
(stdout:) (stderr:qsub: Undefined attribute MSG=detected presence of an
unknown attribute-) N/A (jobId = CREAM827234778)
I don't know which attribute qsub is rejecting as undefined, and don't
immediately know where to look.
(2) The WNs auto-updated (perhaps not such a good idea) overnight and
got new versions of Torque and Munge, and now the WNs cannot run pbsnodes:
[root@lcg-wn03 ~]# pbsnodes
munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such
file or directory
pbsnodes: End of File
I don't (as far as I know) use Munge, but it is a dependency of the
Torque RPMs. Strangely, pbsnodes works fine on the CE, which is also the
Torque batch server.
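For anyone wanting to compare a WN with the CE, here is a minimal sketch of
a check for the socket named in the error above. The function name
munge_socket_status is hypothetical, and the path is simply copied from the
munge message; I am assuming that a missing socket means the munged daemon
is not running on that node, which I have not confirmed.

```python
import os
import stat

# Path copied verbatim from the munge error message on the WN.
MUNGE_SOCKET = "/var/run/munge/munge.socket.2"


def munge_socket_status(path=MUNGE_SOCKET):
    """Return 'missing', 'not-a-socket' or 'socket' for the given path.

    Hypothetical helper for illustration; it only inspects the filesystem
    and does not talk to munged or Torque.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        # Same condition munge reports: "No such file or directory".
        return "missing"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


if __name__ == "__main__":
    print(MUNGE_SOCKET, "->", munge_socket_status())
```

Running this on both lcg-wn03 and the CE should show whether the socket
exists on one and not the other.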
Any ideas?
Thanks,
Ben
--
Dr Ben Waugh Tel. +44 (0)20 7679 7223
Dept of Physics and Astronomy Internal: 37223
University College London
London WC1E 6BT