Hi,
I've installed a second CE to route jobs to my new SL4 worker nodes but
haven't put it into my site BDII yet.
In principle the new CE should be identical to my old CE; I haven't
tried any mods yet to send jobs only to the SL4 nodes.
I can su to dteam001, submit jobs to the dteam queue and have them
run OK. I can ssh from the worker nodes to the new CE without needing a
password, and I can run globus-job-run heplnx207.pp.rl.ac.uk /usr/bin/whoami
successfully.
But when I submit a job via edg-job-submit -r heplnx207.pp.rl.ac.uk, it
eventually gets aborted.
edg-job-get-useless-information gives the following reasons:
Event: Done
- exit_code = 1
- host = lcgrb01.gridpp.rl.ac.uk
- reason = Got a job held event, reason: Unspecified gridmanager error
- source = LogMonitor
- src_instance = unique
- status_code = FAILED
- timestamp = Wed Jul 11 15:22:04 2007
- user = /C=UK/O=eScience/OU=CLRC/L=RAL/CN=chris
dteam brew
---
Event: Done
- exit_code = 1
- host = lcgrb01.gridpp.rl.ac.uk
- reason = Job got an error while in the CondorG queue.
- source = LogMonitor
- src_instance = unique
- status_code = FAILED
- timestamp = Wed Jul 11 15:22:16 2007
- user = /C=UK/O=eScience/OU=CLRC/L=RAL/CN=chris
dteam brew
As far as I can tell, it never even gets into PBS, though I do see various
processes being run by dteam001 on the CE and see AuthenticateUser and
StatusJob requests in the pbs_server logs.
Can anyone suggest where to look for logging info on the job submission,
or offer any other help?
Thanks,
Chris.