> It looks rather like there's a problem with each WMS, but inverted: the Glasgow one(s) can't submit properly to CREAM, and the RAL one can't submit properly (maybe MyProxy issues?) to lcg-CEs. I can't find a Pheno ticket that contradicts that theory; anyone got anything along those lines?
And I am still scanning logs to see why pheno jobs sent to our CE01 succeed while those sent to our CE02 don't.
One CE02 job output has been kindly provided and the failure message is (with some minimal context):
> Printing info for the Job : https://svr023.gla.scotgrid.ac.uk:9000/sLFEx9fZF768r58rsarKHA
> Event: RegJob
> - Arrived = Mon Jan 31 20:23:10 2011 GMT
> - Host = svr023.gla.scotgrid.ac.uk
> ....
> Event: Transfer
> - Arrived = Tue Feb 1 05:20:16 2011 GMT
> - Dest host = ce02.dur.scotgrid.ac.uk:2119/jobmanager-lcgpbs
> - Dest instance = /var/glite/logmonitor/CondorG.log/CondorG.1296508192.log
> - Dest jobid = unavailable
> - Destination = LRMS
> - Host = svr023.gla.scotgrid.ac.uk
> ....
> - Job = (queue=q2d)(jobtype=single)(environment=(EDG_WL_JOBID 'https://svr023.gla.scotgrid.ac.uk:9000/sLFEx9fZF768r58rsarKHA'))
> ---
> Event: Done
> - Arrived = Tue Feb 1 05:33:53 2011 GMT
> - Exit code = 1
> - Host = svr023.gla.scotgrid.ac.uk
> - Level = SYSTEM
> - Priority = asynchronous
> - Reason = Got a job held event, reason: Globus error 21: the job manager failed to locate an internal script argument file
In the logs on CE02 I can see that the job was resubmitted about 5 times between 20:23 and 05:20. Also, this page http://www.globus.org/toolkit/docs/5.0/5.0.2/user/ lists some conditions for "error 21", and they don't seem to apply here; or if they do, they should apply equally to CE01 and CE02. But I may be missing something. It may be that the WMSes send slightly different jobs (job wrappers) to our CE01 and CE02.
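In case it helps with comparing CE01 and CE02 jobs side by side, here is a rough Python sketch for pulling the events out of logging output like the one pasted above. It assumes the text has exactly the "Event: ..." / "- key = value" layout shown (with or without the leading "> " from mail quoting); the function names parse_events and failed_done_events are my own, not part of any gLite tool.

```python
import re

# Matches lines such as "- Exit code = 1" or "- Dest host = ce02...".
# Assumption: every attribute line looks like "- <key> = <value>".
ATTR_RE = re.compile(r"^-\s+(.+?)\s+=\s*(.*)$")


def parse_events(text):
    """Split logging-info style output into a list of dicts, one per event."""
    events = []
    current = None
    for raw in text.splitlines():
        # Drop any mail-quote prefix ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        if line.startswith("Event:"):
            current = {"Event": line.split(":", 1)[1].strip()}
            events.append(current)
            continue
        match = ATTR_RE.match(line)
        if match and current is not None:
            current[match.group(1)] = match.group(2)
        # Separator lines ("....", "---") and anything else are ignored.
    return events


def failed_done_events(events):
    """Return (destination, globus_error, reason) for each failed Done event."""
    failures = []
    last_dest = None
    for event in events:
        if event["Event"] == "Transfer":
            # Remember where the job was last sent, so a failure can be
            # attributed to a CE.
            last_dest = event.get("Dest host", last_dest)
        elif event["Event"] == "Done" and event.get("Exit code", "0") != "0":
            reason = event.get("Reason", "")
            code = re.search(r"Globus error (\d+)", reason)
            failures.append((last_dest, code.group(1) if code else None, reason))
    return failures
```

Running parse_events over the saved output of a CE01 job and a CE02 job, and then comparing the "Job" attribute of the Transfer events, might show whether the WMSes really do send different job descriptions to the two CEs.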