Ah yes..... Bit more progress....

****** JobID=[https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM684157713]
Status = [DONE-FAILED]
ExitCode = [W]
FailureReason = [reason=127]

I'll investigate further....

On 02/05/12 17:17, Stuart Purdie wrote:
> On 2 May 2012, at 17:12, emyr.james wrote:
>
>> Hi,
>>
>> I modified /usr/bin/sge_submit.sh to make it dump the run scripts it generates into /tmp.
> Was just drafting something to suggest exactly that!
>
>> The problem is that the run scripts contain this line...
>>
>> #$ -q dteam
>>
>> I.e. jobs are going to the dteam queue which doesn't exist. I've created a queue called grid.q in SGE for all the grid jobs and have the following in my site-info.def...
>>
>> QUEUES="grid.q"
>> GRID_Q_GROUP_ENABLE="ops dteam atlas snoplus"
>>
>> I have no idea where it's getting dteam from....
>> ej59@feynman:~/svn_work/grid/tests$ glite-ce-job-submit -a -r grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-dteam test.jdl
> The 'cream-sge-dteam' part means:
> CREAM
> use the SGE engine type
> send it to the 'dteam' queue
>
> Try
>
> glite-ce-job-submit -a -r grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-grid.q test.jdl
>
> instead
>
>
>> the run script *should* have this line in it...
>>
>> #$ -q grid.q
>>
>> anyone able to shed light on this ?
>>
>> On 02/05/12 16:30, emyr.james wrote:
>>> Hi Daniela,
>>> I switched to using this jdl...
>>>
>>> executable="/bin/sleep";
>>> arguments="1";
>>>
>>> ...and I also cleaned up /etc/passwd and /etc/group (all the grid related stuff was in there twice).
>>>
>>> I'm now getting this in the log...
>>>
>>> 02 May 2012 16:12:15,851 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2411) - (Worker Thread 1) JOB CREAM173523236 STATUS CHANGED: PENDING => ABORTED [failureReason=BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:) N/A (jobId = CREAM173523236)] [localUser=dteam154] [delegationId=65c7415e9d0b602cafef21342fb2bb404eafd9a9]
>>> 02 May 2012 16:13:56,840 INFO org.glite.ce.creamapi.jobmanagement.cmdexecutor.JobSubmissionManager (JobSubmissionManager.java:131) - (TIMER) AcceptNewJobs by script = true
>>>
>>> I followed the link you sent..
>>>
>>> 1.1 and 1.2 are fine
>>>
>>> I'm not sure how to get a valid proxy in /tmp/user.proxy so I can't do this step.
>>>
>>> 1.4 and 1.5 seem fine.
>>>
>>> For 1.6, it works but I see the above log. When I get the status I see this...
>>> ej59@feynman:~/svn_work/grid/tests$ glite-ce-job-submit -a -r grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-dteam test.jdl
>>> https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM246546582
>>> ej59@feynman:~/svn_work/grid/tests$ glite-ce-job-status https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM246546582
>>>
>>> ****** JobID=[https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM246546582]
>>> Status = [ABORTED]
>>> ExitCode = []
>>> FailureReason = [BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:) N/A (jobId = CREAM246546582)]
>>>
>>>
>>> ej59@feynman:~/svn_work/grid/tests$
>>>
>>> So it's failing on job submission. Presumably it's having issues getting the SGE qsub command to work. I've managed to submit jobs to SGE from the box manually by su'ing to a grid user account and running qsub and that worked fine.
>>>
>>> Are there any logs or extra debugging I can enable to get more info on why it's not submitting ?
>>>
>>> Emyr
>>>
>>> On 02/05/12 15:26, Daniela Bauer wrote:
>>>> Hi Emyr,
>>>>
>>>> Usually yaim makes the necessary updates to your /etc/sudoers file
>>>> (there should also be an /etc/sudoers.forcream which is included in
>>>> the standard one), but maybe something went wrong and/or you updated
>>>> your sudoers file otherwise and so removed the changes yaim made ?
>>>>
>>>> I recommend this page:
>>>> https://wiki.italiangrid.it/twiki/bin/view/CREAM/TroubleshootingGuide
>>>>
>>>> Cheers,
>>>> Daniela
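
For reference, the endpoint naming that Stuart describes in the thread (`<host>:<port>/cream-<engine>-<queue>`) can be sketched in Python as below. This is only an illustration of that convention, not part of any gLite or CREAM tool: `CeEndpoint`, `parse_ce_id` and `sge_queue_directive` are hypothetical names, and the sketch assumes the suffix always has exactly the three parts described above, with everything after the second hyphen taken as the queue name.

```python
from typing import NamedTuple


class CeEndpoint(NamedTuple):
    """Parts of a CE endpoint string (hypothetical helper type)."""
    host: str
    port: int
    service: str       # e.g. "cream"
    batch_system: str  # e.g. "sge"
    queue: str         # e.g. "dteam" or "grid.q"


def parse_ce_id(ce_id: str) -> CeEndpoint:
    """Split '<host>:<port>/<service>-<engine>-<queue>' into its parts."""
    address, _, suffix = ce_id.partition("/")
    host, _, port = address.partition(":")
    # Split on the first two hyphens only, so a queue name that itself
    # contains a hyphen is kept whole.
    parts = suffix.split("-", 2)
    if not host or not port.isdigit() or len(parts) != 3:
        raise ValueError(f"unexpected CE endpoint format: {ce_id!r}")
    service, batch_system, queue = parts
    return CeEndpoint(host, int(port), service, batch_system, queue)


def sge_queue_directive(ce_id: str) -> str:
    """Return the '#$ -q' line one would expect in the generated run script."""
    return f"#$ -q {parse_ce_id(ce_id).queue}"


if __name__ == "__main__":
    for ce in (
        "grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-dteam",
        "grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-grid.q",
    ):
        print(ce, "->", sge_queue_directive(ce))
```

Run against the two endpoints from the thread, the first yields the `dteam` queue that does not exist in SGE, and the second yields `grid.q`, matching the line the run script should contain.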