Ah yes.....

Bit more progress....

******  JobID=[https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM684157713]
     Status        = [DONE-FAILED]
     ExitCode      = [W]
     FailureReason = [reason=127]

I'll investigate further....
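
A note on the reason=127 above: the status output does not say where that number comes from. Assuming it is the exit status of a shell command run on the worker node (an assumption, not something the output confirms), the usual shell convention is that 127 means "command not found". A minimal sketch of that convention, using a made-up command name:

```shell
# Run a command name that should not exist (hypothetical name) and
# capture the exit status the shell reports for it.
status=0
sh -c 'no_such_command_xyz_12345' 2>/dev/null || status=$?
echo "exit status: $status"
```

If that reading is right, the thing to look for would be a missing executable or a bad PATH in the job's environment on the worker node.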

On 02/05/12 17:17, Stuart Purdie wrote:
> On 2 May 2012, at 17:12, emyr.james wrote:
>
>> Hi,
>>
>> I modified /usr/bin/sge_submit.sh to make it dump the run scripts it generates into /tmp.
> Was just drafting something to suggest exactly that!
>
>> The problem is that the run scripts contain this line...
>>
>> #$ -q dteam
>>
>> I.e. jobs are going to the dteam queue which doesn't exist. I've created a queue called grid.q in SGE for all the grid jobs and have the following in my site-info.def...
>>
>> QUEUES="grid.q"
>> GRID_Q_GROUP_ENABLE="ops dteam atlas snoplus"
>>
>> I have no idea where it's getting dteam from....
>> ej59@feynman:~/svn_work/grid/tests$ glite-ce-job-submit -a -r grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-dteam test.jdl
> The 'cream-sge-dteam' part breaks down as:
>   cream - the CREAM service
>   sge   - use the SGE engine type
>   dteam - send it to the 'dteam' queue
>
> Try
>
> glite-ce-job-submit -a -r grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-grid.q test.jdl
>
> instead
>
>
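
The naming explained above can be sketched in a few lines of shell. This is only an illustration, assuming the string after -r always has the form host:port/cream-<engine>-<queue>; the variable names are made up for the example.

```shell
# Split a CE ID of the assumed form host:port/cream-<engine>-<queue>.
ce_id="grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-grid.q"

endpoint="${ce_id%%/*}"    # everything before the first "/"
service="${ce_id#*/}"      # everything after the first "/"
rest="${service#cream-}"   # drop the leading "cream-"
lrms="${rest%%-*}"         # engine type: text before the first "-"
queue="${rest#*-}"         # queue name: text after the first "-"

echo "endpoint=$endpoint lrms=$lrms queue=$queue"
```

With the ID from the suggested command this prints endpoint=grid-cream-01.hpc.susx.ac.uk:8443 lrms=sge queue=grid.q, and that queue name is what should end up on the "#$ -q" line of the generated run script.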
>> the run script *should* have this line in it...
>>
>> #$ -q grid.q
>>
>> anyone able to shed light on this ?
>>
>> On 02/05/12 16:30, emyr.james wrote:
>>> Hi Daniela,
>>> I switched to using this jdl...
>>>
>>> executable="/bin/sleep";
>>> arguments="1";
>>>
>>> ...and I also cleaned up /etc/passwd and /etc/group (all the grid related stuff was in there twice).
>>>
>>> I'm now getting this in the log...
>>>
>>> 02 May 2012 16:12:15,851 INFO  org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor (AbstractJobExecutor.java:2411) - (Worker Thread 1) JOB CREAM173523236 STATUS CHANGED: PENDING =>  ABORTED [failureReason=BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:) N/A (jobId = CREAM173523236)] [localUser=dteam154] [delegationId=65c7415e9d0b602cafef21342fb2bb404eafd9a9]
>>> 02 May 2012 16:13:56,840 INFO  org.glite.ce.creamapi.jobmanagement.cmdexecutor.JobSubmissionManager (JobSubmissionManager.java:131) - (TIMER) AcceptNewJobs by script = true
>>>
>>> I followed the link you sent..
>>>
>>> 1.1 and 1.2 are fine
>>>
>>> I'm not sure how to get a valid proxy in /tmp/user.proxy so I can't do step 1.3.
>>>
>>> 1.4 and 1.5 seem fine.
>>>
>>> For 1.6, it works but I see the above log. When I get the status I see this...
>>> ej59@feynman:~/svn_work/grid/tests$ glite-ce-job-submit -a -r grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-dteam test.jdl
>>> https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM246546582
>>> ej59@feynman:~/svn_work/grid/tests$ glite-ce-job-status https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM246546582
>>>
>>> ******  JobID=[https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM246546582]
>>>      Status        = [ABORTED]
>>>      ExitCode      = []
>>>      FailureReason = [BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:) N/A (jobId = CREAM246546582)]
>>>
>>>
>>> ej59@feynman:~/svn_work/grid/tests$
>>>
>>> So it's failing on job submission. Presumably it's having issues getting the SGE qsub command to work. I've managed to submit jobs to SGE from the box manually by su'ing to a grid user account and running qsub and that worked fine.
>>>
>>> Are there any logs or extra debugging I can enable to get more info on why it's not submitting ?
>>>
>>> Emyr
>>>
>>> On 02/05/12 15:26, Daniela Bauer wrote:
>>>> Hi Emyr,
>>>>
>>>> Usually yaim makes the necessary updates to your /etc/sudoers file
>>>> (there should also be an /etc/sudoers.forcream which is included in
>>>> the standard one), but maybe something went wrong, and/or you update
>>>> your sudoers file some other way and so removed the changes yaim made?
>>>>
>>>> I recommend this page:
>>>> https://wiki.italiangrid.it/twiki/bin/view/CREAM/TroubleshootingGuide
>>>>
>>>> Cheers,
>>>> Daniela
>>>>
>>>>
>>>>
>>>>
>>>>