OK... there are files appearing in the home directory of that pool account...
ej59@feynman:/mnt/lustre/grid/users/dteam154$ ls -lt
total 16
-rw-r--r-- 1 70154 70001 81 May 2 17:25 cream_922580502.e804176
-rw-r--r-- 1 70154 70001 0 May 2 17:25 cream_922580502.o804176
-rw-r--r-- 1 70154 70001 137 May 2 17:25 err_cream_922580502_StandardError
-rw-r--r-- 1 70154 70001 0 May 2 17:25 out_cream_922580502_StandardOutput
-rw-r--r-- 1 70154 70001 81 May 2 17:20 cream_684157713.e804175
-rw-r--r-- 1 70154 70001 137 May 2 17:20 err_cream_684157713_StandardError
-rw-r--r-- 1 70154 70001 0 May 2 17:20 out_cream_684157713_StandardOutput
-rw-r--r-- 1 70154 70001 0 May 2 17:20 cream_684157713.o804175
ej59@feynman:/mnt/lustre/grid/users/dteam154$ cat cream_922580502.e804176
chmod: cannot access `./CREAM922580502_jobWrapper.sh': No such file or directory
So cream doesn't seem to be staging the job wrappers correctly. There is
also this...
ej59@feynman:/mnt/lustre/grid/users/dteam154$ cat err_cream_922580502_StandardError
/cm/shared/apps/sge/current/default/spool/node202/job_scripts/804176: line 39: ./CREAM922580502_jobWrapper.sh: No such file or directory
One thing I should mention...
I have the pool account home directories created in our lustre
filesystem, BUT I don't have lustre mounted on the cream box itself, so
the home directories aren't actually visible from there. I guess that
could be the issue?
I can arrange for the cream box to be set up in essentially the same way
as our storm box, which does have that directory mounted, but this is
going to be a little inconvenient. Will I have to do that anyway?
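
For reference, this is roughly how I'd confirm which hosts can actually
see a pool account home directory. It's only a sketch: the helper name
home_visible is made up for illustration, and it assumes the directory
would have the same path on every host as it does on feynman.

```shell
#!/bin/sh
# home_visible DIR
# Hypothetical helper: report whether DIR exists as a directory on this host.
# Prints "visible: DIR" and returns 0, or "missing: DIR" and returns 1.
home_visible() {
    dir="$1"
    if [ -d "$dir" ]; then
        echo "visible: $dir"
    else
        echo "missing: $dir"
        return 1
    fi
}
```

Running home_visible /mnt/lustre/grid/users/dteam154 on the cream box and
on a worker node should show whether the two disagree.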
On 02/05/12 17:24, emyr.james wrote:
> Ah yes.....
>
> Bit more progress....
>
> ****** JobID=[https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM684157713]
> Status = [DONE-FAILED]
> ExitCode = [W]
> FailureReason = [reason=127]
>
> I'll investigate further....
>
> On 02/05/12 17:17, Stuart Purdie wrote:
>> On 2 May 2012, at 17:12, emyr.james wrote:
>>
>>> Hi,
>>>
>>> I modified /usr/bin/sge_submit.sh to make it dump the run scripts it
>>> generates into /tmp.
>> Was just drafting something to suggest exactly that!
>>
>>> The problem is that the run scripts contain this line...
>>>
>>> #$ -q dteam
>>>
>>> I.e. jobs are going to the dteam queue which doesn't exist. I've
>>> created a queue called grid.q in SGE for all the grid jobs and have
>>> the following in my site-info.def...
>>>
>>> QUEUES="grid.q"
>>> GRID_Q_GROUP_ENABLE="ops dteam atlas snoplus"
>>>
>>> I have no idea where it's getting dteam from....
>>> ej59@feynman:~/svn_work/grid/tests$ glite-ce-job-submit -a -r
>>> grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-dteam test.jdl
>> The 'cream-sge-dteam' part means:
>> CREAM
>> use the SGE engine type
>> send it to the 'dteam' queue
>>
>> Try
>>
>> glite-ce-job-submit -a -r
>> grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-grid.q test.jdl
>>
>> instead
>>
>>
>>> the run script *should* have this line in it...
>>>
>>> #$ -q grid.q
>>>
>>> Is anyone able to shed light on this?
>>>
>>> On 02/05/12 16:30, emyr.james wrote:
>>>> Hi Daniela,
>>>> I switched to using this jdl...
>>>>
>>>> executable="/bin/sleep";
>>>> arguments="1";
>>>>
>>>> ...and I also cleaned up /etc/passwd and /etc/group (all the grid
>>>> related stuff was in there twice).
>>>>
>>>> I'm now getting this in the log...
>>>>
>>>> 02 May 2012 16:12:15,851 INFO
>>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>>> (AbstractJobExecutor.java:2411) - (Worker Thread 1) JOB
>>>> CREAM173523236 STATUS CHANGED: PENDING => ABORTED
>>>> [failureReason=BLAH error: submission command failed (exit code =
>>>> 1) (stdout:) (stderr:) N/A (jobId = CREAM173523236)]
>>>> [localUser=dteam154]
>>>> [delegationId=65c7415e9d0b602cafef21342fb2bb404eafd9a9]
>>>> 02 May 2012 16:13:56,840 INFO
>>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.JobSubmissionManager (JobSubmissionManager.java:131)
>>>> - (TIMER) AcceptNewJobs by script = true
>>>>
>>>> I followed the link you sent..
>>>>
>>>> 1.1 and 1.2 are fine
>>>>
>>>> I'm not sure how to get a valid proxy in /tmp/user.proxy so I can't
>>>> do this step.
>>>>
>>>> 1.4 and 1.5 seem fine.
>>>>
>>>> For 1.6, it works but I see the above log. When I get the status I
>>>> see this...
>>>> ej59@feynman:~/svn_work/grid/tests$ glite-ce-job-submit -a -r
>>>> grid-cream-01.hpc.susx.ac.uk:8443/cream-sge-dteam test.jdl
>>>> https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM246546582
>>>> ej59@feynman:~/svn_work/grid/tests$ glite-ce-job-status
>>>> https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM246546582
>>>>
>>>> ******
>>>> JobID=[https://grid-cream-01.hpc.susx.ac.uk:8443/CREAM246546582]
>>>> Status = [ABORTED]
>>>> ExitCode = []
>>>> FailureReason = [BLAH error: submission command failed (exit
>>>> code = 1) (stdout:) (stderr:) N/A (jobId = CREAM246546582)]
>>>>
>>>>
>>>> ej59@feynman:~/svn_work/grid/tests$
>>>>
>>>> So it's failing on job submission. Presumably it's having issues
>>>> getting the SGE qsub command to work. I've managed to submit jobs
>>>> to SGE from the box manually by su'ing to a grid user account and
>>>> running qsub and that worked fine.
>>>>
>>>> Are there any logs or extra debugging I can enable to get more info
>>>> on why it's not submitting?
>>>>
>>>> Emyr
>>>>
>>>> On 02/05/12 15:26, Daniela Bauer wrote:
>>>>> Hi Emyr,
>>>>>
>>>>> Usually yaim makes the necessary updates to your /etc/sudoers file
>>>>> (there should also be an /etc/sudoers.forcream which is included in
>>>>> the standard one), but maybe something went wrong, and/or you updated
>>>>> your sudoers file some other way and so removed the changes yaim made?
>>>>>
>>>>> I recommend this page:
>>>>> https://wiki.italiangrid.it/twiki/bin/view/CREAM/TroubleshootingGuide
>>>>>
>>>>> Cheers,
>>>>> Daniela