Hello Charles,

Yes, that's exactly what I did.  I repeated the steps and followed the gocwiki and your guidelines, but I still have problems submitting jobs. :-(
Here is what I had/have...

**************************************************

[root@ce root]# cat /var/spool/pbs/torque.cfg
SUBMITFILTER /var/spool/pbs/submit_filter.pl
[root@ce root]# ll /var/spool/pbs/torque.cfg
-rw-r--r--    1 root     root           45 Mar  9 10:17 /var/spool/pbs/torque.cfg
[root@ce root]# ll /var/spool/pbs/submit_filter.pl
-rwxr-xr-x    1 root     root         4072 Mar  8 12:46 /var/spool/pbs/submit_filter.pl
[root@ce root]# su - dteam001
[ce] /home/dteam001 > cat testjob.sh
#!/bin/bash

printf "`hostname`: `pwd`: `date`\n"
[ce] /home/dteam001 > qsub testjob.sh
1326.ce.prd.hp.com
[ce] /home/dteam001 > qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1326.ce          testjob.sh       dteam001                0 E short
[ce] /home/dteam001 > cat testjob.sh.e1326
No value for $TERM and no -T specified
No value for $TERM and no -T specified
[ce] /home/dteam001 > cat testjob.sh.o1326
bh-wn0.prd.hp.com: /home/dteam001: Thu Mar 24 11:09:38 AST 2005
[ce] /home/dteam001 >

***************************************************

It seems that it is working, at least locally... But if I try submitting a job from our UI (ui.prd.hp.com), it fails.  The logging info says something like:

***************************************************

Event: Transfer
- dest_host               =    ce.prd.hp.com:2119/jobmanager-pbs
- dest_instance           =    /var/edgwl/logmonitor/CondorG.log/CondorG.1108581538.log
- dest_jobid              =    unavailable
- destination             =    LRMS
- host                    =    rb.prd.hp.com
- reason                  =    Job successfully submitted to Globus
- result                  =    OK
- source                  =    LogMonitor
- src_instance            =    unique
- timestamp               =    Thu Mar 24 15:13:57 2005
- user                    =    /C=PR/O=HP-PR/OU=HPTC/CN=Maniel [log in to unmask]
        ---
 Event: Running
- host                    =    rb.prd.hp.com
- node                    =    ce.prd.hp.com
- source                  =    LogMonitor
- src_instance            =    unique
- timestamp               =    Thu Mar 24 15:16:21 2005
- user                    =    /C=PR/O=HP-PR/OU=HPTC/CN=Maniel [log in to unmask]
        ---
 Event: Done
- exit_code               =    1
- host                    =    rb.prd.hp.com
- reason                  =    Cannot read JobWrapper output, both from Condor and from Maradona.
- source                  =    LogMonitor
- src_instance            =    unique
- status_code             =    FAILED
- timestamp               =    Thu Mar 24 15:16:43 2005
- user                    =    /C=PR/O=HP-PR/OU=HPTC/CN=Maniel [log in to unmask]
        ---
 Event: Resubmission
- host                    =    rb.prd.hp.com
- reason                  =    unavailable
- result                  =    WILLRESUB
- source                  =    LogMonitor
- src_instance            =    unique
- tag                     =    unavailable
- timestamp               =    Thu Mar 24 15:16:43 2005
- user                    =    /C=PR/O=HP-PR/OU=HPTC/CN=Maniel [log in to unmask]
        ---

******************************************************

I checked the gocwiki and found the entry about the jobwrapper output, but it did not seem to fix the problem.

Can somebody help us?

Thank you!
./MS


-----Original Message-----
From: Charles Loomis [mailto:[log in to unmask]]
Sent: Thu 3/24/2005 1:53 AM
To: Sotomayor, Maniel; LHC Computer Grid - Rollout
Subject: Re: aborted jobs
 
Hi Maniel,

As Maarten pointed out, it looks like the submit filter is not correctly
configured.  If you use a submit filter (and you need one for MPI
support), then you must put the full path and name of the script into
the /var/spool/pbs/torque.cfg file on a line like:

SUBMITFILTER /full/path/to/script

From Maarten's debugging it looks like you have this configuration, but
either the script doesn't exist or it has the wrong permissions.  First
check that it exists (an example can be found on the MPI Wiki page).
This script *must* be executable for all users.  There is nothing
sensitive in the script, so permissions like 0755 are best.  If both of
those are OK, then check that the script actually runs correctly.  You
can do this with a simple Torque job submission.  If the qsub produces no
errors, it should be OK.  The filter is actually run by the qsub command
with the user's privileges on every job submission.
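If you want to check the mechanics by hand, something like the following works.  (The /tmp path and the trivial pass-through body are made up for illustration only; the real MPI filter from the wiki rewrites PBS directives.  The key point is that qsub feeds the job script to the filter on stdin and submits whatever comes out on stdout.)

```shell
#!/bin/sh
# Create a minimal pass-through submit filter (illustrative only).
cat > /tmp/submit_filter_test.pl <<'EOF'
#!/usr/bin/perl
# A submit filter reads the job script on stdin and writes the
# (possibly modified) script to stdout.  This one changes nothing.
while (<STDIN>) { print; }
EOF

# Must be executable by all users, since qsub runs it as the submitter.
chmod 0755 /tmp/submit_filter_test.pl

# Run the filter by hand the way qsub would: script on stdin, result on stdout.
echo 'echo hello' | /tmp/submit_filter_test.pl
```

If the script is broken or not executable, this manual run will fail with the same kind of error that qsub would hit silently.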

Cheers.

Cal



Sotomayor, Maniel wrote:
> Hello,
> 
> I'm having problems after submitting jobs to my cluster.  The jobs 
> execute successfully through qsub after installing MPICH, but I'm 
> getting errors about reading the jobwrapper output.  I checked the 
> gocwiki page that talks about it, but have not solved the problem 
> yet.  I'm attaching the logging info output.  Can you help me solve 
> this?
> 
> Sincerely,
> ./MS
>