Hello Charles,
Yes, that's exactly what I did. I repeated the steps, following the gocwiki and your guidelines, but I still have problems submitting jobs. :-(
Here is what I have:
**************************************************
[root@ce root]# cat /var/spool/pbs/torque.cfg
SUBMITFILTER /var/spool/pbs/submit_filter.pl
[root@ce root]# ll /var/spool/pbs/torque.cfg
-rw-r--r-- 1 root root 45 Mar 9 10:17 /var/spool/pbs/torque.cfg
[root@ce root]# ll /var/spool/pbs/submit_filter.pl
-rwxr-xr-x 1 root root 4072 Mar 8 12:46 /var/spool/pbs/submit_filter.pl
[root@ce root]# su - dteam001
[ce] /home/dteam001 > cat testjob.sh
#!/bin/bash
printf "`hostname`: `pwd`: `date`\n"
[ce] /home/dteam001 > qsub testjob.sh
1326.ce.prd.hp.com
[ce] /home/dteam001 > qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1326.ce          testjob.sh       dteam001                0 E short
[ce] /home/dteam001 > cat testjob.sh.e1326
No value for $TERM and no -T specified
No value for $TERM and no -T specified
[ce] /home/dteam001 > cat testjob.sh.o1326
bh-wn0.prd.hp.com: /home/dteam001: Thu Mar 24 11:09:38 AST 2005
[ce] /home/dteam001 >
***************************************************
It seems to be working, at least locally. But if I try submitting a job from our UI (ui.prd.hp.com), it fails. The logging info says something like:
***************************************************
Event: Transfer
- dest_host = ce.prd.hp.com:2119/jobmanager-pbs
- dest_instance = /var/edgwl/logmonitor/CondorG.log/CondorG.1108581538.log
- dest_jobid = unavailable
- destination = LRMS
- host = rb.prd.hp.com
- reason = Job successfully submitted to Globus
- result = OK
- source = LogMonitor
- src_instance = unique
- timestamp = Thu Mar 24 15:13:57 2005
- user = /C=PR/O=HP-PR/OU=HPTC/CN=Maniel [log in to unmask]
---
Event: Running
- host = rb.prd.hp.com
- node = ce.prd.hp.com
- source = LogMonitor
- src_instance = unique
- timestamp = Thu Mar 24 15:16:21 2005
- user = /C=PR/O=HP-PR/OU=HPTC/CN=Maniel [log in to unmask]
---
Event: Done
- exit_code = 1
- host = rb.prd.hp.com
- reason = Cannot read JobWrapper output, both from Condor and from Maradona.
- source = LogMonitor
- src_instance = unique
- status_code = FAILED
- timestamp = Thu Mar 24 15:16:43 2005
- user = /C=PR/O=HP-PR/OU=HPTC/CN=Maniel [log in to unmask]
---
Event: Resubmission
- host = rb.prd.hp.com
- reason = unavailable
- result = WILLRESUB
- source = LogMonitor
- src_instance = unique
- tag = unavailable
- timestamp = Thu Mar 24 15:16:43 2005
- user = /C=PR/O=HP-PR/OU=HPTC/CN=Maniel [log in to unmask]
---
******************************************************
I checked the gocwiki and found the entry about the JobWrapper output, but it did not seem to fix the problem.
Can somebody help us?
Thank you!
./MS
-----Original Message-----
From: Charles Loomis [mailto:[log in to unmask]]
Sent: Thu 3/24/2005 1:53 AM
To: Sotomayor, Maniel; LHC Computer Grid - Rollout
Subject: Re: aborted jobs
Hi Maniel,
As Maarten pointed out, it looks like the submit filter is not correctly
configured. If you use a submit filter (and you need it for MPI
support), then you must put the full path and name of the script into
the /var/spool/pbs/torque.cfg file, in a line like:
SUBMITFILTER /full/path/to/script
From Maarten's debugging it looks like you have this configuration, but
either the script doesn't exist or has the wrong permissions. First
check to see that it exists (an example can be found on the MPI Wiki page).
This script *must* be executable for all users. There is nothing
sensitive in the script, so permissions like 0755 are best. If both of
those are OK, then check that the script actually runs correctly. This
you can do with a simple torque job submission. If the qsub produces no
errors it should be OK. This filter is actually run by the qsub command
with the user's privileges on all job submissions.
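A minimal sketch of those three checks as a shell function, in case it
helps. The name check_submit_filter is made up for illustration and is
not part of Torque. The last step assumes the filter reads the job
script on stdin and signals rejection with a non-zero exit status, so
treat it as a rough check rather than a definitive test.

```shell
# Hypothetical helper (not part of Torque): sanity-check the submit filter
# named in torque.cfg. Usage: check_submit_filter [torque.cfg] [job script]
check_submit_filter() {
    local cfg="${1:-/var/spool/pbs/torque.cfg}"
    local job="${2:-}"
    local filter perms

    # 1. Read the script path from the SUBMITFILTER line.
    filter=$(awk '$1 == "SUBMITFILTER" { print $2; exit }' "$cfg" 2>/dev/null)
    if [ -z "$filter" ]; then
        echo "no SUBMITFILTER line found in $cfg"
        return 1
    fi
    if [ ! -f "$filter" ]; then
        echo "filter script does not exist: $filter"
        return 1
    fi

    # 2. The "other" permission bits must allow read and execute (as in 0755).
    perms=$(ls -ld "$filter" | cut -c8-10)
    case "$perms" in
        r-x|rwx) ;;
        *)
            echo "filter is not readable and executable by all users: $filter"
            return 1
            ;;
    esac

    # 3. Optionally run the filter by hand on a job script.
    #    Assumption: it reads the script on stdin and exits non-zero to reject.
    if [ -n "$job" ]; then
        if ! "$filter" < "$job" > /dev/null; then
            echo "filter exited with an error on $job"
            return 1
        fi
    fi

    echo "ok: $filter"
}
```

Run it as an ordinary user rather than root, for example
"check_submit_filter /var/spool/pbs/torque.cfg testjob.sh" from the
dteam001 account, since qsub invokes the filter with that user's
privileges.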
Cheers.
Cal
Sotomayor, Maniel wrote:
> Hello,
>
> I'm having problems after submitting jobs to my cluster. The jobs
> successfully execute through qsub after installing MPICH. I'm having
> errors when reading jobwrapper output. I checked the gocwiki pages that talk
> about it, but they have not helped me solve it yet. I'm attaching the
> logging info output. Can you help me solve this?
>
> Sincerely,
> ./MS
>