Hi Douglas
Are you saying that the job is submitted correctly to Torque, but it
doesn't run (for whatever reason), while CREAM reports that the job
is running?
This looks like this bug:
https://savannah.cern.ch/bugs/index.php?45717
Can you check whether the PBS log files contain something like "unable to
run job" for that job?
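
A rough sketch of such a check, to be run on the Torque server. It assumes
the server logs live under /var/spool/pbs/server_logs (the directory shown
in your prompt); find_run_failures is just a made-up helper name, not a
Torque command, and the exact wording of the log message may differ on
your installation:

```shell
# Hypothetical helper: look for failed run attempts of one job in the
# Torque server logs.
#   $1 = job id (numeric part is enough, e.g. 2409813)
#   $2 = log directory (assumed default: /var/spool/pbs/server_logs)
find_run_failures() {
    jobid="$1"
    logdir="${2:-/var/spool/pbs/server_logs}"
    # Case-insensitive match on the message, then keep only lines
    # that mention this job id as a fixed string.
    grep -i "unable to run job" "$logdir"/* 2>/dev/null | grep -F "$jobid"
}

# Example call (prints matching log lines, if any):
#   find_run_failures 2409813
```

If this prints nothing, the message is not in the server logs for that job,
at least not with that wording.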
Cheers, Massimo
On Fri, 20 Feb 2009, Douglas McNab wrote:
> Hi,
>
> I am testing a cream ce that I have set up on a scotgrid dev machine.
> The current setup is cream ce and torque/maui on different hosts with the
> logs mounted via NFS on the CE.
>
> The job gets submitted successfully and is traceable in torque. However, it
> moves from Q to W and waits forever. On deeper investigation it looks to me
> like torque or cream thinks the job is actually running on a job slot where
> another totally different job is running. When running pbsnodes for the
> node id and grepping for the cream job id - nothing is returned. Then
> tailing the torque logs there is another dteam job with the same exec_host
> as the cream job. This leads me to think that I may not have set up the
> log parsing correctly on the CE, or that something is getting thoroughly
> confused.
>
> The outputs from various commands to come to this conclusion are listed
> below. Any thoughts on this would be greatly appreciated.
>
> svr016:/var/spool/pbs/server_logs# tracejob 2409813
> /var/spool/pbs/mom_logs/20090220: No such file or directory
> /var/spool/pbs/sched_logs/20090220: No such file or directory
>
> Job: 2409813.svr016.gla.scotgrid.ac.uk
>
> 02/20/2009 14:36:34 S enqueuing into q30m, state 1 hop 1
> 02/20/2009 14:36:34 S Job Queued at request of
> [log in to unmask], owner =
> [log in to unmask], job name =
> cream_520507277, queue = q30m
> 02/20/2009 14:36:34 A queue=q30m
> 02/20/2009 15:19:37 S Job Modified at request of
> [log in to unmask]
> 02/20/2009 15:19:37 S Job Run at request of
> [log in to unmask]
>
>
> svr016:/var/spool/pbs/server_logs# qstat -f 2409813
> Job Id: *2409813.svr016.gla.scotgrid.ac.uk*
> Job_Name = cream_520507277
> Job_Owner = [log in to unmask]
> job_state = W
> queue = q30m
> server = svr016.gla.scotgrid.ac.uk
> Checkpoint = u
> ctime = Fri Feb 20 14:36:34 2009
> Error_Path = dev011.gla.scotgrid.ac.uk:/dev/null
> *exec_host = node182/2*
> Execution_Time = Fri Feb 20 15:49:41 2009
> ......
>
> svr016:~# pbsnodes node182 | grep 2409813
>
> svr016:~# pbsnodes node182
> node182
> state = job-exclusive
> np = 8
> properties = lcgpro
> ntype = cluster
> jobs = 0/2406819.svr016.gla.scotgrid.ac.uk, 1/
> 2409176.svr016.gla.scotgrid.ac.uk, 2/2409262.svr016.gla.scotgrid.ac.uk, 3/
> 2340154.svr016.gla.scotgrid.ac.uk, 4/2354251.svr016.gla.scotgrid.ac.uk, 5/
> 2407443.svr016.gla.scotgrid.ac.uk, 6/2408238.svr016.gla.scotgrid.ac.uk, 7/
> 2407591.svr016.gla.scotgrid.ac.uk
> status = opsys=linux,uname=Linux node182.beowulf.cluster
> 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 13:53:03 CDT 2008 x86_64,sessions=30091
> 1175 1856 9228 11111 20888 22885
> 30853,nsessions=8,nusers=4,idletime=3709813,totmem=5952308kb,availmem=2123760kb,physmem=16438780kb,ncpus=8,loadave=8.04,netload=4294967294,state=free,jobs=
> 2340154.svr016.gla.scotgrid.ac.uk 2354251.svr016.gla.scotgrid.ac.uk
> 2406819.svr016.gla.scotgrid.ac.uk 2407443.svr016.gla.scotgrid.ac.uk
> 2407591.svr016.gla.scotgrid.ac.uk 2408238.svr016.gla.scotgrid.ac.uk
> 2409176.svr016.gla.scotgrid.ac.uk *2409262.svr016.gla.scotgrid.ac.uk
> ,rectime=1235143422*
>
> 02/20/2009 15:20:23;S;*2409262.svr016.gla.scotgrid.ac.uk*;user=dteam166
> group=dteam jobname=STDIN queue=q3d ctime=1235130339 qtime=1235130339
> etime=1235130339 start=1235143223
> [log in to unmask] *exec_host=node182/2*
> Resource_List.cput=72:00:00 Resource_List.neednodes=1
> Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=72:00:00
>
>
> Thanks,
>
> Dug
>
>
--
\\\|///
\\ ~ ~ //
(/ @ @ /)
-------oOOo-(_)-oOOo----------------------------------
Massimo Sgaravatto
INFN Sezione di Padova
Via Marzolo, 8
35131 Padova - Italy
Tel: ++39 0498277047 Fax: ++39 0498277102
oooO E-mail: massimo.sgaravatto [at] pd.infn.it
( ) Oooo Home page: http://www.pd.infn.it/~sgaravat
--------\ (----( )----------------------------------
\_) ) /
(_/