Hi Douglas

Are you saying that the job is submitted correctly to Torque but doesn't 
run (for whatever reason), while CREAM reports that the job is running?

This looks like this bug:

https://savannah.cern.ch/bugs/index.php?45717


Can you check whether the PBS log files contain something like "unable to 
run job" for that job?
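
Something along these lines should do the check. It is only a rough sketch: 
the function name find_run_failures is made up for illustration, the default 
directory is the server_logs path from your prompt, and it assumes the job 
ID appears on the same log line as the message.

```shell
# Hypothetical helper (not part of Torque or CREAM): list server log lines
# that contain "unable to run job" and mention the given job ID.
# Assumption: the job ID and the message are on the same log line.
find_run_failures() {
    jobid="$1"
    # Default log directory taken from the shell prompt in the original mail.
    logdir="${2:-/var/spool/pbs/server_logs}"
    grep -i "unable to run job" "$logdir"/* 2>/dev/null | grep -F "$jobid"
}
```

For the job in question that would be "find_run_failures 2409813". No output 
means no matching line was found in that directory.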

				Cheers, Massimo


On Fri, 20 Feb 2009, Douglas McNab wrote:

> Hi,
> 
> I am testing a cream ce that I have set up on a scotgrid dev machine.
> The current setup is cream ce and torque/maui on different hosts with the
> logs mounted via NFS on the CE.
> 
> The job gets submitted successfully and is traceable in torque.  However, it
> moves from Q to W and waits forever.  On deeper investigation it looks to me
> like torque or cream thinks the job is actually running on a job slot where
> another totally different job is running.  When running pbsnodes for the
> node id and grepping for the cream job id - nothing is returned.  Then
> tailing the torque logs there is another dteam job with the same exec_host
> as the cream job.  This leads me to think that I may not have set up the
> log parsing correctly on the ce or something is getting thoroughly confused.
> 
> The outputs from various commands to come to this conclusion are listed
> below.  Any thoughts on this would be greatly appreciated.
> 
> svr016:/var/spool/pbs/server_logs# *tracejob 2409813*
> /var/spool/pbs/mom_logs/20090220: No such file or directory
> /var/spool/pbs/sched_logs/20090220: No such file or directory
> 
> Job: 2409813.svr016.gla.scotgrid.ac.uk
> 
> 02/20/2009 14:36:34  S    enqueuing into q30m, state 1 hop 1
> 02/20/2009 14:36:34  S    Job Queued at request of
> [log in to unmask], owner =
> [log in to unmask], job name =
>                           cream_520507277, queue = q30m
> 02/20/2009 14:36:34  A    queue=q30m
> 02/20/2009 15:19:37  S    Job Modified at request of
> [log in to unmask]
> 02/20/2009 15:19:37  S    Job Run at request of
> [log in to unmask]
> 
> 
> svr016:/var/spool/pbs/server_logs# *qstat -f 2409813*
> Job Id: *2409813.svr016.gla.scotgrid.ac.uk*
>     Job_Name = cream_520507277
>     Job_Owner = [log in to unmask]
>     job_state = W
>     queue = q30m
>     server = svr016.gla.scotgrid.ac.uk
>     Checkpoint = u
>     ctime = Fri Feb 20 14:36:34 2009
>     Error_Path = dev011.gla.scotgrid.ac.uk:/dev/null
>     *exec_host = node182/2*
>     Execution_Time = Fri Feb 20 15:49:41 2009
>     ......
> 
> svr016:~# *pbsnodes node182 | grep 2409813*
> 
> svr016:~# *pbsnodes node182*
> node182
>      state = job-exclusive
>      np = 8
>      properties = lcgpro
>      ntype = cluster
>      jobs = 0/2406819.svr016.gla.scotgrid.ac.uk, 1/
> 2409176.svr016.gla.scotgrid.ac.uk, 2/2409262.svr016.gla.scotgrid.ac.uk, 3/
> 2340154.svr016.gla.scotgrid.ac.uk, 4/2354251.svr016.gla.scotgrid.ac.uk, 5/
> 2407443.svr016.gla.scotgrid.ac.uk, 6/2408238.svr016.gla.scotgrid.ac.uk, 7/
> 2407591.svr016.gla.scotgrid.ac.uk
>      status = opsys=linux,uname=Linux node182.beowulf.cluster
> 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 13:53:03 CDT 2008 x86_64,sessions=30091
> 1175 1856 9228 11111 20888 22885
> 30853,nsessions=8,nusers=4,idletime=3709813,totmem=5952308kb,availmem=2123760kb,physmem=16438780kb,ncpus=8,loadave=8.04,netload=4294967294,state=free,jobs=
> 2340154.svr016.gla.scotgrid.ac.uk 2354251.svr016.gla.scotgrid.ac.uk
> 2406819.svr016.gla.scotgrid.ac.uk 2407443.svr016.gla.scotgrid.ac.uk
> 2407591.svr016.gla.scotgrid.ac.uk 2408238.svr016.gla.scotgrid.ac.uk
> 2409176.svr016.gla.scotgrid.ac.uk *2409262.svr016.gla.scotgrid.ac.uk
> ,rectime=1235143422*
> 
> 02/20/2009 15:20:23;S;*2409262.svr016.gla.scotgrid.ac.uk*;user=dteam166
> group=dteam jobname=STDIN queue=q3d ctime=1235130339 qtime=1235130339
> etime=1235130339 start=1235143223
> [log in to unmask] *exec_host=node182/2*
> Resource_List.cput=72:00:00 Resource_List.neednodes=1
> Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=72:00:00
> 
> 
> Thanks,
> 
> Dug
> 
> 

-- 
              \\\|///
            \\ ~ ~ //
            (/ @ @ /)
   -------oOOo-(_)-oOOo----------------------------------
                         Massimo Sgaravatto
                         INFN Sezione di Padova
                         Via Marzolo, 8
                         35131 Padova - Italy  
                         Tel: ++39 0498277047   Fax: ++39 0498277102
          oooO           E-mail: massimo.sgaravatto [at] pd.infn.it
          (   )   Oooo   Home page: http://www.pd.infn.it/~sgaravat
   --------\ (----(   )----------------------------------
            \_)    ) /
                  (_/