Hi,

I said this because a qstat -f of job id 2409813 (the cream job) in torque shows an exec_host:

svr016:/var/spool/pbs/server_logs# qstat -f 2409813
Job Id: 2409813.svr016.gla.scotgrid.ac.uk
    Job_Name = cream_520507277
    Job_Owner = [log in to unmask]
    job_state = W
    queue = q30m
    server = svr016.gla.scotgrid.ac.uk
    Checkpoint = u
    ctime = Fri Feb 20 14:36:34 2009
    Error_Path = dev011.gla.scotgrid.ac.uk:/dev/null
    exec_host = node309/3
    Execution_Time = Fri Feb 20 17:22:52 2009
    Hold_Types = n
    Join_Path = n

But if you look at pbsnodes for that node, node309, and grep for the cream job id, 2409813, it returns nothing.

svr016:~# pbsnodes node309 | grep 2409813

Then looking closer at node309 you see this job, 3/2410148.svr016.gla.scotgrid.ac.uk, which is not the cream job:

svr016:~# pbsnodes node309
node309
     state = free
     np = 8
     properties = lcgpro
     ntype = cluster
     jobs = 0/2408249.svr016.gla.scotgrid.ac.uk, 1/2376195.svr016.gla.scotgrid.ac.uk, 2/2408647.svr016.gla.scotgrid.ac.uk, 3/2410148.svr016.gla.scotgrid.ac.uk, 4/2408285.svr016.gla.scotgrid.ac.uk, 5/2371288.svr016.gla.scotgrid.ac.uk, 6/2408028.svr016.gla.scotgrid.ac.uk
     status = opsys=linux,uname=Linux node309.beowulf.cluster 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 13:53:03 CDT 2008 x86_64,sessions=32057 2151 2589 24011 19802 29533 27211,nsessions=7,nusers=4,idletime=955188,totmem=5952308kb,availmem=5594108kb,physmem=16438780kb,ncpus=8,loadave=7.11,netload=4294967294,state=free,jobs=2371288.svr016.gla.scotgrid.ac.uk 2376195.svr016.gla.scotgrid.ac.uk 2408028.svr016.gla.scotgrid.ac.uk 2408249.svr016.gla.scotgrid.ac.uk 2408647.svr016.gla.scotgrid.ac.uk 2408285.svr016.gla.scotgrid.ac.uk 2410148.svr016.gla.scotgrid.ac.uk,rectime=1235149229

Then a quick qstat of 2410148 shows that it's actually a totally different job.

svr016:~# qstat -f 2410148
Job Id: 2410148.svr016.gla.scotgrid.ac.uk
    Job_Name = STDIN
    Job_Owner = [log in to unmask]
    resources_used.cput = 00:04:14
    resources_used.mem = 468532kb
    resources_used.vmem = 2307700kb
    resources_used.walltime = 00:05:51
    job_state = R
    queue = q30m
    server = svr016.gla.scotgrid.ac.uk
    Checkpoint = u
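
As a sanity check, here is a minimal sketch of the slot cross-check done by hand above. The helper name and parsing are mine, and the sample strings are copied verbatim from the qstat and pbsnodes outputs in this mail; the idea is just to look up which job id actually occupies the slot that qstat reports as the cream job's exec_host.

```python
# Hypothetical helper (not part of Torque): given the exec_host from
# `qstat -f` and the `jobs =` line from `pbsnodes`, return the job id
# that actually occupies that slot.

def slot_occupant(exec_host, pbsnodes_jobs):
    """Return the job id in the slot named by exec_host
    (e.g. 'node309/3'), or None if the slot is empty."""
    node, slot = exec_host.split("/")
    for entry in pbsnodes_jobs.split(","):
        s, job = entry.strip().split("/", 1)
        if s == slot:
            return job.split(".")[0]   # strip the server suffix
    return None

exec_host = "node309/3"   # from qstat -f 2409813 above
jobs_line = ("0/2408249.svr016.gla.scotgrid.ac.uk, "
             "1/2376195.svr016.gla.scotgrid.ac.uk, "
             "2/2408647.svr016.gla.scotgrid.ac.uk, "
             "3/2410148.svr016.gla.scotgrid.ac.uk, "
             "4/2408285.svr016.gla.scotgrid.ac.uk, "
             "5/2371288.svr016.gla.scotgrid.ac.uk, "
             "6/2408028.svr016.gla.scotgrid.ac.uk")

print(slot_occupant(exec_host, jobs_line))  # → 2410148, not the cream job 2409813
```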

Perhaps the wait status in Torque is saying that it's waiting for 2410148 to finish before running the cream job - not sure.
It just seems strange.  The cream job has now been on the cluster for 2.5 hours and has been moved through various exec_hosts, but nothing is happening in terms of actually running.
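
For what it's worth, the "Job Run" timestamps in the tracejob output further down this thread are roughly half an hour apart, which is easy to confirm mechanically (timestamps copied from that output; this is just arithmetic, not a Torque tool):

```python
from datetime import datetime

# "Job Run" timestamps from the tracejob output quoted below
runs = ["02/20/2009 15:19:37", "02/20/2009 15:50:24", "02/20/2009 16:21:55"]
ts = [datetime.strptime(s, "%m/%d/%Y %H:%M:%S") for s in runs]

# gaps between successive run attempts, in minutes
gaps = [(b - a).total_seconds() / 60 for a, b in zip(ts, ts[1:])]
print(gaps)  # both gaps are just over 30 minutes
```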

Cheers,

Dug


2009/2/20 Massimo Sgaravatto - INFN Padova <[log in to unmask]>
If a job has been submitted to torque but it is not running, it is correct
that CREAM reports that the job is in IDLE status.
So why did you say "On deeper investigation it looks to me like torque or cream
thinks the job is actually running on a job slot"?


Then we have to understand why the job doesn't want to run ...


                               Cheers, Massimo

On Fri, 20 Feb 2009, Douglas McNab wrote:

> Hi Massimo,
>
> What I have seen is that the job looks to have been submitted to torque,
> moves from queued to a waiting state:
>
> svr016:/var/spool/pbs/server_logs# qstat | grep dteam083
> 2409813.svr016            cream_520507277  dteam083               0 Q q30m
>
> svr016:/var/spool/pbs/server_logs# qstat | grep dteam083
> 2409813.svr016            cream_520507277  dteam083               0 W q30m
>
> and then sits there.  It looks from qstat'ing the job that it tries to
> schedule it to a worker node, sets the exec_host, but the job never runs.
>
> In terms of the cream ce:
>
> -bash-3.00$ glite-ce-job-status
> https://dev011.gla.scotgrid.ac.uk:8443/CREAM520507277
> 2009-02-20 16:27:01,545 WARN - No configuration file suitable for loading.
> Using built-in configuration
>
> ******  JobID=[https://dev011.gla.scotgrid.ac.uk:8443/CREAM520507277]
>     Status        = [IDLE]
>
> It also looks from tracejob that it keeps rescheduling the job for some
> reason.
>
> 02/20/2009 15:19:37  S    Job Modified at request of
> [log in to unmask]
> 02/20/2009 15:19:37  S    Job Run at request of
> [log in to unmask]
> 02/20/2009 15:50:24  S    Job Modified at request of
> [log in to unmask]
> 02/20/2009 15:50:24  S    Job Run at request of
> [log in to unmask]
> 02/20/2009 16:21:55  S    Job Modified at request of
> [log in to unmask]
> 02/20/2009 16:21:55  S    Job Run at request of
> [log in to unmask]
>
> There is no mention of "unable to run job" in the logs.  It looks like it
> never actually gets that far.
> What does the BLParser actually do?
>
> Cheers,
>
> Dug
>
> 2009/2/20 Massimo Sgaravatto - INFN Padova <[log in to unmask]>
>
> > Hi Douglas
> >
> > Are you saying that the job is submitted correctly on torque, but it
> > doesn't run (for whatever reason), and CREAM instead reports that the job
> > is running?
> >
> > This looks like bug
> >
> > https://savannah.cern.ch/bugs/index.php?45717
> >
> >
> > Can you check if in the pbs log files there is something like "unable to run
> > job" for that job?
> >
> >                                Cheers, Massimo
> >
> >
> > On Fri, 20 Feb 2009, Douglas McNab wrote:
> >
> > > Hi,
> > >
> > > I am testing a cream ce that I have set up on a scotgrid dev machine.
> > > The current setup is cream ce and torque/maui on different hosts with the
> > > logs mounted via NFS on the CE.
> > >
> > > The job gets submitted successfully and is traceable in torque.
> > > However, it moves from Q to W and waits forever.  On deeper
> > > investigation it looks to me like torque or cream thinks the job is
> > > actually running on a job slot where another totally different job is
> > > running.  When running pbsnodes for the node id and grepping for the
> > > cream job id - nothing is returned.  Then tailing the torque logs there
> > > is another dteam job with the same exec_host as the cream job.  This
> > > leads me to think that I may not have set up the log parsing correctly
> > > on the ce or something is getting thoroughly confused.
> > >
> > > The outputs from the various commands that led me to this conclusion
> > > are listed below.  Any thoughts on this would be greatly appreciated.
> > >
> > > svr016:/var/spool/pbs/server_logs# tracejob 2409813
> > > /var/spool/pbs/mom_logs/20090220: No such file or directory
> > > /var/spool/pbs/sched_logs/20090220: No such file or directory
> > >
> > > Job: 2409813.svr016.gla.scotgrid.ac.uk
> > >
> > > 02/20/2009 14:36:34  S    enqueuing into q30m, state 1 hop 1
> > > 02/20/2009 14:36:34  S    Job Queued at request of
> > > [log in to unmask], owner =
> > > [log in to unmask], job name =
> > >                           cream_520507277, queue = q30m
> > > 02/20/2009 14:36:34  A    queue=q30m
> > > 02/20/2009 15:19:37  S    Job Modified at request of
> > > [log in to unmask]
> > > 02/20/2009 15:19:37  S    Job Run at request of
> > > [log in to unmask]
> > >
> > >
> > > svr016:/var/spool/pbs/server_logs# qstat -f 2409813
> > > Job Id: 2409813.svr016.gla.scotgrid.ac.uk
> > >     Job_Name = cream_520507277
> > >     Job_Owner = [log in to unmask]
> > >     job_state = W
> > >     queue = q30m
> > >     server = svr016.gla.scotgrid.ac.uk
> > >     Checkpoint = u
> > >     ctime = Fri Feb 20 14:36:34 2009
> > >     Error_Path = dev011.gla.scotgrid.ac.uk:/dev/null
> > >     exec_host = node182/2
> > >     Execution_Time = Fri Feb 20 15:49:41 2009
> > >     ......
> > >
> > > svr016:~# pbsnodes node182 | grep 2409813
> > >
> > > svr016:~# pbsnodes node182
> > > node182
> > >      state = job-exclusive
> > >      np = 8
> > >      properties = lcgpro
> > >      ntype = cluster
> > >      jobs = 0/2406819.svr016.gla.scotgrid.ac.uk, 1/2409176.svr016.gla.scotgrid.ac.uk, 2/2409262.svr016.gla.scotgrid.ac.uk, 3/2340154.svr016.gla.scotgrid.ac.uk, 4/2354251.svr016.gla.scotgrid.ac.uk, 5/2407443.svr016.gla.scotgrid.ac.uk, 6/2408238.svr016.gla.scotgrid.ac.uk, 7/2407591.svr016.gla.scotgrid.ac.uk
> > >      status = opsys=linux,uname=Linux node182.beowulf.cluster 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 13:53:03 CDT 2008 x86_64,sessions=30091 1175 1856 9228 11111 20888 22885 30853,nsessions=8,nusers=4,idletime=3709813,totmem=5952308kb,availmem=2123760kb,physmem=16438780kb,ncpus=8,loadave=8.04,netload=4294967294,state=free,jobs=2340154.svr016.gla.scotgrid.ac.uk 2354251.svr016.gla.scotgrid.ac.uk 2406819.svr016.gla.scotgrid.ac.uk 2407443.svr016.gla.scotgrid.ac.uk 2407591.svr016.gla.scotgrid.ac.uk 2408238.svr016.gla.scotgrid.ac.uk 2409176.svr016.gla.scotgrid.ac.uk 2409262.svr016.gla.scotgrid.ac.uk,rectime=1235143422
> > >
> > > 02/20/2009 15:20:23;S;2409262.svr016.gla.scotgrid.ac.uk;user=dteam166 group=dteam jobname=STDIN queue=q3d ctime=1235130339 qtime=1235130339 etime=1235130339 start=1235143223 owner=[log in to unmask] exec_host=node182/2 Resource_List.cput=72:00:00 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=72:00:00
> > >
> > >
> > > Thanks,
> > >
> > > Dug
> > >
> > >
> >
> > --
> >               \\\|///
> >            \\ ~ ~ //
> >            (/ @ @ /)
> >   -------oOOo-(_)-oOOo----------------------------------
> >                         Massimo Sgaravatto
> >                         INFN Sezione di Padova
> >                         Via Marzolo, 8
> >                         35131 Padova - Italy
> >                         Tel: ++39 0498277047   Fax: ++39 0498277102
> >          oooO           E-mail: massimo.sgaravatto [at] pd.infn.it
> >          (   )   Oooo   Home page: http://www.pd.infn.it/~sgaravat
> >   --------\ (----(   )----------------------------------
> >            \_)    ) /
> >                  (_/
> >
>
>
>
>




--
ScotGrid, Room 481, Kelvin Building, University of Glasgow
tel: +44(0)141 330 6439