Hi,

I said this because a qstat -f of job id 2409813 (the cream job) in torque shows an exec_host:

svr016:/var/spool/pbs/server_logs# qstat -f 2409813
Job Id: 2409813.svr016.gla.scotgrid.ac.uk
    Job_Name = cream_520507277
    Job_Owner = [log in to unmask]
    job_state = W
    queue = q30m
    server = svr016.gla.scotgrid.ac.uk
    Checkpoint = u
    ctime = Fri Feb 20 14:36:34 2009
    Error_Path = dev011.gla.scotgrid.ac.uk:/dev/null
    exec_host = node309/3
    Execution_Time = Fri Feb 20 17:22:52 2009
    Hold_Types = n
    Join_Path = n

But if you look at pbsnodes for that node, node309, and grep for the cream job id, 2409813, it returns nothing.

svr016:~# pbsnodes node309 | grep 2409813

Then looking closer at node309 you see this job, 3/2410148.svr016.gla.scotgrid.ac.uk, which is not the cream job:

svr016:~# pbsnodes node309
node309
     state = free
     np = 8
     properties = lcgpro
     ntype = cluster
     jobs = 0/2408249.svr016.gla.scotgrid.ac.uk, 1/2376195.svr016.gla.scotgrid.ac.uk, 2/2408647.svr016.gla.scotgrid.ac.uk, 3/2410148.svr016.gla.scotgrid.ac.uk, 4/2408285.svr016.gla.scotgrid.ac.uk, 5/2371288.svr016.gla.scotgrid.ac.uk, 6/2408028.svr016.gla.scotgrid.ac.uk
     status = opsys=linux,uname=Linux node309.beowulf.cluster 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 13:53:03 CDT 2008 x86_64,sessions=32057 2151 2589 24011 19802 29533 27211,nsessions=7,nusers=4,idletime=955188,totmem=5952308kb,availmem=5594108kb,physmem=16438780kb,ncpus=8,loadave=7.11,netload=4294967294,state=free,jobs=2371288.svr016.gla.scotgrid.ac.uk 2376195.svr016.gla.scotgrid.ac.uk 2408028.svr016.gla.scotgrid.ac.uk 2408249.svr016.gla.scotgrid.ac.uk 2408647.svr016.gla.scotgrid.ac.uk 2408285.svr016.gla.scotgrid.ac.uk 2410148.svr016.gla.scotgrid.ac.uk,rectime=1235149229

Then a quick qstat of 2410148 shows that it's actually a totally different job.

svr016:~# qstat -f 2410148
Job Id: 2410148.svr016.gla.scotgrid.ac.uk
    Job_Name = STDIN
    Job_Owner = [log in to unmask]
    resources_used.cput = 00:04:14
    resources_used.mem = 468532kb
    resources_used.vmem = 2307700kb
    resources_used.walltime = 00:05:51
    job_state = R
    queue = q30m
    server = svr016.gla.scotgrid.ac.uk
    Checkpoint = u
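
As a sanity check, here is a minimal sketch of the slot cross-check done by hand above. The helper name and parsing are mine, and the sample strings are copied verbatim from the qstat and pbsnodes outputs in this mail; the idea is just to look up which job id actually occupies the slot that qstat reports as the cream job's exec_host.

```python
# Hypothetical helper (not part of Torque): given the exec_host from
# `qstat -f` and the `jobs =` line from `pbsnodes`, return the job id
# that actually occupies that slot.

def slot_occupant(exec_host, pbsnodes_jobs):
    """Return the job id in the slot named by exec_host
    (e.g. 'node309/3'), or None if the slot is empty."""
    node, slot = exec_host.split("/")
    for entry in pbsnodes_jobs.split(","):
        s, job = entry.strip().split("/", 1)
        if s == slot:
            return job.split(".")[0]   # strip the server suffix
    return None

exec_host = "node309/3"   # from qstat -f 2409813 above
jobs_line = ("0/2408249.svr016.gla.scotgrid.ac.uk, "
             "1/2376195.svr016.gla.scotgrid.ac.uk, "
             "2/2408647.svr016.gla.scotgrid.ac.uk, "
             "3/2410148.svr016.gla.scotgrid.ac.uk, "
             "4/2408285.svr016.gla.scotgrid.ac.uk, "
             "5/2371288.svr016.gla.scotgrid.ac.uk, "
             "6/2408028.svr016.gla.scotgrid.ac.uk")

print(slot_occupant(exec_host, jobs_line))  # → 2410148, not the cream job 2409813
```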

Perhaps the wait status in Torque is saying that it's waiting for 2410148 to finish before running the cream job - not sure.
It just seems strange.  The cream job has now been on the cluster for 2.5 hours and has been moved through various exec_hosts, but nothing is happening in terms of actually running.
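
For what it's worth, the "Job Run" timestamps in the tracejob output further down this thread are roughly half an hour apart, which is easy to confirm mechanically (timestamps copied from that output; this is just arithmetic, not a Torque tool):

```python
from datetime import datetime

# "Job Run" timestamps from the tracejob output quoted below
runs = ["02/20/2009 15:19:37", "02/20/2009 15:50:24", "02/20/2009 16:21:55"]
ts = [datetime.strptime(s, "%m/%d/%Y %H:%M:%S") for s in runs]

# gaps between successive run attempts, in minutes
gaps = [(b - a).total_seconds() / 60 for a, b in zip(ts, ts[1:])]
print(gaps)  # both gaps are just over 30 minutes
```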

Cheers,

Dug


2009/2/20 Massimo Sgaravatto - INFN Padova <[log in to unmask]>
If a job has been submitted to torque but it is not running, it is correct
that CREAM reports that the job is in IDLE status.
So why did you say "On deeper investigation it looks to me like torque or cream
thinks the job is actually running on a job slot"?


Then we have to understand why the job doesn't want to run ...


                               Cheers, Massimo

On Fri, 20 Feb 2009, Douglas McNab wrote:

> Hi Massimo,
>
> What I have seen is that the job looks to have been submitted to torque,
> moves from queued to a waiting state:
>
> svr016:/var/spool/pbs/server_logs# qstat | grep dteam083
> 2409813.svr016            cream_520507277  dteam083               0 Q q30m
>
> svr016:/var/spool/pbs/server_logs# qstat | grep dteam083
> 2409813.svr016            cream_520507277  dteam083               0 W q30m
>
> and then sits there.  It looks from qstat'ing the job that it tries to
> schedule it to a worker node, sets the exec_host, but the job never runs.
>
> In terms of the cream ce:
>
> -bash-3.00$ glite-ce-job-status
> https://dev011.gla.scotgrid.ac.uk:8443/CREAM520507277
> 2009-02-20 16:27:01,545 WARN - No configuration file suitable for loading.
> Using built-in configuration
>
> ******  JobID=[https://dev011.gla.scotgrid.ac.uk:8443/CREAM520507277]
>     Status        = [IDLE]
>
> It also looks from tracejob that it keeps rescheduling the job for some
> reason.
>
> 02/20/2009 15:19:37  S    Job Modified at request of
> [log in to unmask]
> 02/20/2009 15:19:37  S    Job Run at request of
> [log in to unmask]
> 02/20/2009 15:50:24  S    Job Modified at request of
> [log in to unmask]
> 02/20/2009 15:50:24  S    Job Run at request of
> [log in to unmask]
> 02/20/2009 16:21:55  S    Job Modified at request of
> [log in to unmask]
> 02/20/2009 16:21:55  S    Job Run at request of
> [log in to unmask]
>
> There is no mention of "unable to run job" in the logs.  It looks like it
> never actually gets that far.
> What does the BLParser actually do?
>
> Cheers,
>
> Dug
>
> 2009/2/20 Massimo Sgaravatto - INFN Padova <[log in to unmask]>
>
> > Hi Douglas
> >
> > Are you saying that the job is submitted correctly on torque, but it
> > doesn't run (for whatever reason), and CREAM instead reports that the job
> > is running?
> >
> > This looks like bug
> >
> > https://savannah.cern.ch/bugs/index.php?45717
> >
> >
> > Can you check if in the pbs log files there is something like "unable to run
> > job" for that job?
> >
> >                                Cheers, Massimo
> >
> >
> > On Fri, 20 Feb 2009, Douglas McNab wrote:
> >
> > > Hi,
> > >
> > > I am testing a cream ce that I have set up on a scotgrid dev machine.
> > > The current setup is cream ce and torque/maui on different hosts with the
> > > logs mounted via NFS on the CE.
> > >
> > > The job gets submitted successfully and is traceable in torque.
> > > However, it moves from Q to W and waits forever.  On deeper
> > > investigation it looks to me like torque or cream thinks the job is
> > > actually running on a job slot where another totally different job is
> > > running.  When running pbsnodes for the node id and grepping for the
> > > cream job id - nothing is returned.  Then tailing the torque logs there
> > > is another dteam job with the same exec_host as the cream job.  This
> > > leads me to think that I may not have set up the log parsing correctly
> > > on the ce or something is getting thoroughly confused.
> > >
> > > The outputs from the various commands that led me to this conclusion
> > > are listed below.  Any thoughts on this would be greatly appreciated.
> > >
> > > svr016:/var/spool/pbs/server_logs# tracejob 2409813
> > > /var/spool/pbs/mom_logs/20090220: No such file or directory
> > > /var/spool/pbs/sched_logs/20090220: No such file or directory
> > >
> > > Job: 2409813.svr016.gla.scotgrid.ac.uk
> > >
> > > 02/20/2009 14:36:34  S    enqueuing into q30m, state 1 hop 1
> > > 02/20/2009 14:36:34  S    Job Queued at request of
> > > [log in to unmask], owner =
> > > [log in to unmask], job name =
> > >                           cream_520507277, queue = q30m
> > > 02/20/2009 14:36:34  A    queue=q30m
> > > 02/20/2009 15:19:37  S    Job Modified at request of
> > > [log in to unmask]
> > > 02/20/2009 15:19:37  S    Job Run at request of
> > > [log in to unmask]
> > >
> > >
> > > svr016:/var/spool/pbs/server_logs# qstat -f 2409813
> > > Job Id: 2409813.svr016.gla.scotgrid.ac.uk
> > >     Job_Name = cream_520507277
> > >     Job_Owner = [log in to unmask]
> > >     job_state = W
> > >     queue = q30m
> > >     server = svr016.gla.scotgrid.ac.uk
> > >     Checkpoint = u
> > >     ctime = Fri Feb 20 14:36:34 2009
> > >     Error_Path = dev011.gla.scotgrid.ac.uk:/dev/null
> > >     exec_host = node182/2
> > >     Execution_Time = Fri Feb 20 15:49:41 2009
> > >     ......
> > >
> > > svr016:~# pbsnodes node182 | grep 2409813
> > >
> > > svr016:~# pbsnodes node182
> > > node182
> > >      state = job-exclusive
> > >      np = 8
> > >      properties = lcgpro
> > >      ntype = cluster
> > >      jobs = 0/2406819.svr016.gla.scotgrid.ac.uk, 1/2409176.svr016.gla.scotgrid.ac.uk, 2/2409262.svr016.gla.scotgrid.ac.uk, 3/2340154.svr016.gla.scotgrid.ac.uk, 4/2354251.svr016.gla.scotgrid.ac.uk, 5/2407443.svr016.gla.scotgrid.ac.uk, 6/2408238.svr016.gla.scotgrid.ac.uk, 7/2407591.svr016.gla.scotgrid.ac.uk
> > >      status = opsys=linux,uname=Linux node182.beowulf.cluster 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 13:53:03 CDT 2008 x86_64,sessions=30091 1175 1856 9228 11111 20888 22885 30853,nsessions=8,nusers=4,idletime=3709813,totmem=5952308kb,availmem=2123760kb,physmem=16438780kb,ncpus=8,loadave=8.04,netload=4294967294,state=free,jobs=2340154.svr016.gla.scotgrid.ac.uk 2354251.svr016.gla.scotgrid.ac.uk 2406819.svr016.gla.scotgrid.ac.uk 2407443.svr016.gla.scotgrid.ac.uk 2407591.svr016.gla.scotgrid.ac.uk 2408238.svr016.gla.scotgrid.ac.uk 2409176.svr016.gla.scotgrid.ac.uk 2409262.svr016.gla.scotgrid.ac.uk,rectime=1235143422
> > >
> > > 02/20/2009 15:20:23;S;2409262.svr016.gla.scotgrid.ac.uk;user=dteam166 group=dteam jobname=STDIN queue=q3d ctime=1235130339 qtime=1235130339 etime=1235130339 start=1235143223 owner=[log in to unmask] exec_host=node182/2 Resource_List.cput=72:00:00 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=72:00:00
> > >
> > >
> > > Thanks,
> > >
> > > Dug
> > >
> > >
> >
> > --
> >               \\\|///
> >            \\ ~ ~ //
> >            (/ @ @ /)
> >   -------oOOo-(_)-oOOo----------------------------------
> >                         Massimo Sgaravatto
> >                         INFN Sezione di Padova
> >                         Via Marzolo, 8
> >                         35131 Padova - Italy
> >                         Tel: ++39 0498277047   Fax: ++39 0498277102
> >          oooO           E-mail: massimo.sgaravatto [at] pd.infn.it
> >          (   )   Oooo   Home page: http://www.pd.infn.it/~sgaravat
> >   --------\ (----(   )----------------------------------
> >            \_)    ) /
> >                  (_/
> >
>
>
>
>




--
ScotGrid, Room 481, Kelvin Building, University of Glasgow
tel: +44(0)141 330 6439