I am not much of an expert on Torque, but as far as I understand (I might be
completely wrong), if a job is not running but you see something in the
"exec_host" field, it is because the job tried to run there.
What does

  checkjob 2409813

report? It should say why the job doesn't want to run.
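For example, the verbose form (a sketch, assuming Maui is the scheduler in
front of Torque on svr016, as in a standard Torque/Maui setup):

  checkjob -v 2409813

should include, near the end of its output, the scheduler's reason why the
job is deferred or cannot start.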
Cheers, Massimo
On Fri, 20 Feb 2009, Douglas McNab wrote:
> Hi,
>
> I said this because a qstat -f of the job id *2409813* for the cream job in
> torque shows an exec_host:
>
> svr016:/var/spool/pbs/server_logs# *qstat -f 2409813*
> Job Id: *2409813*.svr016.gla.scotgrid.ac.uk
> Job_Name = cream_520507277
> Job_Owner = [log in to unmask]
> job_state = W
> queue = q30m
> server = svr016.gla.scotgrid.ac.uk
> Checkpoint = u
> ctime = Fri Feb 20 14:36:34 2009
> Error_Path = dev011.gla.scotgrid.ac.uk:/dev/null
> * exec_host = node309/3*
> Execution_Time = Fri Feb 20 17:22:52 2009
> Hold_Types = n
> Join_Path = n
>
> But if you look at pbsnodes for that node, *node309*, and grep for the cream
> job id, *2409813*, it returns nothing.
>
> svr016:~# *pbsnodes node309 | grep 2409813*
>
> Then looking closer at node309 you see this job: 3/*2410148*.svr016.gla.scotgrid.ac.uk,
> which is not the cream job.
>
> svr016:~# pbsnodes node309
> node309
> state = free
> np = 8
> properties = lcgpro
> ntype = cluster
> jobs = 0/2408249.svr016.gla.scotgrid.ac.uk, 1/2376195.svr016.gla.scotgrid.ac.uk,
> 2/2408647.svr016.gla.scotgrid.ac.uk, 3/*2410148*.svr016.gla.scotgrid.ac.uk,
> 4/2408285.svr016.gla.scotgrid.ac.uk, 5/2371288.svr016.gla.scotgrid.ac.uk,
> 6/2408028.svr016.gla.scotgrid.ac.uk
> status = opsys=linux,uname=Linux node309.beowulf.cluster
> 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 13:53:03 CDT 2008 x86_64,sessions=32057
> 2151 2589 24011 19802 29533
> 27211,nsessions=7,nusers=4,idletime=955188,totmem=5952308kb,availmem=5594108kb,physmem=16438780kb,ncpus=8,loadave=7.11,netload=4294967294,state=free,jobs=
> 2371288.svr016.gla.scotgrid.ac.uk 2376195.svr016.gla.scotgrid.ac.uk
> 2408028.svr016.gla.scotgrid.ac.uk 2408249.svr016.gla.scotgrid.ac.uk
> 2408647.svr016.gla.scotgrid.ac.uk 2408285.svr016.gla.scotgrid.ac.uk
> 2410148.svr016.gla.scotgrid.ac.uk,rectime=1235149229
>
> Then a quick qstat of *2410148* shows that it's actually a totally different
> job.
>
> svr016:~# qstat -f *2410148*
> Job Id: 2410148.svr016.gla.scotgrid.ac.uk
> Job_Name = STDIN
> Job_Owner = [log in to unmask]
> resources_used.cput = 00:04:14
> resources_used.mem = 468532kb
> resources_used.vmem = 2307700kb
> resources_used.walltime = 00:05:51
> job_state = R
> queue = q30m
> server = svr016.gla.scotgrid.ac.uk
> Checkpoint = u
>
> Perhaps the wait status in Torque means it's waiting for 2410148 to finish
> before running the cream job - not sure.
> It just seems strange. The cream job has now been on the cluster for 2.5
> hours and has been moved through various exec_hosts, but nothing is
> happening in terms of actually running.
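>
> For what it's worth, W in Torque should just mean the job is waiting for its
> Execution_Time to be reached, so one quick check is to compare that field
> with the clock, along the lines of
>
>     qstat -f 2409813 | grep -E 'job_state|Execution_Time' ; date
>
> and from the two qstat -f outputs in this thread the server seems to push
> Execution_Time roughly half an hour into the future after each attempt.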
>
> Cheers,
>
> Dug
>
>
> 2009/2/20 Massimo Sgaravatto - INFN Padova <[log in to unmask]>
>
> > If a job has been submitted to torque but it is not running, it is correct
> > that CREAM reports that the job is in IDLE status.
> > So why did you say "On deeper investigation it looks to me like torque or
> > cream
> > thinks the job is actually running on a job slot" ?
> >
> >
> > Then we have to understand why the job doesn't want to run ...
> >
> >
> > Cheers, Massimo
> >
> > On Fri, 20 Feb 2009, Douglas McNab wrote:
> >
> > > Hi Massimo,
> > >
> > > What I have seen is that the job looks to have been submitted to torque
> > > and then moves from queued (Q) to a waiting (W) state:
> > >
> > > svr016:/var/spool/pbs/server_logs# qstat | grep dteam083
> > > 2409813.svr016     cream_520507277  dteam083   0 Q   q30m
> > >
> > > svr016:/var/spool/pbs/server_logs# qstat | grep dteam083
> > > 2409813.svr016     cream_520507277  dteam083   0 W   q30m
> > >
> > > and then sits there. From qstat'ing the job it looks like it tries to
> > > schedule it to a worker node and sets the exec_host, but the job never runs.
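> > >
> > > (The fields I keep checking are roughly
> > >
> > >     qstat -f 2409813 | grep -E 'job_state|exec_host|Execution_Time'
> > >
> > > i.e. the same attributes shown in the full qstat -f output further down.)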
> > >
> > > In terms of the cream ce:
> > >
> > > -bash-3.00$ glite-ce-job-status
> > > https://dev011.gla.scotgrid.ac.uk:8443/CREAM520507277
> > > 2009-02-20 16:27:01,545 WARN - No configuration file suitable for loading.
> > > Using built-in configuration
> > >
> > > ****** JobID=[https://dev011.gla.scotgrid.ac.uk:8443/CREAM520507277]
> > > Status = [IDLE]
> > >
> > > It also looks from tracejob that it keeps rescheduling the job for some
> > > reason.
> > >
> > > 02/20/2009 15:19:37 S Job Modified at request of
> > > [log in to unmask]
> > > 02/20/2009 15:19:37 S Job Run at request of
> > > [log in to unmask]
> > > 02/20/2009 15:50:24 S Job Modified at request of
> > > [log in to unmask]
> > > 02/20/2009 15:50:24 S Job Run at request of
> > > [log in to unmask]
> > > 02/20/2009 16:21:55 S Job Modified at request of
> > > [log in to unmask]
> > > 02/20/2009 16:21:55 S Job Run at request of
> > > [log in to unmask]
> > >
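> > > Presumably the mom log on whichever node is in exec_host at the time would
> > > show whether these run attempts ever reach the worker, e.g. something like
> > >
> > >     ssh node309 grep 2409813 /var/spool/pbs/mom_logs/20090220
> > >
> > > (the mom logs live on the nodes themselves, which is why tracejob run on
> > > the server complains that mom_logs is missing).
> > >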
> > > There is no mention of "unable to run job" in the logs. It looks like it
> > > never actually gets that far.
> > > What does the BLParser actually do?
> > >
> > > Cheers,
> > >
> > > Dug
> > >
> > > 2009/2/20 Massimo Sgaravatto - INFN Padova <[log in to unmask]>
> > >
> > > > Hi Douglas
> > > >
> > > > Are you saying that the job is submitted correctly on torque, but it
> > > > doesn't run (for whatever reason), and CREAM instead reports that the job
> > > > is running ?
> > > >
> > > > This looks like bug
> > > >
> > > > https://savannah.cern.ch/bugs/index.php?45717
> > > >
> > > >
> > > > Can you check if in the pbs log files there is something like "unable to
> > > > run job" for that job ?
> > > >
> > > > Cheers, Massimo
> > > >
> > > >
> > > > On Fri, 20 Feb 2009, Douglas McNab wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am testing a cream ce that I have set up on a scotgrid dev machine.
> > > > > The current setup is cream ce and torque/maui on different hosts with the
> > > > > logs mounted via NFS on the CE.
> > > > >
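> > > > > (As a sanity check of that mount, something like
> > > > >
> > > > >     ls /var/spool/pbs/server_logs/
> > > > >
> > > > > run on the CE should list the same date-stamped log files as on svr016,
> > > > > assuming the mount point mirrors the server's spool layout.)
> > > > >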
> > > > > The job gets submitted successfully and is traceable in torque. However,
> > > > > it moves from Q to W and waits forever. On deeper investigation it looks
> > > > > to me like torque or cream thinks the job is actually running on a job
> > > > > slot where another totally different job is running. When running pbsnodes
> > > > > for the node id and grepping for the cream job id, nothing is returned.
> > > > > Then, tailing the torque logs, there is another dteam job with the same
> > > > > exec_host as the cream job. This leads me to think that I may not have
> > > > > set up the log parsing correctly on the ce, or that something is getting
> > > > > thoroughly confused.
> > > > >
> > > > > The outputs from the various commands used to come to this conclusion are
> > > > > listed below. Any thoughts on this would be greatly appreciated.
> > > > >
> > > > > svr016:/var/spool/pbs/server_logs#* tracejob 2409813*
> > > > > /var/spool/pbs/mom_logs/20090220: No such file or directory
> > > > > /var/spool/pbs/sched_logs/20090220: No such file or directory
> > > > >
> > > > > Job: 2409813.svr016.gla.scotgrid.ac.uk
> > > > >
> > > > > 02/20/2009 14:36:34 S enqueuing into q30m, state 1 hop 1
> > > > > 02/20/2009 14:36:34 S Job Queued at request of
> > > > > [log in to unmask], owner =
> > > > > [log in to unmask], job name =
> > > > > cream_520507277, queue = q30m
> > > > > 02/20/2009 14:36:34 A queue=q30m
> > > > > 02/20/2009 15:19:37 S Job Modified at request of
> > > > > [log in to unmask]
> > > > > 02/20/2009 15:19:37 S Job Run at request of
> > > > > [log in to unmask]
> > > > >
> > > > >
> > > > > svr016:/var/spool/pbs/server_logs# *qstat -f 2409813*
> > > > > Job Id: *2409813.svr016.gla.scotgrid.ac.uk*
> > > > > Job_Name = cream_520507277
> > > > > Job_Owner = [log in to unmask]
> > > > > job_state = W
> > > > > queue = q30m
> > > > > server = svr016.gla.scotgrid.ac.uk
> > > > > Checkpoint = u
> > > > > ctime = Fri Feb 20 14:36:34 2009
> > > > > Error_Path = dev011.gla.scotgrid.ac.uk:/dev/null
> > > > > *exec_host = node182/2*
> > > > > Execution_Time = Fri Feb 20 15:49:41 2009
> > > > > ......
> > > > >
> > > > > svr016:~# *pbsnodes node182 | grep 2409813*
> > > > >
> > > > > svr016:~# *pbsnodes node182*
> > > > > node182
> > > > > state = job-exclusive
> > > > > np = 8
> > > > > properties = lcgpro
> > > > > ntype = cluster
> > > > > jobs = 0/2406819.svr016.gla.scotgrid.ac.uk, 1/2409176.svr016.gla.scotgrid.ac.uk,
> > > > > 2/2409262.svr016.gla.scotgrid.ac.uk, 3/2340154.svr016.gla.scotgrid.ac.uk,
> > > > > 4/2354251.svr016.gla.scotgrid.ac.uk, 5/2407443.svr016.gla.scotgrid.ac.uk,
> > > > > 6/2408238.svr016.gla.scotgrid.ac.uk, 7/2407591.svr016.gla.scotgrid.ac.uk
> > > > > status = opsys=linux,uname=Linux node182.beowulf.cluster
> > > > > 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 13:53:03 CDT 2008 x86_64,
> > > > > sessions=30091 1175 1856 9228 11111 20888 22885 30853,nsessions=8,
> > > > > nusers=4,idletime=3709813,totmem=5952308kb,availmem=2123760kb,
> > > > > physmem=16438780kb,ncpus=8,loadave=8.04,netload=4294967294,state=free,
> > > > > jobs=2340154.svr016.gla.scotgrid.ac.uk 2354251.svr016.gla.scotgrid.ac.uk
> > > > > 2406819.svr016.gla.scotgrid.ac.uk 2407443.svr016.gla.scotgrid.ac.uk
> > > > > 2407591.svr016.gla.scotgrid.ac.uk 2408238.svr016.gla.scotgrid.ac.uk
> > > > > 2409176.svr016.gla.scotgrid.ac.uk *2409262.svr016.gla.scotgrid.ac.uk,rectime=1235143422*
> > > > >
> > > > > 02/20/2009 15:20:23;S;*2409262.svr016.gla.scotgrid.ac.uk*;user=dteam166
> > > > > group=dteam jobname=STDIN queue=q3d ctime=1235130339 qtime=1235130339
> > > > > etime=1235130339 start=1235143223 [log in to unmask]
> > > > > *exec_host=node182/2* Resource_List.cput=72:00:00 Resource_List.neednodes=1
> > > > > Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=72:00:00
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Dug
> > > > >
> > > > >
> > > >
--
\\\|///
\\ ~ ~ //
(/ @ @ /)
-------oOOo-(_)-oOOo----------------------------------
Massimo Sgaravatto
INFN Sezione di Padova
Via Marzolo, 8
35131 Padova - Italy
Tel: ++39 0498277047 Fax: ++39 0498277102
oooO E-mail: massimo.sgaravatto [at] pd.infn.it
( ) Oooo Home page: http://www.pd.infn.it/~sgaravat
--------\ (----( )----------------------------------
\_) ) /
(_/