In principle, reinstalling the RB should not have any effect on the PBS
system whatsoever. Also, as the connection between CE and WNs does not
go through the firewall, I do not think any change there can affect the
PBS system.
I see that all the hanging jobs have now disappeared, but a newly
submitted job ended up in the "Q"ueued state again.
You may want to try restarting all PBS services on the CE (pbs_server,
pbs_sched) and then submitting a job from one of the pool accounts on the CE
(log on to the node as root, su to dteam001, create a small script called
test.job, and then qsub it to PBS). If even this does not work, then you
may try looking for hints in the PBS server log files in the
/var/spool/pbs/server_logs directory or on the WNs in
/var/spool/pbs/mom_logs.
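As a rough sketch of the test I have in mind: the contents of test.job and
the "service" commands in the comments are only my assumptions (your init
scripts may be named differently), so adapt them to your installation. The
submission part is meant to be run as dteam001 after the restart.

```shell
#!/bin/sh
# Sketch only: a minimal local PBS test, to be run on the CE as a pool
# account (e.g. after "su - dteam001").

# Step 1 (as root, before running this script): restart the PBS daemons.
# ASSUMPTION: the init scripts carry the daemon names; check your own setup.
#   service pbs_server restart
#   service pbs_sched restart

# Step 2: create a trivial job script (its contents are illustrative).
cat > test.job <<'EOF'
#!/bin/sh
# Trivial job: report where and when it ran.
hostname
date
EOF

# Step 3: submit the job and show the queue, if the PBS clients are present.
if command -v qsub >/dev/null 2>&1; then
    qsub test.job
    qstat
else
    echo "qsub not found: run this on the CE"
fi

# Step 4: if the job stays in state Q, show the tail of the newest server
# log (reading it may require root).
LOGDIR=/var/spool/pbs/server_logs
if [ -d "$LOGDIR" ]; then
    latest=$(ls -t "$LOGDIR" | head -n 1)
    if [ -n "$latest" ]; then
        tail -n 50 "$LOGDIR/$latest"
    fi
fi
```

If the job runs, the problem is more likely on the grid side than in PBS
itself; if it stays queued, the server and mom logs are the place to look.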
Cheers
Emanuele
"Bly, MJ (Martin)" wrote:
>
> Hi All,
>
> OK, I have prodded our CE.
>
> In the absence of any documentation on where the various bits are kept
> (hint!), I used the big stick approach. This didn't have much effect. It
> appears, however, that rebuilding /etc/ssh/shosts.equiv and
> /etc/ssh/ssh_known_hosts has had an effect, though this could be just
> coincidence.
>
> Jobs are now running through.
>
> Now the question is, why did the CE stop working since late afternoon
> yesterday, when the only local change was a reinstall of our RB? At that
> point, local edg jobs from me worked OK, though outbound remote jobs fail
> for the previously documented reasons.
>
> Martin.
> --
> -------------------------------------------------------
> Martin Bly | +44 1235 446981 | [log in to unmask]
> Systems Admin, Tier 1/A Service, RAL PPD CSG
> -------------------------------------------------------
>
> > -----Original Message-----
> > From: Emanuele LEONARDI [mailto:[log in to unmask]]
> > Sent: Friday, September 19, 2003 9:09 AM
> > To: [log in to unmask]
> > Subject: [LCG-ROLLOUT] RAL PBS system is hanging?
> >
> >
> > As a few jobs I submitted to RAL ended up in a queued status for a very
> > long time, I had a look at the status of pbs on
> > lcgce01.gridpp.rl.ac.uk. This is what I see:
> >
> > (leonardi@adc0014) ~/grid/test> globus-job-run lcgce01.gridpp.rl.ac.uk /usr/bin/pbsnodes -a
> > lcg0001.gridpp.rl.ac.uk
> >     state = free
> >     np = 2
> >     speed = 0
> >     properties = lcgpro
> >     ntype = cluster
> >
> > lcg0002.gridpp.rl.ac.uk
> >     state = free
> >     np = 2
> >     speed = 0
> >     properties = lcgpro
> >     ntype = cluster
> >
> > lcg0003.gridpp.rl.ac.uk
> >     state = free
> >     np = 2
> >     speed = 0
> >     properties = lcgpro
> >     ntype = cluster
> >
> > lcg0004.gridpp.rl.ac.uk
> >     state = free
> >     np = 2
> >     speed = 0
> >     properties = lcgpro
> >     ntype = cluster
> >
> > lcg0005.gridpp.rl.ac.uk
> >     state = free
> >     np = 2
> >     speed = 0
> >     properties = lcgpro
> >     ntype = cluster
> >
> > (leonardi@adc0014) ~/grid/test> globus-job-run lcgce01.gridpp.rl.ac.uk /usr/bin/qstat
> > Job id           Name             User             Time Use S Queue
> > ---------------- ---------------- ---------------- -------- - -----
> > 376.lcgce01      STDIN            alice001                0 Q infinite
> > 377.lcgce01      STDIN            alice001                0 Q long
> > 378.lcgce01      STDIN            dteam004                0 Q short
> > 379.lcgce01      STDIN            dteam004                0 Q short
> > 380.lcgce01      STDIN            dteam004                0 Q short
> > 381.lcgce01      STDIN            dteam004                0 Q short
> > 382.lcgce01      STDIN            dteam004                0 Q short
> > 383.lcgce01      STDIN            dteam004                0 Q short
> > 384.lcgce01      STDIN            dteam004                0 Q short
> > 385.lcgce01      STDIN            dteam004                0 Q short
> > 386.lcgce01      STDIN            dteam004                0 Q short
> > 387.lcgce01      STDIN            dteam004                0 Q short
> > 388.lcgce01      STDIN            dteam004                0 Q short
> > 389.lcgce01      STDIN            dteam004                0 Q short
> > 390.lcgce01      STDIN            dteam004                0 Q short
> > 391.lcgce01      STDIN            dteam003                0 Q short
> >
> > This means that, even though all WNs are free, all incoming jobs are
> > just queued to pbs and never started.
> >
> > Can the RAL site managers take a look at the CE and see what's
> > happening?
> >
> > Thanks, ciao
> >
> > Emanuele
> >
> > --
> > /------------------- Emanuele Leonardi -------------------\
> > | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
> > | IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
> > \---------------------------------------------------------/
> >
--
/------------------- Emanuele Leonardi -------------------\
| eMail: [log in to unmask] - Tel.: +41-22-7674066 |
| IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
\---------------------------------------------------------/