Hi All,
OK, I have prodded our CE.
In the absence of any documentation on where the various bits are kept
(hint!), I used the big stick approach. This didn't have much effect.
It appears, however, that rebuilding /etc/ssh/shosts.equiv and
/etc/ssh/ssh_known_hosts has had an effect, though this could be just
coincidence.
Jobs are now running through.
Now the question is: why did the CE stop working late yesterday
afternoon, when the only local change was a reinstall of our RB? At
that point, local edg jobs from me worked OK, though outbound remote
jobs fail for the previously documented reasons.
Martin.
--
-------------------------------------------------------
Martin Bly | +44 1235 446981 | [log in to unmask]
Systems Admin, Tier 1/A Service, RAL PPD CSG
-------------------------------------------------------
> -----Original Message-----
> From: Emanuele LEONARDI [mailto:[log in to unmask]]
> Sent: Friday, September 19, 2003 9:09 AM
> To: [log in to unmask]
> Subject: [LCG-ROLLOUT] RAL PBS system is hanging?
>
>
> As a few jobs I submitted to RAL ended up in a queued status for a
> very long time, I took a look at the status of PBS on
> lcgce01.gridpp.rl.ac.uk. This is what I see:
>
> (leonardi@adc0014) ~/grid/test> globus-job-run lcgce01.gridpp.rl.ac.uk /usr/bin/pbsnodes -a
> lcg0001.gridpp.rl.ac.uk
>      state = free
>      np = 2
>      speed = 0
>      properties = lcgpro
>      ntype = cluster
>
> lcg0002.gridpp.rl.ac.uk
>      state = free
>      np = 2
>      speed = 0
>      properties = lcgpro
>      ntype = cluster
>
> lcg0003.gridpp.rl.ac.uk
>      state = free
>      np = 2
>      speed = 0
>      properties = lcgpro
>      ntype = cluster
>
> lcg0004.gridpp.rl.ac.uk
>      state = free
>      np = 2
>      speed = 0
>      properties = lcgpro
>      ntype = cluster
>
> lcg0005.gridpp.rl.ac.uk
>      state = free
>      np = 2
>      speed = 0
>      properties = lcgpro
>      ntype = cluster
>
> (leonardi@adc0014) ~/grid/test> globus-job-run lcgce01.gridpp.rl.ac.uk /usr/bin/qstat
> Job id           Name             User             Time Use S Queue
> ---------------- ---------------- ---------------- -------- - -----
> 376.lcgce01      STDIN            alice001                0 Q infinite
> 377.lcgce01      STDIN            alice001                0 Q long
> 378.lcgce01      STDIN            dteam004                0 Q short
> 379.lcgce01      STDIN            dteam004                0 Q short
> 380.lcgce01      STDIN            dteam004                0 Q short
> 381.lcgce01      STDIN            dteam004                0 Q short
> 382.lcgce01      STDIN            dteam004                0 Q short
> 383.lcgce01      STDIN            dteam004                0 Q short
> 384.lcgce01      STDIN            dteam004                0 Q short
> 385.lcgce01      STDIN            dteam004                0 Q short
> 386.lcgce01      STDIN            dteam004                0 Q short
> 387.lcgce01      STDIN            dteam004                0 Q short
> 388.lcgce01      STDIN            dteam004                0 Q short
> 389.lcgce01      STDIN            dteam004                0 Q short
> 390.lcgce01      STDIN            dteam004                0 Q short
> 391.lcgce01      STDIN            dteam003                0 Q short
>
> This means that, even if all WNs are free, all incoming jobs are just
> queued to pbs but they are not started.
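
For anyone who wants to spot this condition without reading the output by
eye, here is a rough sketch in Python. It only parses text shaped like the
pbsnodes and qstat output quoted above; the function names (free_nodes,
queued_jobs, looks_stalled) are made up for illustration, and the column
positions are an assumption based on that output, not on any PBS
documentation. Free nodes plus queued jobs can also be perfectly normal for
a short while, or be caused by queue limits, so treat a positive result as
a prompt to look closer rather than a diagnosis.

```python
import re

# A job line starts with "<number>.<server>", e.g. "376.lcgce01".
JOB_LINE = re.compile(r"^\d+\.\S+\s")


def free_nodes(pbsnodes_text):
    """Return the names of nodes whose state line reads exactly 'free'."""
    nodes = []
    current = None
    for raw in pbsnodes_text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line ends a node record
            continue
        if "=" not in line:
            current = line  # a line without '=' is taken as a node name
        elif current is not None:
            key, _, value = line.partition("=")
            if key.strip() == "state" and value.strip() == "free":
                nodes.append(current)
    return nodes


def queued_jobs(qstat_text):
    """Return the ids of jobs whose state column (assumed 5th field) is 'Q'."""
    jobs = []
    for raw in qstat_text.splitlines():
        line = raw.strip()
        if not JOB_LINE.match(line):
            continue  # skips the header and the dashed separator
        fields = line.split()
        if len(fields) >= 5 and fields[4] == "Q":
            jobs.append(fields[0])
    return jobs


def looks_stalled(pbsnodes_text, qstat_text):
    """True when at least one node is free while at least one job is queued."""
    return bool(free_nodes(pbsnodes_text)) and bool(queued_jobs(qstat_text))
```

Feeding it the captured output of the two globus-job-run commands above as
strings is enough; the sketch deliberately does not run any commands itself.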
>
> Can the RAL site managers take a look at the CE and see what's
> happening?
>
> Thanks, ciao
>
> Emanuele
>
> --
> /------------------- Emanuele Leonardi -------------------\
> | eMail: [log in to unmask] - Tel.: +41-22-7674066 |
> | IT division - Bat.31 2-012 - CERN - CH-1211 Geneva 23 |
> \---------------------------------------------------------/
>