Back to the original question (ahem). What we do here (admittedly with
Torque, but it probably works for plain PBS too) is:
1) stop maui
2) stop PBS
next steps only if the problem job is in 'running' state:
3) log on to the WN that is running this job
4) kill the job and restart PBS on the WN
now go back to the PBS server machine and
5) find the files corresponding to the job ... they are off in
/var/spool/pbs somewhere and have names like 184312.JC or somesuch
where the number is the same as the PBS id.
6) delete the files that have the same number as the bad PBS job
7) restart PBS
8) restart maui
This is assuming that you first went in to the WN as root and restarted
PBS there, and afterwards tried restarting PBS on the server. Those two
actions "revive" some jobs. The longer recipe is a last resort for
completely unrevivable jobs. It wipes all memory of the job from PBS
(except from the log files).
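For steps 5 and 6, here is a rough Python sketch of the "find the files with
the same number" part. It is not what we actually run, just an illustration.
The function names are made up, and it assumes only that the job files live
somewhere under the spool directory and that their names start with the
numeric PBS id followed by a dot. It defaults to a dry run so you can look at
the list before anything is deleted.

```python
import os


def find_job_files(spool_root, job_id):
    """Return files under spool_root whose name starts with '<job_id>.'.

    Assumption: job files are named like 184312.JC, i.e. the numeric PBS id,
    a dot, then some suffix. The trailing dot in the prefix keeps job 184312
    from also matching job 1843120.
    """
    prefix = f"{job_id}."
    matches = []
    for dirpath, _dirnames, filenames in os.walk(spool_root):
        for name in filenames:
            if name.startswith(prefix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def remove_job_files(spool_root, job_id, dry_run=True):
    """List the job's files, and delete them only when dry_run is False."""
    paths = find_job_files(spool_root, job_id)
    for path in paths:
        if dry_run:
            print("would delete", path)
        else:
            os.remove(path)
    return paths


# Example (hypothetical id; run only with PBS and maui stopped):
# remove_job_files("/var/spool/pbs", 184312)                  # just list
# remove_job_files("/var/spool/pbs", 184312, dry_run=False)   # really delete
```

This only looks at regular files, so if your installation keeps per-job
directories as well, those would need handling separately.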
JT
On Mon, 2005-01-17 at 16:36, Maarten Litmaath wrote:
> Steve Traylen wrote:
>
> > On Mon, Jan 17, 2005 at 03:16:00PM -0000 or thereabouts, Burke, S (Stephen) wrote:
> >
> >>Maarten Litmaath said:
> >>
> >>>There is your problem. HP also have a firewall for
> >>>*outgoing* connections,
> >>>and your port range is outside their allowed range...
> >>
> >>Aha ... but how do you know what their range is?
> >>
> >> I guess Steve is reading this anyway, but where am I supposed to report
> >>something like this these days? (Although since RAL is the site, ROC and CIC
> >>that may make it easier :)
> >
> >
> > The problem is with their site, you can't do outbound blocks and use
> > Globus.
>
> You cannot reliably use the current versions of Globus.
>
> That really is a misfeature/bug in Globus and the GridFTP standard,
> that will have to be addressed eventually; port ranges are not needed.