On Thu, May 05, 2005 at 03:11:55PM +0100 or thereabouts, Mark Nelson wrote:
> On Thu, 5 May 2005 15:05:11 +0100
> owen maroney <[log in to unmask]> wrote:
> Hello Owen
>
>
> I've seen that at Durham. I spoke to Steve Traylen about this; I think
> he said it was a race condition between Globus and PBS, but I could be
> wrong, so it would be best to check with him.
The GridMonitor thing that is fired up on the CE by the job manager
sits there and polls for when the job has finished.
If it notices the job has finished it clears up and removes the
working directory from where qsub was run. PBS is a little bit odd
in that it reports the job is finished before the output is copied
back to the working directory from where the job was submitted.
I think the same happens if the user deletes the job (have a look in the
pbs_server log files to check): the job is `qdel`ed and then the directory
is deleted, resulting in the same situation.
So it is harmless, but I do agree it is pretty annoying. Occasionally
the files are not empty, for instance when the job has been deleted for
hitting a wall clock limit. In that case a sensible message appears in
the file.
I have mentioned it to D. Smith, who is aware of it. I am not sure whether
a ticket has ever been put in about it. He is certainly aware that in the
case of qdel there should at least be something to check that the job is
finished as far as PBS is concerned. This would catch most of them.
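
A rough sketch of the kind of check described above, in Python. This is
not the GridMonitor code: the names pbs_knows_job and cleanup_when_safe
are made up for illustration, and it assumes that `qstat <jobid>` exits
non-zero once the server no longer knows the job. A server configured to
keep completed jobs around for a while would need a job-state check instead.

```python
import os
import shutil
import subprocess
import time


def pbs_knows_job(job_id):
    """Hypothetical helper: ask PBS whether it still tracks the job.

    Assumption: `qstat <job_id>` returns a non-zero exit status once the
    server has dropped the job.
    """
    result = subprocess.run(
        ["qstat", job_id],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def cleanup_when_safe(job_id, workdir, expected_files,
                      still_known=pbs_knows_job, timeout=60.0, poll=2.0):
    """Remove workdir only when PBS has dropped the job AND the expected
    output files have been copied back into workdir.

    Returns True if the directory was removed, False if the timeout
    expired first (the directory is then left in place).
    """
    deadline = time.monotonic() + timeout
    while True:
        delivered = all(
            os.path.exists(os.path.join(workdir, name))
            for name in expected_files
        )
        if delivered and not still_known(job_id):
            shutil.rmtree(workdir)
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

The still_known argument is injectable so the logic can be tried without a
PBS installation, e.g. cleanup_when_safe(job_id, workdir, ["batch.out",
"batch.err"], still_known=lambda j: False). Leaving the directory alone on
timeout is a deliberate choice here: a stale directory is easier to live
with than an undelivered output file.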
Steve
>
> Mark.
> > Hi all,
> >
> > We have occasionally been getting these messages from grid jobs that
> > were cancelled not long after they started running in PBS.
> >
> > The output (either stderr or stdout or both) left in
> > /var/spool/pbs/undelivered is invariably a 0-sized file.
> >
> > Has anyone encountered this feature in PBS before: when the jobs
> > produce either no, or a zero sized stdout or stderr file, PBS fails to
> > copy it back to the server?
> >
> > cheers,
> > Owen.
> > ps. Yes, we have checked password free ssh is working between the
> > nodes and the server!
> >
> > -------- Original Message --------
> >
> > PBS Job Id: 1180.gw39.hep.ph.ic.ac.uk
> > Job Name: STDIN
> > File stage in failed, see below.
> > Job will be retried later, please investigate and correct problem.
> > Post job file processing error; job 1180.gw39.hep.ph.ic.ac.uk on host
> > gw33.hep.ph.ic.ac.uk/0
> >
> > Unable to copy file 1180.gw39.h.OU to
> > gw39.hep.ph.ic.ac.uk:/home/dteam011/.lcgjm/globus-cache-export.6d24ZW
> > /batch.out
> > >>> error from copy
> > gw39.hep.ph.ic.ac.uk: Connection refused
> > xport.6d24ZW/batch.out: No such file or directory
> > >>> end error output
> > Output retained on that host in:
> > /var/spool/pbs/undelivered/1180.gw39.h.OU
> >
> > Unable to copy file 1180.gw39.h.ER to
> > gw39.hep.ph.ic.ac.uk:/home/dteam011/.lcgjm/globus-cache-export.6d24ZW
> > /batch.err
> > >>> error from copy
> > gw39.hep.ph.ic.ac.uk: Connection refused
> > xport.6d24ZW/batch.err: No such file or directory
> > >>> end error output
> > Output retained on that host in:
> > /var/spool/pbs/undelivered/1180.gw39.h.ER
> >
> >
> > --
> > =======================================================
> > Dr O J E Maroney # London Tier 2 Technical Co-ordinator
> >
> > Tel. (+44)20 759 47802
> >
> > Imperial College London
> > High Energy Physics Department
> > The Blackett Laboratory
> > Prince Consort Road, London, SW7 2BW
> > ====================================
> >
>
>
>
> -------------------------------------------------------------
> Mark Nelson - [log in to unmask]
>
> IPPP, Department of Physics, University of Durham,
> Science Laboratories, South Road, Durham, DH1 3LE
> Office: +44 (0)191 334 3811, Direct Dial: +44 (0)191 334 3653
>
> This mail is for the addressee only
--
Steve Traylen
[log in to unmask]
http://www.gridpp.ac.uk/