Hi all,
We have occasionally been getting messages like the one below from grid jobs that
were cancelled not long after they started running in PBS.
The output (either stderr or stdout or both) left in
/var/spool/pbs/undelivered is invariably a 0-sized file.
Has anyone encountered this behaviour in PBS before: when a job produces
either no stdout or stderr file, or a zero-sized one, PBS fails to copy it
back to the server?
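In case anyone wants to check their own worker nodes for the same symptom, here is a minimal sketch in Python that lists the zero-sized files in the undelivered directory. The directory path is the one from the message below; the function name zero_sized_files is just made up for illustration, and the script may need to run as a user who can read the PBS spool area.

```python
from pathlib import Path

# Path taken from the PBS message; adjust if your spool directory differs.
UNDELIVERED_DIR = "/var/spool/pbs/undelivered"


def zero_sized_files(directory):
    """Return the sorted names of regular files in `directory` that are 0 bytes."""
    root = Path(directory)
    if not root.is_dir():
        # Nothing to report if the directory does not exist on this host.
        return []
    return sorted(
        p.name for p in root.iterdir()
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    for name in zero_sized_files(UNDELIVERED_DIR):
        print(name)
```

If every file it prints corresponds to a cancelled job, that would be consistent with what we are seeing here.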
cheers,
Owen.
PS. Yes, we have checked that password-free ssh is working between the nodes
and the server!
-------- Original Message --------
PBS Job Id: 1180.gw39.hep.ph.ic.ac.uk
Job Name: STDIN
File stage in failed, see below.
Job will be retried later, please investigate and correct problem.
Post job file processing error; job 1180.gw39.hep.ph.ic.ac.uk on host
gw33.hep.ph.ic.ac.uk/0
Unable to copy file 1180.gw39.h.OU to
gw39.hep.ph.ic.ac.uk:/home/dteam011/.lcgjm/globus-cache-export.6d24ZW/batch.out
>>> error from copy
gw39.hep.ph.ic.ac.uk: Connection refused
xport.6d24ZW/batch.out: No such file or directory
>>> end error output
Output retained on that host in: /var/spool/pbs/undelivered/1180.gw39.h.OU
Unable to copy file 1180.gw39.h.ER to
gw39.hep.ph.ic.ac.uk:/home/dteam011/.lcgjm/globus-cache-export.6d24ZW/batch.err
>>> error from copy
gw39.hep.ph.ic.ac.uk: Connection refused
xport.6d24ZW/batch.err: No such file or directory
>>> end error output
Output retained on that host in: /var/spool/pbs/undelivered/1180.gw39.h.ER
--
=======================================================
Dr O J E Maroney # London Tier 2 Technical Co-ordinator
Tel. (+44)20 759 47802
Imperial College London
High Energy Physics Department
The Blackett Laboratory
Prince Consort Road, London, SW7 2BW
====================================