Hi Ricardo
On Mon, 23 Aug 2004, Ricardo Graciani wrote:
> Could you clarify if the jobs have been restarted or were kept
> running on the WN without problems?
I believe the jobs continued running, but it is not clear that they will
all finish successfully. I do see a few jobs reporting errors from the PBS
job manager, with problems copying files back to the CE, e.g.:
PBS Job Id: 20073.bigmac-lcg-ce.physics.utoronto.ca
Job Name: STDIN
Post job file processing error; job
20073.bigmac-lcg-ce.physics.utoronto.ca on
host wn016/0
Unable to copy file 20073.bigma.OU to
bigmac-lcg-ce.physics.utoronto.ca:/home/lhcb002/.lcgjm/globus-cache-export.hVXe7L/batch.out
>>> error from copy
bigmac-lcg-ce.physics.utoronto.ca: Connection refused
atch.out: No such file or directory
>>> end error output
Output retained on that host in: /var/spool/pbs/undelivered/20073.bigma.OU
Unable to copy file 20073.bigma.ER to
bigmac-lcg-ce.physics.utoronto.ca:/home/lhcb002/.lcgjm/globus-cache-export.hVXe7L/batch.err
>>> error from copy
bigmac-lcg-ce.physics.utoronto.ca: Connection refused
atch.err: No such file or directory
>>> end error output
Output retained on that host in: /var/spool/pbs/undelivered/20073.bigma.ER
> In the meantime there were a number of jobs assigned to Toronto
> that were not schedule (they were stuck in Ready status at the RB). I
> assume they have been now submitted to the CE but will need to check.
I have seen a few new jobs come in this morning, but they are stuck in
the Wait (W) state, and PBS reports a stage-in failure for them:
20465.bigmac-lcg STDIN lhcb002 0 W infinite
20466.bigmac-lcg STDIN lhcb002 0 W infinite
20467.bigmac-lcg STDIN lhcb002 0 W infinite
20468.bigmac-lcg STDIN lhcb002 0 W infinite
PBS Job Id: 20465.bigmac-lcg-ce.physics.utoronto.ca
Job Name: STDIN
File stage in failed, see below.
Job will be retried later, please investigate and correct problem.
Unable to copy file globus-cache-export.8T3ERw.gpg from
bigmac-lcg-ce.physics.utoronto.ca:/home/lhcb002/.lcgjm/globus-cache-export.8T3ERw/globus-cache-export.8T3ERw.gpg
>>> error from copy
bigmac-lcg-ce.physics.utoronto.ca: Connection refused
lobus-cache-export.8T3ERw.gpg: No such file or directory
>>> end error output
I am checking whether there is a configuration problem. I do believe,
though, that short ATLAS jobs have come in and run successfully since the
reboot.
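
Since both the stage-in and the stage-out failures above report
"Connection refused" from the CE, one quick check is whether the CE is
accepting TCP connections on the port used for the copy. Below is a
minimal sketch of such a check, to be run from a WN. It assumes the copy
goes over either rcp (port 514) or scp (port 22); I have not confirmed
which one this PBS installation uses, and the names check_port and
report are only illustrative.

```python
import socket

# Assumption: the CE host name is taken from the PBS messages above.
CE_HOST = "bigmac-lcg-ce.physics.utoronto.ca"

# Assumption: PBS copies files with rcp (rsh "shell" service, port 514)
# or scp (ssh, port 22). Which one is in use here is not confirmed.
CANDIDATE_PORTS = {"rsh/rcp": 514, "ssh/scp": 22}


def check_port(host, port, timeout=5.0):
    """Try a plain TCP connection; return (ok, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers connection refused, timeouts and name-resolution errors.
        return False, str(exc)


def report(host=CE_HOST, ports=CANDIDATE_PORTS):
    """Print one line per candidate port and return the raw results."""
    results = {}
    for name, port in ports.items():
        ok, reason = check_port(host, port)
        results[name] = ok
        status = "open" if ok else "FAILED (" + reason + ")"
        print(name + " port " + str(port) + " on " + host + ": " + status)
    return results
```

Calling report() on wn016 would show whether either service is
reachable. A refused connection on the port PBS actually uses would
point at a service that did not come back on the CE after the reboot,
rather than at the jobs themselves.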
Leslie