Hi there,
we have a mysterious problem.
We run a separate torque server and an lcg-CE.
After an update of the torque server (operating system update), suddenly
all jobs are submitted several times to the system.
A job arrives at the lcg-CE and is passed to the torque server (other
machine). Torque accepts the job and runs it.
The logs both on the CE and torque look normal. But after about half a
minute the same job (same Grid ID) is submitted again to the lcg-CE. And
again and again. The same job is submitted 9 times.
The jobs then fail when trying to copy their output from the WN to the CE:
from WN /var/log/messages:
Nov 3 17:36:03 r1i1n15 pbs_mom: sys_copy, command '/usr/bin/scp -rpB
/var/spool/pbs/spool/5137506.service0.ice.zih.tu-dresden.de.OU
[log in to unmask]:/home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/stdout'
failed with status=1, giving up after 4 attempts
Nov 3 17:36:03 r1i1n15 pbs_mom: req_cpyfile, Unable to copy file
/var/spool/pbs/spool/5137506.service0.ice.zih.tu-dresden.de.OU to
[log in to unmask]:/home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/stdout
The reason is that the directory
/home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/
on the CE does not exist anymore.
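To see how many stage-out failures hit each job directory, the WN log could be summarised with something like the rough Python sketch below. It is only an illustration: the regular expression is modelled on the req_cpyfile message quoted above, it assumes each pbs_mom message sits on a single line in /var/log/messages (not wrapped as in this mail), and the function names are made up for the example.

```python
import re
from collections import Counter

# Assumed message shape, modelled on the req_cpyfile line quoted above.
COPY_FAIL = re.compile(
    r"pbs_mom: req_cpyfile, Unable to copy file (\S+) to (\S+)"
)


def failed_copies(lines):
    """Return (source, destination) pairs for every failed stage-out."""
    pairs = []
    for line in lines:
        match = COPY_FAIL.search(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def failures_per_job_dir(lines):
    """Count failed copies per destination directory on the CE."""
    counts = Counter()
    for _source, dest in failed_copies(lines):
        # Drop the "user@host:" prefix, then the file name (stdout/stderr).
        path = dest.split(":", 1)[-1]
        counts[path.rsplit("/", 1)[0]] += 1
    return counts
```

Feeding it the lines of /var/log/messages from a WN (for example via `open(...)`) gives one counter entry per job directory, which can then be compared with what still exists under ~/.globus/job/ on the CE.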
Does anybody have an idea where to look?
Cheers,
Ralph