Hello Ralph,
> we have a mysterious problem.
> We run a separate torque server and an lcg-CE.
> After an update of the torque server (operating system update), suddenly
> all jobs are submitted several times to the system.
>
> A job arrives at the lcg-CE and is passed to the torque server (other
> machine). Torque accepts the job and runs it.
> The logs both on the CE and torque look normal. But after about half a
> minute the same job (same Grid ID) is submitted again to the lcg-CE. And
> again and again. The same job is submitted 9 times.
>
> The jobs then fail when trying to copy their output from the WN to the CE:
> from WN /var/log/messages:
> Nov 3 17:36:03 r1i1n15 pbs_mom: sys_copy, command '/usr/bin/scp -rpB
> /var/spool/pbs/spool/5137506.service0.ice.zih.tu-dresden.de.OU
> [log in to unmask]:/home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/stdout'
> failed with status=1, giving up after 4 attempts
> Nov 3 17:36:03 r1i1n15 pbs_mom: req_cpyfile, Unable to copy file
> /var/spool/pbs/spool/5137506.service0.ice.zih.tu-dresden.de.OU to
> [log in to unmask]:/home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/stdout
>
> The reason is that the directory
> /home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/
> on the CE does not exist anymore.
A few months ago the CIEMAT site reported a similar problem.
After a lot of debugging effort the LCG-CE was rebooted and
the problem has not returned since: did you try a reboot?
We never understood the cause, but we did provide a version
of the "lcg" job managers that has a protection against
multiple submissions of the same job:
https://savannah.cern.ch/patch/?4331
But your CE is using the standard Globus job managers...
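
For illustration only, the general idea behind such a protection can be
sketched as below. This is not the code from the patch above, and the
names claim_job and lock_dir are hypothetical: the sketch simply assumes
that the job manager can create one marker file per Grid job ID and
refuses to submit when that marker already exists.

```python
import os


def claim_job(grid_id: str, lock_dir: str) -> bool:
    """Return True the first time grid_id is seen, False on any repeat.

    Hypothetical helper: the marker-file layout is an assumption made
    for this sketch, not something taken from the lcg job managers.
    """
    os.makedirs(lock_dir, exist_ok=True)
    # Derive a flat file name from the Grid job ID (assumed naming scheme).
    marker = os.path.join(lock_dir, grid_id.replace("/", "_") + ".submitted")
    try:
        # O_CREAT | O_EXCL makes the creation atomic: only one caller wins.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        # The same Grid ID was already handed to the batch system.
        return False
    os.close(fd)
    return True
```

A job manager would call claim_job() just before passing the job to
torque and skip the submission when it returns False. Note that a guard
like this only suppresses the duplicates; it does not explain why the
job is being resubmitted in the first place.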